Talks and Workshops

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my Mathematics master’s I regularly gave talks on technical topics, and before that I worked as a tutor and technician in a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I’m giving a tutorial called ‘Lies, damned lies and statistics’ at PyData London 2016. I’ll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim is to help those who know either Bayesian statistics or Machine Learning bridge the gap to the other.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup, which was a slightly adjusted version of ‘Map of the Stack‘.

At PyData Amsterdam in March 2016 I gave the second keynote, on a ‘Map of the Stack‘.

PyCon Ireland, From the Lab to the Factory (Dublin, Ireland, October 2015) – I gave a talk on the business side of delivering data products; the trope I used was that it is like ‘going from the lab to the factory’. Based on the feedback the talk was well received, and I gave my audience a collection of tools they could use to solve these challenges.

EuroSciPy 2015 (Cambridge, England Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData in Berlin.
The link is here

The blurb for my PyData Berlin talk is mentioned here.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications such as to Quantitative Finance.
I’ll discuss what probabilistic programming is, why should you care and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to studying the problem of ‘rugby sports analytics’ particularly how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Talk in Berlin at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics‘ – where I presented a case study and introduction to Bayesian Statistics to a technical audience. My case study was the problem of ‘how to predict the winner of the Six Nations’. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular Blog Post which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great method for presenting this technical material.

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and Tech Accelerator. This was an introductory talk to a business audience about ‘Data Science and your business‘. I talked about my experience at different small firms, and large firms and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg. This was on ‘Data Science Models in Production‘, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. This talk was highly successful and I gave a version of it at PyCon Italy, held in Florence, in April 2015. The aim of the talk was to explain what a ‘data product’ is and to discuss some of the challenges of getting data science models into production code. I also talked about the tool choices I made in my own case study. It was high level, well received and got a great response from the audience. Edit: Those interested can see my video here. It was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux I gave a private 5-minute talk on Data Science in the Game industry in July 2014. Here are the slides.

My Mathematical research and talks as a Masters student are all here. I specialized in Statistics and Concentration of Measure. It was from this research that I became interested in Machine Learning and Bayesian Models.

Thesis

My Masters Thesis on ‘Concentration Inequalities and some applications to Statistical Learning Theory‘ is an introduction to the world of Concentration of Measure and VC Theory, which I applied to understanding the generalization error of Econometric Forecasting Models.

Interview with a Data Scientist: Mick Cooney

I’m delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in Finance and more recently in Insurance. He co-ran the Dublin R meetup, which was very successful and helped foster a data science community in Dublin. More recently he’s been working over in London at an Actuarial Consultancy, building out a data science practice.

q1. What project that you have worked on do you wish you could go back
to and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the
forecasts, but it does so by hand-creating LaTeX files that are then
compiled into PDF. One of the first things I would do is switch all
that over to use either ‘knitr’ or ‘rmarkdown’. I would also use more
‘reproducible research’ concepts.

That said, I had worked on the modeling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modeling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem I expect I will be working
on it again in the near future. Hopefully I can apply those lessons
then.

***
q2. What advice do you have for younger analytics professionals and in
particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career, not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to
Meetups, reading about new techniques, watching videos on YouTube and
looking to strengthen areas where you are weak. This is why a natural
interest and curiosity is so invaluable – it makes these necessary
tasks much less of a burden as they are things you would want to do
anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes and reread and
re-watch the books and lectures I find useful. I want these things to
be second nature and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

There are two main things I wish I had learned early on in my career,
and both are connected philosophically. First, I wish I had learned
about probabilistic thinking, risk management, economics and
statistics – you can never learn enough about these fundamental
topics. Secondly, I wish I had learned it is okay to start working
with a bad model that you know is wrong but simple.

To that first point, I spent a long time fighting my natural desire
for a clean, elegant and correct answer to a problem. I would work on
a problem, get to a point that I was confident pointed us in the right
direction, but then realise that ‘proving’ this was right involved a
huge amount of time and effort, assuming it was possible.

I attributed my natural reluctance to pursue this ‘answer’ to
laziness, and felt guilty. I felt I was being unprofessional and
sloppy. But working on forecasting models for trading taught me that
this was not the case. Models are so imperfect, with so many
compromises, that it is often better to think about other things first
– what are the limitations of the model in practice, what is it
saying, how are you going to use it. Answer those questions first,
THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.
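
As a rough illustration of how cheap such simple models are, here is a
minimal Python sketch. It is not taken from any of the projects
mentioned above: the data is a synthetic series invented for the
example, and the two "models" are the simplest forecasts available,
the training mean and the previous value. Their errors give a
yardstick that anything more elaborate has to beat.

```python
import random
import statistics


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


# Synthetic, persistent series (an AR(1) process); purely illustrative data.
rng = random.Random(1)
series = [0.0]
for _ in range(1999):
    series.append(0.9 * series[-1] + rng.gauss(0.0, 1.0))

train, test = series[:1000], series[1000:]

# Baseline 1: always predict the mean of the training data.
mean_forecast = [statistics.fmean(train)] * len(test)

# Baseline 2: predict that each value equals the previously observed value.
previous_values = [train[-1]] + test[:-1]

mean_mae = mean_absolute_error(test, mean_forecast)
naive_mae = mean_absolute_error(test, previous_values)
print(f"training-mean baseline MAE: {mean_mae:.2f}")
print(f"last-value baseline MAE:    {naive_mae:.2f}")
```

Both baselines take a line or two to write, and comparing them already
says something about the data: on a persistent series like this one
the last-value forecast should do noticeably better than the mean.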

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size,
in-memory, on-disk and finally the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriate sampling of your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.
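
The sampling argument above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the "large"
data set is synthetic, and the function name
`sample_mean_with_error` is made up for the example. It draws a simple
random sample, estimates the mean, and reports a standard error so the
estimate can be compared with the full-data answer.

```python
import random
import statistics


def sample_mean_with_error(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and its estimated standard error.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_error


# Hypothetical "large" data set: 200,000 synthetic values.
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

estimate, std_error = sample_mean_with_error(population, sample_size=10_000)
true_mean = statistics.fmean(population)
print(f"sample estimate: {estimate:.2f} +/- {std_error:.2f}")
print(f"full-data mean:  {true_mean:.2f}")
```

Here a 5% sample should land within a few standard errors of the
full-data mean. Whether plain random sampling is appropriate for a
real problem depends on the data; rare events or heavy tails may call
for stratified sampling instead.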

Once decided on a solution, putting the model into production and
scaling it for your business is a major issue, but is a problem more
belonging to the realm of network and software engineering. That said,
it is important that people with a solid understanding of the
concepts stay involved, just in case some ‘optimisations’ ruin the
output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in ‘The Fog of War’ mentioned that you should never
answer the question asked but instead answer the question you wanted
to be asked, so with your forbearance I will first answer a liberal
interpretation of that question: what work gets me excited?

The short answer to that question is all sorts of things do, but they
are often small things related to work I am doing. In the last few
months, I was excited to try out dataexpks (a data exploration package
I am co-creating) on a brand new data set to see what it showed me and
how well my code worked. I love thinking of ways to use Monte Carlo
simulation to test the output of various regression models, and over
Christmas I was fascinated by a short project trying out methods of
investigating differences between a subpopulation within a larger
population.
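
One way to use Monte Carlo simulation to test a regression model can
be sketched as follows. This is an assumption about what such a check
might look like, not the code referred to above: simulate many data
sets from a line with known parameters, refit the regression each
time, and check that the fitted slopes centre on the true value. The
function names and all parameter values are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def simulate_slopes(true_intercept, true_slope, noise_sd, n_points, n_sims, seed=0):
    """Refit the regression on many simulated data sets with known parameters."""
    rng = random.Random(seed)
    xs = [i / (n_points - 1) for i in range(n_points)]  # fixed design on [0, 1]
    slopes = []
    for _ in range(n_sims):
        ys = [true_intercept + true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        _, slope = fit_line(xs, ys)
        slopes.append(slope)
    return slopes


slopes = simulate_slopes(true_intercept=1.0, true_slope=2.0, noise_sd=0.5,
                         n_points=50, n_sims=2000)
mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
print(f"mean fitted slope: {mean_slope:.3f} (true value 2.0)")
print(f"spread of fitted slopes: {sd_slope:.3f}")
```

If the average fitted slope drifts away from the true value, or the
spread is far from what the model's reported standard errors suggest,
that points to a problem in either the model or the code.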

I am fascinated by new ways to learn the fundamentals – there are a
few excellent ones out there and I read them all the time. I can never
learn enough as in my experience reality tends to present us with
basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended, I think the
advances in reinforcement learning techniques probably have the
biggest potential – some of the Atari gameplaying from DeepMind was
eye-opening. Sadly, if history is any guide, much of it will prove to
be hype, but I imagine some very interesting results to come from the
work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what I
do or how to articulate it. I have had the good fortune to help a lot
of people with their projects and problems, exposing me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs, articles and subscribe to mailing
lists. While rarely having the time to read all this, often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the problem:
what is being asked? Do we have any data? What does it look like?
Are there other data available we can use to enrich or use as a
substitute?

Going through that process will suggest approaches to use, and at that
point I draw upon previous experience, however tangential to the
problem.

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making on that note, I think there is a lot
of implicit knowledge in this field, and a number of people looking
for help have told me they feel overwhelmed by the sheer amount of
knowledge they think they need.

I do not think this is true, but I understand its origin: there are so
many different aspects to working with data it is tough to know where
to start. I always start very simple, but as I mentioned earlier, it
took a lot of time, thought and effort to get to that point, and it is
not easy to explain these ideas in theory – you have to work on a
number of different datasets to get a feel for how to do this.

As a result, I believe mentoring or apprenticeships are an
effective way to teach people – more
experienced analysts can guide junior members around the various
pitfalls and traps that are easy to fall into. It allows us to
illustrate that fancy and sophisticated techniques and algorithms are
not needed to do interesting work – some of the most interesting work
I have seen involved little more than summary statistics along with
basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest
book I read that talks about this is “Data Analysis Using Regression
and Multilevel/Hierarchical Models” by Gelman and Hill, stressing the
importance of starting from simple models. I would love to know if
there are more.

That said, I could only appreciate the point because I was already
experienced; a younger version of myself would have missed the
point. It would not have occurred to me that the right way to do
something is to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.

Working in a major trend – Machine Learning

I recently saw this in the latest Amazon shareholder letter.

“These big trends are not that hard to spot…We’re in the middle of an obvious one right now: machine learning & artificial intelligence” — Jeff Bezos

One of the hard parts about working professionally on these technologies is that I take them for granted. So I consider this a post to just reflect on the improvements in image processing, computer vision, translation, natural language processing, text understanding, forecasting and risk analysis.

I’ve worked on some of these technologies, and these challenges, and continue to work on extracting information from CVs and matching candidates to the best jobs for them at Elevate Direct. When you’re in the weeds you sometimes forget what you’re working on, and that you’re part of a major trend.

As Matt Turck says

Big Data provides the pipes, and AI provides the smarts.

Building Full-Stack Vertical Data Products

I’ve been in the Data Science space for a number of years now. I first got interested in AI/Machine Learning in 2009 and have a background typical of a number of people in my field – I come from Physics and Mathematics.

One trend I’ve run into both at Corporates and Startups is that there are many challenges to deploying Data Science in a bureaucratic organisation – or delivering Enterprise Intelligence. Running into this problem led me to be interested in building data products.

One of the first people I saw building AI startups was Bradford Cross – and he’s been writing lately about his predictions for 2017 in the Machine Learning startup space.

I agree with his precis that we’ll begin to see successful vertically-oriented AI startups solving full-stack industry problems that require subject matter expertise, unique data, and a product that uses AI to deliver its core value proposition.

At Elevate Direct we’re working on the problem of sourcing and hiring contractors – one of the fundamental problems companies have, which is hiring the best contractor talent out there.

So what are some of the reasons that it can be hard to deploy Data Science internally at a corporate organisation? I think a number of the patterns are related to other patterns we see in terms of software.

  1. Not being capable of building consumer facing software – Large (non-tech) organisations sometimes struggle to build and deliver software internally. I’ve seen a number of organisations fail to do this; their build process can take 6 months.
  2. Organisational anti-patterns – I’ve seen some organisations that rapidly inhibit the ability to deploy product. Some of these anti-patterns are driven by concerns about the risk of deploying software, and they often end up with diffuse ownership – where an R and D team can blame the operations team and vice versa.
  3. Building Data Products is risky – Building data products is hard and risky – I think you really need to approach data products in a lean-startup kinda way. Deploy often; if it works it works, if not cut it. Sometimes the middle-management of large corporates is risk-averse and so finds these kinds of projects scary. It also needs a lot of expertise: subject-matter expertise, software expertise, machine learning expertise.
  4. Not allowing talented technical practitioners to use Open Source/pick the tools – I once worked at a FTSE 100 company where it took me about 6 weeks to be able to install Open Source software tools such as R and Python. It severely restricted my productivity; in that time at a startup my team probably deployed about 1000 changes into production, to a customer-facing app. This reminds me of number 3 here. Don’t restrict the ability of your talented and well-trained people to deliver value. It makes no sense from a business point of view. Data Science produces value only when it produces products or insights for the business or the customers.
  5. Not having a Data Strategy – Data Science is most valuable when it aligns with the business strategy. Too often I’ve seen companies hiring data scientists before they have actual problems for them to work on. I’ve written about this before.
  6. Long term outsourcing deals – This is an insidious one, and one that came from a period of time when “IT didn’t matter”, before big Tech companies proved its value in the consumer space, for example in e-commerce. It’s impossible to predict what will be the key tech for the next 10 years, so don’t lock yourself to a vendor for that period of time. Luckily this trend is reversing – we’re seeing the rise of agile, MVP, cloud computing, design thinking, getting closer to the customer. A great article on this re-shoring is here.

I think fundamentally a lot of these anti-patterns come from not knowing how to handle risk correctly. I like the idea in that RedMonk article that big outsourcing is a bit like CDOs in finance. Bundling the risk into one big lump doesn’t make the risk go away.

I learn this day after day working on building data products and tools at Elevate. Being honest about the risks and working hard to de-risk projects and drive down that risk in an agile way is the best we can do.

Finally, I think we’re just getting started building Data Products and deploying data science. It’ll be interesting to see what other anti-patterns emerge as we grow up as an industry. This is also one of the reasons I’ve joined a startup and why I’m very excited to work on an end-to-end Data Product, which is solving a real business problem.

Interview with a Data Scientist: Juan Pablo Isaza Aristizábal

I recently gave a keynote at www.pycon.co, the first PyCon conference in Colombia. I spoke on Data Science Models in Production, lessons learned and the cultural aspects.
I interviewed a Colombian Data Scientist – Juan Pablo Isaza Aristizábal
1. What project that you have worked on do you wish you could go back to and do better?
Back in 2015 I was working for Tappsi, a popular app to call taxis. They have a huge problem with fulfilling cab demand at peak hours, because Bogotá has a horrible traffic congestion problem that is only getting worse. So we were using algorithms to try to fulfil as much demand as we could. One of the projects was to predict demand over a 10-minute future window for each neighbourhood, so that drivers could head to the neighbourhoods with the highest odds. I did the real-time data ingestion and machine learning, then we published an MVP of the feature, but the algorithm was too slow. In the end the project was never completed because developers were doing other stuff and I ended up optimising other algorithms that increased metrics with half the hassle and complexity. Afterwards I realised that my lack of experience led me to write the program in the wrong language, and that I made wrong assumptions that led to low performance.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
I think academia is a great place to learn abstract and complex subjects, while industry is a great place to learn practical and social skills. It’s easier to succeed in a working environment if you find a balance between being academic enough and not forgetting practical aspects and communication skills with non-tech people. At the university there are brilliant people working on cutting-edge topics, although they might not know how to deal with more concrete aspects. In industry, meanwhile, you can see fast, practical developers who know the latest tools but fall short trying to optimise a SQL query because they don’t know how an indexed query works behind the scenes in a database.
So I would advise you to study as much as you can, but always try to think of the possible applications of what you are learning. Also, communication skills with non-tech people are extremely important. I have seen a couple of guys having meetings with sales people and trying to explain statistical tests and p-values with no success.
3. What do you wish you knew earlier about being a data scientist?
I wish I had had an earlier chance to work at startups. Being in Colombia made it difficult to work in the tech industry; before that I had a couple of jobs not related to data science or software development.
4. How do you respond when you hear the phrase ‘big data‘?
I find the term a little misleading and simplistic. It’s popular because it’s easy for the general public to understand, while algorithms are not. Technically, though, the term might just be a synonym for a couple of tools such as Amazon Redshift and Hadoop. Big data has enabled new and exciting applications, but it is by no means the only or biggest factor contributing to current advances in the field; new algorithms such as deep neural networks and reinforcement learning, and a strong open source community, have enabled a lot of improvement over the last few years.
5. What is the most exciting thing about your field?
For me it is the excitement of science, which I have embraced ever since I was a little kid, and the development speed and practicality of engineering. Being able to take an idea and transform it into a working prototype in a few days is an amazing feeling; machine learning applications especially are really exciting to work with.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Usually I take an iterative approach, starting with the most obvious relations and the easiest data to handle, and trying to get to the answer in a series of incremental steps as I refine each input and expand the data set. It’s similar to how you would build an MVP: first it is simple, then it becomes better with each version, and finally the customer or user says: that’s good enough!
7. Can you talk a bit about the state of the tech industry and data science in Colombia? What would you change? What gives you hope?
Data science is coming as a byproduct of software development; there isn’t much, but still we are improving in giant steps. In the last few years startups such as Tappsi, Domicilios, Mercadoni, Rappi or Bunny have become more commonplace than in the past. What gives me hope is the peace deal with the FARC guerrilla group. This crucial event will bring many more foreigners here, as well as investment that can power new ideas.
I would shift the local focus of many startups to a more global one. This is difficult because the market is small and the economy is far behind other nations, making our problems and solutions different from those of the more advanced economies. Still, there are Colombian startups with a global focus such as Bunny or VOIQ.
Thanks and best regards!

Cookies

Based on http://larahogan.me/donuts/

Taken from her blog

Years ago, I found that whenever something awesome happened in my career – maybe I got published, or promoted, or launched a project – I wouldn’t take the time to celebrate the achievement. I’m an achiever by nature, the kind who feels like every day starts at zero. Not deliberately marking these moments left me feeling like I wasn’t actually accomplishing anything. “Oh cool, that A List Apart article went up,” I would think, then move on with my day.

Once I realized that this was happening, I decided to be deliberate about marking achievements by eating one donut. Well, sometimes more than one, if it’s a really big deal. The act of donut-eating has actually helped me feel like I’m accomplishing my career goals. As I started to share this idea with more people, I found that it resonated with others, especially young career-driven women who are routinely achieving goals and furthering their career but don’t take the time to note their own success.

I decided to start celebrating in a public way so that more people may be inspired to find their own ways of marking their career achievements. These are those donuts.

I’ve decided to use chocolate chip cookies instead.

Cookie 13/2/17 to celebrate giving the keynote at PyCon Colombia

AI in the Enterprise (the problem)

I was recently chatting to a friend who works as a Data Science consultant in the London area, and a topic dear to my heart came up: how to successfully do ‘AI’ (or Data Science) in the enterprise. Now I work for an Enterprise SaaS company in the recruitment space, so I’ve got a certain amount of professional interest in doing this successfully.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly it’s worth reflecting on the changes we’ve seen in Consumer apps – Spotify, Google, Amazon, etc – all of these apps have personalised experiences which are enhanced by machine learning techniques depending on the labelled data that consumers provide.

I’ll quote what Daniel Tunkelang (formerly of LinkedIn) said about the challenges of doing this in the enterprise.

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there’s little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there’s an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don’t reward it

I’ve personally seen this when working with enterprises, and as a consultant. The data is often very noisy, and while there are techniques to overcome that, such as ‘distant supervision‘, it does make things harder than, say, building Ad-Tech models in the consumer space or customer churn models, where the problem is more explicitly solvable by supervised techniques.

In my experience and the experience of others, enterprises are much more likely to try to buy in off-the-shelf solutions, but (to be sweepingly general) they still don’t have the expertise to understand/validate/train the models. There are often individuals in small teams here & there who’ve self-taught or done some formal education, but they’re not supported. (My friend Martin Goodson highlights this here.) There needs to be a cultural shift. At a startup you might have a CTO who’s willing to trust a bunch of relatively young data science chaps to try to figure out an ML-based solution that does something useful for the company without breaking anything. And it’s also worth highlighting that there’s a difference in risk aversion between enterprises (with established practices etc) and the more exploratory or R and D mindset of a startup.

Those of us with somewhat more experience these days tend to have a reasonable idea of what can be done, what’s feasible, and how to convince the CEO that it’s doing something useful for the company’s valuation.

Startups are far more willing to give things a go, because they face an existential threat. And let’s not forget that Venture Capitalists and the assorted machinery often expect Artificial Intelligence, so this is encouraged.

Increasingly I speculate that established companies now outsource their R and D to startups, hence the recent acquisitions like the one by GE Digital.

So I see, roughly speaking, two solutions to this problem: two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking and approaches. I know of one large retailer that implemented this through 13-week agile projects: they’d do a consultation, then choose one team to build a solution for.

2) Start putting staff through training schemes similar to what is offered by General Assembly (there are others), but do it whole teams at a time. The culture of code review and programmatic analysis has to come back and be implemented at work. Similarly, give the team managers additional training in agile project management etc.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I’d like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald and Mick Delaney. I’m sorry to anyone else I’ve forgotten.

 

2016: In Review


I’m mostly writing this for me, but maybe it will be interesting to you too! Here are some things that happened in 2016 (just to me personally, mostly about programming). This is based on the excellent post by Julia Evans.

Open Source

I continued being involved in PyMC3. This has taught me a lot about programming and the challenges of shipping software. The code reviews by Thomas Wiecki and the others have been amazing.

I helped pick the new logo and worked on the project becoming fiscally sponsored by NumFOCUS. For those of you who don’t know, NumFOCUS is an organisation that supports diversity in open source, open source projects, and the conferences associated with them. It largely focuses on the Python ecosystem but has branched out to other projects.

Learning about this has taught me a lot about the governance aspects of OSS – and our responsibilities to encourage more people into this ecosystem. I consider that an important part of my duties as a member of the Open Source world.

Talks

  • I spoke at PyData London – about statistics with Python.
  • I spoke at the Toulouse Data Science Meetup – about the PyData ecosystem.
  • I keynoted at PyData Amsterdam – I spoke about the current PyData ecosystem and what various tools like Dask, NumPy, Numba, etc are all for.
  • I gave a talk at the Bayesian Mixer in London on the state of PyMC3. I spoke a bit about the new tools in Variational Inference, which has been a research topic of mine for the past year. I wish I had time to finally write some slides on that.

Doing the PyData keynote was kind of exciting/scary (me??? keynote???) and I think it turned out well and I’m happy I did it. I love the PyData community and I’m happy with the talk I gave.

It’s been fun to experience some of the other places that are doing Data Science and forming communities. At each of these events I’ve met a lot of cool people. It’s great to see our industry grow up!

In 2017 I’ll be keynoting at PyCon Colombia in February. I’m excited to give this talk. I also want to go to a conference like NIPS/KDD/ICLR/ICML to stay a bit closer to some of the improvements in the Machine Learning world from academia and industry.

cool: Writing for Hakka Labs

  • I was honoured to be featured on Hakka Labs. Hakka Labs run the excellent Data Eng Conference and publish some awesome content on their blog. I wrote about Three Things I learned about Machine Learning; this is an ongoing journey where I realise how little I know.

cool: Blog

Some of my favourite posts this year have been:

  • A map of the PyData Stack – This was an idea floated with Thomas Wiecki before. I finally got around to doing this for my keynote. The aim was to give people a ‘map of the PyData stack’ and explain what the different tools were for.
  • I interviewed one of my heroes – Greg Linden, who helped devise the first Collaborative Filtering algorithm in production at Amazon.
  • I did some other interviews – I liked this one too, with Masaaki Horikoshi, one of the most prolific contributors to the PyData ecosystem.

I’ll continue to do some interviews over the next year, and hopefully add them to a revised book.

cool: moving to London

I moved to London in late March. I’ve found it very exciting to be close to the Machine Learning and Data Science communities there. It was a hectic few months adjusting to new job(s), but I’m glad I made the move.

I think everyone should spend some time in a major city when they’re young.

I hope to blog a bit more about work in the next few months.

cool: Teaching Data Science

My friend John Sandall mentioned a Teaching Assistant gig at General Assembly.

I helped about 20 students learn more about Data Science. They came from various backgrounds, and sharing my own experiences reminded me that 1) I knew stuff and 2) teaching is hard.

I recommend that all Data Scientists and Engineers teach if they get the time. It’s a great experience and I learned a lot about what was easy and hard in Machine Learning.

conclusions?

some things that worked:

  • asking a lot of questions about how computers work (not a surprise)
  • working on a team of people who know more stuff than me, and listening to what they have to say
  • asking for advice from people who are more experienced than me
  • at work, figuring out what’s important to do and then doing the work to get it done, especially if that work is boring / tedious
  • working on one thing at a time (or at least not too many things)
  • getting a bit better at software “process” things like design documents and project plans
  • learning how to mentor junior data scientists – this is something I’m continuing to do
  • learning more about leading teams in ML – which is hard. I probably won’t be doing too much people stuff over the next few months.