Talks and Workshops

Featured

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my master’s in Mathematics I regularly gave talks on technical topics, and before that I worked as a technician at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional analyst!

Upcoming

I’ll be speaking at PyData in Berlin and in London.
The blurb for my upcoming PyData Berlin talk is below.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in quantitative finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I’ll apply these methods to the problem of ‘rugby sports analytics’, in particular how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

My PyData London tutorial will be an extended, more hands-on version of the above talk.

Slides and Videos from Past Events

In May 2015 I gave a preview of my PyData Berlin talk at the Data Science Meetup, on ‘Probabilistic Programming and Rugby Analytics’, where I presented a case study and an introduction to Bayesian statistics to a technical audience. My case study was the problem of how to predict the winner of the Six Nations. I used the PyMC library in Python to build up statistical models as part of the probabilistic programming paradigm. This was based on my popular blog post, which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave the talk from an IPython notebook, which proved to be a great way to present this technical material.
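To give a flavour of the approach, here is a minimal sketch of this kind of hierarchical model in PyMC3 syntax (the talk itself used PyMC). The fixtures and scores below are made up, and the real model has more structure – treat this as an illustration rather than the exact code from the talk.

```python
import numpy as np
import pymc3 as pm

# Made-up example data: indices of home/away teams and points scored.
n_teams = 6
home_team = np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3])
away_team = np.array([1, 2, 3, 4, 5, 0, 2, 3, 4, 5])
home_points = np.array([20, 13, 27, 10, 16, 23, 30, 9, 18, 22])
away_points = np.array([13, 20, 6, 22, 16, 10, 3, 14, 18, 12])

with pm.Model():
    # A global home advantage plus per-team attack and defence strengths.
    home = pm.Normal('home', mu=0.0, sd=1.0)
    atts = pm.Normal('atts', mu=0.0, sd=1.0, shape=n_teams)
    defs = pm.Normal('defs', mu=0.0, sd=1.0, shape=n_teams)

    # Log-linear scoring rates for each fixture.
    home_theta = pm.math.exp(home + atts[home_team] + defs[away_team])
    away_theta = pm.math.exp(atts[away_team] + defs[home_team])

    pm.Poisson('home_obs', mu=home_theta, observed=home_points)
    pm.Poisson('away_obs', mu=away_theta, observed=away_points)

    trace = pm.sample(2000)
```

Sampling from the posterior over attack and defence strengths then lets you simulate future fixtures and estimate each team’s chance of winning the tournament.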

In October 2014 I gave a talk at Impactory in Luxembourg, a co-working space and tech accelerator. This was an introductory talk to a business audience on ‘Data Science and your business’. I talked about my experience at small firms and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on a mathematical modelling engine that was the backbone of a ‘data product’. The aim of the talk was to explain what a ‘data product’ is, to discuss some of the challenges of getting data science models into production code, and to cover the tool choices I made in my own case study. This high-level talk got a great response from the audience, and I gave a version of it at PyCon in Florence in April 2015. I expect a video to go up soon!

In July 2014, when I was a freelance consultant in the Benelux, I gave a private five-minute talk on Data Science in the games industry. Here are the slides.

Talk: Can Probabilistic Programming be applied to Rugby?

Yesterday evening I gave a talk at the Data Science Meetup in Luxembourg.

This is part of my preparation for my talk at PyData Berlin, the Python conference for data enthusiasts.

A few remarks – my slides from last night are here in IPython notebook format.

For the presentation I used the excellent RISE library from https://twitter.com/damian_avila, which easily converts IPython notebooks into Reveal.js slideshows. I really recommend it.

Abstract of my talk: Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in quantitative finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC from Python to implement these methods. I’ll apply these methods to the problem of ‘rugby sports analytics’, in particular how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.

The talk also serves as a useful example of probabilistic programming: why it is useful, and how to use PyMC to model an event rather than, say, a domain-specific language like Stan.

Interview with a Data Scientist: Eddie Bell

Eddie Bell is Lead Data Scientist at http://www.lyst.com, a fashion recommendation website.

Eddie has a PhD in Mathematics, and before he saw the light and joined Lyst he used to work in Finance!

1. What project that you have worked on do you wish you could go back to and do better?

At one point we moved all our data processing infrastructure to Storm. We’re a Python shop with very little Java experience and it was an absolute nightmare: dealing with Maven and dependency hell, trying to deploy automatically, testing, VM parameters. The actual model worked well in Storm but we just weren’t prepared for all the supporting infrastructure.

In the end we moved back to Python and used Celery instead, which suited us perfectly. This Storm transition cost me 3 months of my life. If I could go back in time then I’d just stick with Python.

I guess the take-home message is: when you start a new project, you really have to think about the cost of using a new technology. Although learning a new technology is fun, you should first try solving the problem with a technology you are already familiar with.
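For readers who haven’t used it, a Celery task of the kind Eddie describes looks roughly like the sketch below. This is not Lyst’s code – the broker URL and task body are hypothetical.

```python
from celery import Celery

# Hypothetical broker URL; Celery also supports RabbitMQ and others.
app = Celery('pipeline', broker='redis://localhost:6379/0')

@app.task
def process_item(item):
    # Stand-in for a real data processing step.
    return {'id': item['id'], 'processed': True}

# A producer enqueues work with process_item.delay({'id': 42});
# a separate worker process, started with `celery -A pipeline worker`,
# picks the task up and runs it.
```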

2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?

I would say there are three important areas.
1) Absolutely learn to program: the better you can program, the more independent you can be as a data scientist. 2) Theoretical foundations: mostly stats and linear algebra. 3) Communication: you need to communicate well with your colleagues and the community.

3. What do you wish you knew earlier about being a data scientist?

How much of the job involves talking with the business. You have to be able to translate business goals into machine learning solutions. You also have to be able to tell people why some ideas are not possible to implement. But you have to be very careful not to rule out their crazy ideas; they might just teach you something! For example, last year someone asked if I could generate descriptions from images. I laughed and said it was impossible, but now people are actually doing it (http://deeplearning.cs.toronto.edu/i2t).

4. How do you respond when you hear the phrase ‘big data’?

Haha, shudder because it doesn’t really mean anything.

5. What is the most exciting thing about your field?

I’m all about building production machine learning systems, so for me applications of deep learning are the most exciting. Deep models are not magic bullets but they can achieve impressive results.

6. How do you go about framing a data problem? In particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?

My hand-wavy answer is ‘intuition’, but more practically, agile development has a concept called the MVP (minimum viable product). MVPs let you iterate quickly, so failures cost you less. The same can be applied to machine learning: first try to solve a problem on a simple data set with a simple model. If that shows promise then you can develop a more complex model with bigger and better data.
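To make the MVP idea concrete, the ‘simple model on a simple data set’ step might look like the scikit-learn baseline below. The data is synthetic and the model deliberately boring – that is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a first, small data set.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The MVP: a simple, well-understood baseline.
baseline = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))
```

Only if this baseline shows promise is it worth reaching for a more complex model and bigger data.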

Interviews with Data Scientists – The collection

In the last year or so I’ve put together a collection of interviews with Data Scientists.

Here is a list of them:

  • Thomas Levi – Plenty of Fish, an online dating website
  • James Long – A data artisan with experience in re-insurance
  • Radim Rehurek – A data science and engineering consultant, famous for the Gensim topic modelling library
  • Ignacio Elola – The data guru at Import.io, a platform for enabling easy access to web data
  • Eddie Bell – Lead Data Scientist at Lyst, a fashion recommendation website

Interview with a Data Scientist: Radim Rehurek

I recently met Radim at a Python conference in Florence. You can visit his website at http://radimrehurek.com.
Radim has a PhD in Computer Science and helped design the excellent Gensim library.
So I sent him an email and he answered these questions; I lightly edited his answers.

A key piece of information to add is that Radim has been an independent consultant for a number of years, with over 10 years of experience in industry. He has trained and mentored others in machine learning and data processing, and his experience includes content targeting, game dev, digital libraries and search engines.
So he’s well qualified to comment on Data Science, especially given his experience running his own consultancy and specialising in text analytics, NLP and search.
1. What project that you have worked on do you wish you could go back to and do better?
We all learn constantly. But I have no failures gnawing at my conscience, no.

Or, to go a bit “meta”: the mental knobs and levers that decide where to push harder and where to let go have changed over time, yes. I’d focus more on understanding the global business perspective now, and less on technicalities and optimizations. I guess that’s a natural progression.
2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?
I wouldn’t presume to give out advice. Everybody has different goals and priorities in life.

Or, to quote a classic, “Try and be nice to people, avoid eating fat, read a good book every now and then, get some walking in, and try and live together in peace and harmony with people of all creeds and nations.”

Plus, I’d add that presentation and sales skills matter (more than you think). Some cultures are innately better at this than others; I prepared an advanced infographic for you, illustrating the painful difference.

3. What do you wish you knew earlier about being a data scientist?
How to value the initial problem cracking and scoping (the “business analysis”) properly.
You know, the stuff that happens before you write any code, design algorithms, or do any other concrete work. Before the contract is even signed (I used to think).

When a new client came and wanted a quote for a project, I’d think really hard about the problem, research around, and come up with a viable solution. (And generally, when people come to consultants for help, it’s not because the problem is well scoped and easy to solve.)
In retrospect, this is completely insane. That’s the most valuable part of consulting! But I thought that was expected, that I should already know this stuff (I’m an expensive consultant, right?!).

By the time I created a proposal, the problem was practically solved — broken into actionable steps, with reasonable time estimates and all. The client could just have a chuckle, go “good bye and thank you very much”, and hand my proposal out as specs to their own developers.
I’ve stopped doing that.
Silver lining: I got good at estimating all kinds of projects, in sundry domains and verticals :)
4. How do you respond when you hear the phrase ‘big data’?
Depends who says it and why. No predetermined generic reaction.

My view on hype in general: my anti-bullshit radar is notoriously biased. I’m very conservative. I only opened a Twitter account a year and a half ago! (Come and say hi, btw.)

But we are in the middle of a data revolution, no doubt about it. So as long as you don’t capitalize Big Data, I’m good. It’s my job as a consultant to manage clients’ expectations of fads and to choose the right technology.
5. What is the most exciting thing about your field?
Building stuff that makes a difference in the Real World™.

I left academia because it was too academic, industry employment because it was too menial and personally inert… Now I live and consult on the ethereal intersection of both.

On a related note, in my experience a well-tuned system (search, recommendation, entity detection, query correction, classification, whatever) beats an application of the “latest exciting research paper” any day of the week. Beating a baseline system that has been tuned by domain experts and battle-hardened by years of experience is damn hard. Spectacular math formulas and the latest tech are cute and good PR, but the devil is always in the details.
6. How do you go about framing a data problem? In particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
With honest communication (doh).

I learned to ask clients for sample data upfront, right at the beginning. Mock data is fine. Forces the client to concretize what they have and want, often with surprising (to them) results.

Also, clarify up front that problem analysis (framing the right questions, understanding their business domain and its constraints) is a paid part of the process. See above on “business analysis”. Contrary to popular opinion, the actual machine learning algorithm is a tiny, tiny component of a successful data mining project. Anybody can whip out a Naive Bayes classifier in a few hours from scratch (and NB-level stuff is all that many projects realistically need, despite what they read on the latest TechCrunch or HackerNews).
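To illustrate Radim’s point, here is roughly what a from-scratch Naive Bayes looks like – a minimal Gaussian variant for illustration, not production code.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with conditionally independent Gaussian features."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # Small floor on the variance to avoid division by zero.
        self.vars = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # Log posterior (up to a constant) for each class.
        log_post = [
            np.log(p) - 0.5 * np.sum(np.log(2 * np.pi * v) + (X - m) ** 2 / v, axis=1)
            for p, m, v in zip(self.priors, self.means, self.vars)
        ]
        return self.classes[np.argmax(log_post, axis=0)]
```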

The “what’s good enough” & “how the data flows” & “how components integrate and update” and other tricky questions usually come out of the analysis, iterating over solutions and communicating with the client, fairly naturally.

Some quick book recommendations

I don’t have time to write a long post – so I’ll just mention en passant some books that I’ve read recently.

The Myths of Innovation is a brilliant explanation of the challenges of innovation and how it actually works in the real world.

Make it Stick is a great summary of recent research on how we learn. Since we are all knowledge workers now – and thus all ‘continuous learners’ – I highly recommend you read this book before scoping your next learning challenge. Some of it is common sense and some of it is academic, but it is a great compendium of the research, and I’ve made copious notes!

Speaking at the PyData Track at PyCon Sei in Florence, Italy

I’m happy to be a part of the PyData speaking community by speaking at my first PyCon.

Here is the abstract and then some remarks :)

One of the biggest challenges we have as data scientists is getting our models into production. I’ve worked with Java developers to get models into production, and the libraries available in Python don’t always exist in Java – for example, try porting scikit-learn code to Java. One possible solution is PMML; another is to write a spec by hand.

An even better solution: I will explain how to use Science Ops from YhatHQ to build better data products. Specifically, I will talk about how to use Python, pandas, etc. to build a model, test it locally, and then deploy it so that developers get an easy-to-use RESTful API. I will share some of my experiences of working with it, give a use case and some architectural remarks, and run down some alternatives to Science Ops that I’ve found.

Prerequisites: some experience with pandas and the scientific Python stack would be beneficial. This talk is aimed at Data Science enthusiasts and professionals.
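To give a flavour of the workflow before the talk: deploying a model with the Yhat client looks roughly like the sketch below. This is from memory of their documentation – the model, credentials and endpoint are placeholders, and the API may have changed.

```python
from sklearn.linear_model import LinearRegression
from yhat import Yhat, YhatModel, preprocess

# Train a trivial model locally (a stand-in for the real pandas workflow).
model = LinearRegression().fit([[0], [1], [2]], [0, 1, 2])

class MyModel(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, data):
        # Science Ops exposes this method behind a RESTful endpoint.
        return {"prediction": float(model.predict([[data["x"]]])[0])}

# Placeholder credentials and endpoint.
yh = Yhat("USERNAME", "APIKEY", "http://cloud.yhathq.com/")
yh.deploy("MyModel", MyModel, globals())
```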

Firstly, you can check out www.pydata.it for the PyData-focused schedule.

Secondly, my slides are here – https://speakerdeck.com/springcoil/data-products-or-getting-models-into-production – which you can look at before the talk if you wish.

Finally, here is a link to the code I’ll mention in the talk – a simple example of how you would build an ODE model using the PyData stack. The code isn’t excellent, but it is functional and easy to read.

https://gist.github.com/springcoil/dacc5dcadc11d4165473
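In case the gist moves, the general shape of the example is below: solving a simple ODE with the PyData stack. This is a sketch along the same lines, not the exact contents of the gist.

```python
import numpy as np
from scipy.integrate import odeint

def logistic(y, t, r, k):
    """Right-hand side of the logistic growth ODE dy/dt = r * y * (1 - y / k)."""
    return r * y * (1.0 - y / k)

t = np.linspace(0, 10, 100)
y = odeint(logistic, 0.1, t, args=(1.5, 10.0))
print(y[-1])  # the population approaches the carrying capacity k = 10
```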