What I’ve learned recently


I’ve recently been attending a lot of conferences, and speaking at them.

However – this talk was insightful and I wish I had seen it in person.

The key takeaway for me was the idea of putting your work up on a wall and having people talk about it. Communicating results is really important, and a hard skill to learn. 🙂

Talks and Workshops


I enjoy giving talks and workshops on data analytics. Here is a list of some of the talks I’ve given. During my master’s in Mathematics I regularly gave talks on technical topics, and previously I worked as a tutor and technician at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!


None planned so far

Slides and Videos from Past Events

Keynote at PyCon Colombia Feb 2017. The slides are here.

I gave a tutorial called ‘Lies, damned lies and statistics’ at PyData London 2016, discussing different statistical and machine learning approaches to the same kinds of problems.

Talk: Can Probabilistic Programming be applied to Rugby?


Yesterday evening I gave a talk at the Data Science Meetup in Luxembourg.

This was part of my preparation for my talk at PyData Berlin, the Python conference for data enthusiasts.

A few remarks – my slides from last night are here in IPython notebook format.

For the presentation I used the excellent RISE library from https://twitter.com/damian_avila, which easily converts IPython notebooks into Reveal.js format; I really recommend it.

Abstract of my talk: Probabilistic programming and Bayesian methods are called by some a new paradigm, with numerous interesting applications, for example in quantitative finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC from Python to implement these methods. I’ll apply them to ‘rugby sports analytics’, in particular modelling the winning team in the recent Six Nations tournament. I will discuss the framework and how, as a non-expert, I was able to quickly and easily produce an innovative and powerful model.

The talk also serves as a useful example of probabilistic programming: why it is useful, and how to use PyMC to model an event rather than, say, a domain-specific language like Stan.
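The generative structure of a model like the one in the talk can be sketched in a few lines: each team gets an attacking and a defensive strength, and match scores are Poisson-distributed around a log-linear rate. This is only an illustrative simulation with made-up parameter values; in PyMC those parameters would be given priors and inferred from the observed Six Nations scores.

```python
import numpy as np

rng = np.random.default_rng(42)
teams = ["Wales", "France", "Ireland", "Scotland", "Italy", "England"]
n = len(teams)

# Illustrative latent parameters. In the PyMC version these would be
# random variables with priors (e.g. Normal), inferred from match data.
attack = rng.normal(0.0, 0.5, size=n)   # attacking strength per team
defence = rng.normal(0.0, 0.5, size=n)  # defensive strength per team
home_adv = 0.2                          # shared home advantage
intercept = 2.5                         # baseline log scoring rate

def simulate_match(i, j):
    """Simulate points scored by home team i against away team j."""
    theta_home = np.exp(intercept + home_adv + attack[i] - defence[j])
    theta_away = np.exp(intercept + attack[j] - defence[i])
    return rng.poisson(theta_home), rng.poisson(theta_away)

home_pts, away_pts = simulate_match(0, 5)  # e.g. Wales vs England
```

Running inference over this structure (rather than hand-picking the parameters, as here) is exactly the kind of thing PyMC makes quick and easy for a non-expert.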

Interview with a Data Scientist: Eddie Bell


Eddie Bell is Lead Data Scientist at http://www.lyst.com, a fashion recommendation website.

Eddie has a PhD in Mathematics and, before he saw the light and joined Lyst, he used to work in finance!

1. What project that you’ve worked on do you wish you could go back to and do better?

At one point we moved all our data processing infrastructure to Storm. We’re a Python shop with very little Java experience and it was an absolute nightmare: dealing with Maven and dependency hell, trying to deploy automatically, testing, VM parameters. The actual model worked well in Storm, but we just weren’t prepared for all the supporting infrastructure.

In the end we moved back to Python and used Celery instead, which suited us perfectly. The Storm transition cost me three months of my life. If I could go back in time I’d just stick with Python.

I guess the take home message is: when you start a new project, you
really have to think about the cost of using a new technology.
Although learning a new technology is fun you should first try solving
the problem with a technology you are already familiar with.

2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

I would say there are three important areas:

  • Programming: absolutely learn to program. The better you can program, the more independent you can be as a data scientist.
  • Theoretical foundations: mostly stats and linear algebra.
  • Communication: you need to communicate well with your colleagues and the community.

3. What do you wish you knew earlier about being a data scientist?

How much of the job involves talking with the business. You have to be able to translate business goals into machine learning solutions. You also have to be able to tell people why some ideas are not possible to implement. But you have to be very careful not to rule out their crazy ideas; they might just teach you something! For example, last year someone asked if I could generate descriptions from images. I laughed and said it was impossible, but now people are actually doing it.

4. How do you respond when you hear the phrase ‘big data’?

Haha. I shudder, because it doesn’t really mean anything.

5. What is the most exciting thing about your field?

I’m all about building production machine learning systems, so for me applications of deep learning are the most exciting. Deep models are not magic bullets, but they can achieve impressive results.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations, etc.? How do you know what is good enough?

My hand-wavy answer is ‘intuition’, but more practically, agile development has a concept called the MVP (minimum viable product). MVPs let you iterate quickly, so failures cost you less. The same can be applied to machine learning: first try to solve the problem on a simple data set with a simple model. If that shows promise, then you can develop more complex models with bigger and better data.
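Eddie’s MVP approach starts even simpler than a “simple model”: establish a trivial baseline that any real model has to beat. A minimal pure-Python sketch, with a made-up toy data set for illustration:

```python
from collections import Counter

# Toy labelled data; in practice this would be a small sample of real
# data, just enough to sanity-check the problem framing.
train_labels = ["churn", "stay", "stay", "stay", "churn", "stay"]
test_labels = ["stay", "churn", "stay", "stay"]

# Simplest possible model: always predict the most common class.
majority_class, _ = Counter(train_labels).most_common(1)[0]
predictions = [majority_class for _ in test_labels]

accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(f"majority-class baseline accuracy: {accuracy:.2f}")  # prints 0.75
```

If a complex model can’t clearly beat this number, the extra complexity isn’t earning its keep yet, and the iteration has cost you almost nothing.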

Interviews with Data Scientists – The collection


In the last year or so I’ve put together a collection of interviews with Data Scientists.

Here is a list of them

  • Thomas Levi – Plenty of Fish, an online dating website
  • Jeroen Latour – Data Scientist at Booking.com from September
  • James Long – A data artisan with experience in reinsurance
  • Radim Rehurek – A data science and engineering consultant, famous for the Gensim topic modelling library
  • Ignacio Elola – The data guru at Import.io, a platform for enabling easy access to web data
  • Eddie Bell – Lead Data Scientist at Lyst, a fashion recommendation website
  • Andrew Clegg – Data Scientist at Etsy, an online craft marketplace
  • Keith Bawden – Former Business Intelligence Manager at Amazon and Head of Engineering Japan at Groupon
  • Matt Hall – A scientific programmer running a geophysical consultancy in Canada
  • Alejandro Correa Bahnsen – My friend and co-organizer of the Data Science meetup in Luxembourg
  • Trent McConaghy – One of my panelists at the PyData meetup in Berlin; a successful entrepreneur and AI researcher
  • Hadley Wickham – The legendary creator of R data tools for power users, including ggplot2
  • Jon Sedar – A data science consultant, on the consulting side of data science
  • Cameron Davidson-Pilon – Lead Data Scientist at Shopify, creator of Lifelines, and lead author of Bayesian Methods for Hackers
  • Ian Huston – Senior Data Scientist at Pivotal; active in the Data Science and PyData scenes in London and throughout Europe, and a great contributor of blog posts on communication skills
  • Shane Lynn – Co-founder of Kill Biller, an Irish startup; his background includes consulting and a technical PhD
  • Vanessa Sabino – Lead Data Scientist (focusing on marketing) at Shopify; she comes from Brazil and talks about how she thought tech only happened above the equator
  • Ian Ozsvald – All-round cool speaker and data geek; he runs a successful boutique data science consultancy in London and organizes the PyData meetups and conferences there
  • Peadar Coyle – Peadar is the author of this blog; by popular demand he submitted himself to the questions too

Interview with a Data Scientist: Radim Rehurek

I recently met Radim at a Python conference in Florence. You can visit his website at http://radimrehurek.com.
Radim has a PhD in Computer Science and helped design the excellent Gensim library. I sent him an email and he answered these questions; I lightly edited his answers.
A key piece of information to add is that Radim has been an independent consultant for a number of years, with over 10 years of experience in industry. He has trained and mentored others in machine learning and data processing, and his experience includes content targeting, game dev, digital libraries and search engines.
So he’s well qualified to comment on data science, especially given his experience running his own consultancy and specialising in text analytics, NLP and search.
1. What project that you’ve worked on do you wish you could go back to and do better?
We all learn constantly. But I have no failures gnawing at my conscience, no.
Or, to go a bit “meta”: the mental knobs and levers that decide where to push harder and where to let go, they have changed over time, yes. I’d spend more focus on understanding the global business perspective now, and less on technicalities, optimizations. I guess that’s a natural progression.
2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?
I wouldn’t presume to give out advice. Everybody has different goals and priorities in life.
Or, to quote a classic, “Try and be nice to people, avoid eating fat, read a good book every now and then, get some walking in, and try and live together in peace and harmony with people of all creeds and nations.”
Plus, I’d add that presentation and sales skills matter (more than you think). Some cultures are innately better at it than others; I prepared an advanced infographic for you, illustrating the painful difference.
3. What do you wish you knew earlier about being a data scientist?
How to value the initial problem cracking and scoping (the “business analysis”) properly.
You know, that stuff that happens before you write any code or design algorithms or whatever concrete work. Before the contract is even signed (I used to think).
When a new client came and wanted a quote for a project, I’d think really hard about the problem, research around, come up with a viable solution. (And generally, when people come to consultants for help, it’s not because the problem is well scoped and easy to solve.)
In retrospect, this is completely insane. That’s the most valuable part of consulting! But I thought that’s expected, that I should already know this stuff (I’m an expensive consultant right?!).
By the time I created a proposal, the problem was practically solved – broken into actionable steps, with reasonable time estimates and all. The client could just have a chuckle, go “goodbye and thank you very much”, and hand my proposal out as specs to their own developers.
I’ve stopped doing that.
Silver lining: I got good at estimating all kinds of projects, in sundry domains and verticals 🙂
4. How do you respond when you hear the phrase ‘big data’?
Depends who says it and why. No predetermined generic reaction.
My view on hype in general: my anti-bullshit radar is notoriously biased. I’m very conservative. I only opened a Twitter account a year and a half ago! (Come and say hi, btw.)
But we are in the middle of a data revolution, no doubt about it. So as long as you don’t capitalize Big Data, I’m good. It’s my job as a consultant to manage clients’ expectations of fads and choose the right technology.
5. What is the most exciting thing about your field?
Building stuff that makes a difference in the Real World™.
I left academia because it was too academic, and industry employment because it was too menial and personally inert… Now I live and consult on the ethereal intersection of both.
On a related note, in my experience, a well tuned system (search, recommendation, entity detection, query correction, classification, whatever) beats an application of the “latest exciting research paper” any day of the week. Beating a baseline of a system that has been tuned by domain experts and battle-hardened by years of experience is damn hard. Spectacular math formulas and latest tech are cute and good PR, but the devil is always in the details.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
With honest communication (doh).
I learned to ask clients for sample data upfront, right at the beginning. Mock data is fine. Forces the client to concretize what they have and want, often with surprising (to them) results.
Also, clarify up front that problem analysis (framing the right questions, understanding their business domain, its constraints) is a paid part of the process. See above on “business analysis”. Contrary to popular opinion, the actual machine learning algorithm is a tiny, tiny component of a successful data mining project. Anybody can whip out a Naive Bayes classifier in a few hours from scratch (and NB-level stuff is all that many projects realistically need, despite what they read on the latest TechCrunch or Hacker News).
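Radim isn’t exaggerating about Naive Bayes: a multinomial NB classifier with Laplace smoothing really is a from-scratch afternoon job. A minimal sketch on a made-up spam/ham toy data set (everything here is illustrative, not anyone’s production code):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes. docs: list of (words, label) pairs."""
    labels = Counter(label for _, label in docs)       # class counts
    word_counts = defaultdict(Counter)                 # per-class word counts
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return labels, word_counts, vocab

def predict_nb(model, words):
    """Return the label maximizing log P(label) + sum log P(word|label)."""
    labels, word_counts, vocab = model
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for label, count in labels.items():
        lp = math.log(count / total)                   # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:                                # Laplace-smoothed likelihoods
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy data, purely illustrative.
docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills buy".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham"),
        ("tomorrow lunch meeting".split(), "ham")]
model = train_nb(docs)
print(predict_nb(model, "cheap pills".split()))  # spam
```

The point stands: the hard, valuable work is everything around this block, i.e. scoping, data flow, integration, not the thirty lines of classifier.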
The “what’s good enough” & “how the data flows” & “how components integrate and update” and other tricky questions usually come out of the analysis, iterating over solutions and communicating with the client, fairly naturally.