Talks and Workshops


I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my Master’s in Mathematics I regularly gave talks on technical topics, and before that I worked as a teacher at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional analyst!

Slides and Videos from Past Events

PyCon Ireland: ‘From the Lab to the Factory’ – a talk on the business side of delivering data products; the central metaphor was that shipping a data product is like ‘going from the lab to the factory’. Based on the feedback this talk was well received, and I left my audience with a collection of tools they could use to tackle these challenges.

EuroSciPy 2015: I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended, more hands-on version of the above talk.

I spoke at PyData Berlin – the link is here.

The blurb for my PyData Berlin talk follows.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in Quantitative Finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to the problem of ‘rugby sports analytics’, particularly how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Berlin talk at the Data Science Meetup, on ‘Probabilistic Programming and Rugby Analytics’, where I presented a case study and an introduction to Bayesian Statistics to a technical audience. My case study was the problem of how to predict the winner of the Six Nations. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular blog post, which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great way to present this technical material.

In October 2014 I gave a talk at Impactory in Luxembourg, a co-working space and tech accelerator. This was an introductory talk to a business audience on ‘Data Science and your business’. I talked about my experience at small and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. The aim of the talk was to explain what a ‘data product’ is, to discuss some of the challenges of getting data science models into production code, and to cover the tool choices I made in my own case study. It was a high-level talk and got a great response from the audience, so I gave a version of it at PyCon in Florence in April 2015. Edit: those interested can see the video here – it was a really interesting talk to give, and the questions were fascinating.

In July 2014, when I was a freelance consultant in the Benelux, I gave a private five-minute talk on Data Science in the games industry. Here are the slides.

My mathematical research and talks as a Master’s student are all here. I specialized in Statistics and Concentration of Measure, and it was through this research that I became interested in Machine Learning and Bayesian models.


My Master’s thesis, ‘Concentration Inequalities and some applications to Statistical Learning Theory’, is an introduction to Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.

Fun with Met Office APIs

As a data scientist I often have to extract data from RESTful APIs – something I’m admittedly not very good at.

So I decided to look at the Met Office DataPoint API, which provides weather information for the United Kingdom. You can visit that website to get your own API key if you wish.

Being from near Newry in Northern Ireland, I decided to write a simple API call to extract data for that area.

The aim of this short gist is just to illustrate how you’d write a short, OOP-friendly snippet in Python.
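Something along these lines – a minimal sketch, assuming the public DataPoint base URL and its JSON forecast resource as documented on the Met Office site; the class design and the site id are purely illustrative:

```python
import requests


class DataPointClient(object):
    """A minimal, OOP-friendly client for the Met Office DataPoint API."""

    BASE_URL = "http://datapoint.metoffice.gov.uk/public/data"

    def __init__(self, api_key):
        self.api_key = api_key

    def _get(self, resource, **params):
        # Every DataPoint call authenticates via a 'key' query parameter.
        params["key"] = self.api_key
        response = requests.get("{}/{}".format(self.BASE_URL, resource),
                                params=params)
        response.raise_for_status()
        return response.json()

    def three_hourly_forecast(self, site_id):
        # Three-hourly forecast for a single forecast site.
        return self._get("val/wxfcs/all/json/{}".format(site_id),
                         res="3hourly")


# Hypothetical usage - the real site id for Newry would come from
# DataPoint's sitelist resource:
# client = DataPointClient(api_key="YOUR_API_KEY")
# forecast = client.three_hourly_forecast(site_id="12345")
```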

In the future, I’ll write up something on how to mock out this API since that’s a big interest for me. Pytest looks like one of the best ways to do that.
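To give a flavour, a mocked-out test might look something like this with pytest’s monkeypatch fixture – again just a sketch, reusing the hypothetical DataPointClient from the snippet above:

```python
# Assumes the DataPointClient sketch above is saved as weather.py.
from weather import DataPointClient


class FakeResponse(object):
    """Stands in for requests.Response so no real HTTP call is made."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # pretend the request always succeeded

    def json(self):
        return self._payload


def test_three_hourly_forecast(monkeypatch):
    payload = {"SiteRep": {"DV": {"Location": {"name": "NEWRY"}}}}
    # Swap out requests.get for a stub that returns our canned payload.
    monkeypatch.setattr("requests.get",
                        lambda url, params=None: FakeResponse(payload))
    client = DataPointClient(api_key="dummy")
    forecast = client.three_hourly_forecast(site_id="12345")
    assert forecast["SiteRep"]["DV"]["Location"]["name"] == "NEWRY"
```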

These kinds of API’s are of interest if your model depends on weather information – which a lot of models do.

Thanks to Maciej Kula for providing some much needed code review.

What I’ve been working on

This is just a little wrapper post to include some of the things I’ve worked on lately.

  • I wrote up a short piece on exploring new NumPy features, including the new matrix multiplication operator.
  • I wrote up some PyMC3 examples on my GitHub – these include some Bayesian Logistic Regression and some classical examples of conversion modelling.
  • I wrote up some Pandas examples on my GitHub using some of the new features and time series analysis – lots of this isn’t new or original, and it needs some editing.
  • My series of Interviews with Data Scientists is ongoing, on this blog and on Dataconomy. This has been more successful and interesting than I imagined. This resource is of use to any stakeholders working with data scientists or engineers, and to practitioners themselves.
  • I have a paper submitted to the EuroSciPy Proceedings – I’ll put a link up once it is published on the arXiv :)
  • I wrote up some notes on Make It Stick, an excellent book on successful learning.
  • I wrote up some Natural Language Processing and Text Analytics code based on a corpus I created from my Interviews. It is available here – pull requests and remarks are welcome.
  • A blast from the past: I wanted to draw attention to some code I wrote a few years ago, probably during my Master’s in Mathematics, on modelling Stochastic Processes and Options Pricing.

Exploring the new NumPy features: rewriting Python for Data Analysis

NumPy 1.10 adds support for the new Python @ operator (introduced in Python 3.5). This is for matrix multiplication and greatly simplifies some code. It also appeals to me as a maths geek because it makes it really easy to write down code based on what you read in a paper, which makes implementing a linear algebra model much simpler.

I had a look at the NumPy chapter in Python for Data Analysis and reimplemented some parts of it using the new operator. I found this a good learning experience.
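For a flavour of the difference, here is ordinary least squares written both ways – my own toy example rather than one from the book:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(100, 3)
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.randn(100)

# OLS estimate: beta = (X'X)^{-1} X'y
# Before: nested np.dot calls obscure the formula...
beta_dot = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))

# ...after: with @, the code reads like the maths in the paper.
beta_at = np.linalg.inv(X.T @ X) @ X.T @ y

assert np.allclose(beta_dot, beta_at)
```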

Marketing data with PyMC3

My friend Erik put up an example of conversion analysis with PyMC2 recently. I decided to reproduce this with PyMC3.

We want a good model, with uncertainty estimates, of how the various marketing channels perform.

I’ll restate his assumptions for the model and then show the gist.

Let’s make some assumptions about the model:

  1. The cost per transaction is an unknown with some prior (I just picked uniform)
  2. The expected number of transactions is the total budget divided by the (unknown) cost per transaction
  3. The actual observed number of transactions is a Poisson of the expected number of transactions
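
Here is a minimal sketch of those three assumptions in PyMC3 – the budgets and observed transaction counts below are made up for illustration, and the gist has the fuller version:

```python
import numpy as np
import pymc3 as pm

# Made-up figures for two marketing channels:
budget = np.array([1000.0, 2500.0])   # spend per channel
observed = np.array([60, 160])        # transactions per channel

with pm.Model() as model:
    # 1. Unknown cost per transaction, with a uniform prior
    cost = pm.Uniform("cost", lower=1, upper=100, shape=2)
    # 2. Expected number of transactions = budget / cost
    mu = budget / cost
    # 3. Observed transactions are Poisson around the expectation
    pm.Poisson("transactions", mu=mu, observed=observed)
    trace = pm.sample(2000)

pm.summary(trace)  # posterior of the cost per transaction, per channel
```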

Here we can see that it is possible to get a good model of a conversion analysis using MCMC.

I think that in the future I, like Erik, will use PyMC2 and PyMC3 more often for simple analyses like this. As I’ve said before, this is a powerful method for building a generative story that can be explained easily to stakeholders, and we can bring their ‘human intelligence’ into the model-building process – the head of marketing may know, for example, that the prior is not uniform.

I will definitely use it for some further funnel analysis – in particular when the number of data points is very small and the model is very complex. I’m keen to hear other examples of PyMC3 in the wild.

Interview with a Data Scientist: Brad Klingenberg


Brad Klingenberg is the Director of Styling Algorithms at Stitch Fix in San Francisco. His team uses data and algorithms to improve the selection of merchandise sent to clients. Prior to joining Stitch Fix Brad worked with data and predictive analytics at financial and technology companies. He studied applied mathematics at the University of Colorado at Boulder and earned his PhD in Statistics at Stanford University in 2012.


1. What project that you have worked on do you wish you could go back to, and do better?


Nearly everything! A common theme would be not taking the framing of a problem for granted. Even seemingly basic questions like how to measure success can have subtleties. As a concrete example, I work at Stitch Fix, an online personal styling service for women. One of the problems that we study is predicting the probability that a client will love an item that we select and send to her. I have definitely tricked myself in the past by trying to optimize a measure of prediction error like AUC.

This is trickier than it seems because there are some sources of variance that are not useful for making recommendations. For example, if I can predict the marginal probability that a given client will love any item then that model may give me a great AUC when making predictions over many clients, because some clients may be more likely to love things than others and the model will capture this. But if the model has no other information it will be useless for making recommendations because it doesn’t even depend on the item. Despite its AUC, such a model is therefore useless for ranking items for a given client. It is important to think carefully about what you are really measuring.
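
To make this concrete, here is a small simulation of my own (not Brad’s): a model that knows only each client’s marginal rate gets a respectable pooled AUC, yet is useless for ranking items within a client.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n_clients, n_items = 200, 50

# Each client has her own base rate of loving items...
base_rate = rng.beta(2, 5, size=n_clients)
# ...and every (client, item) outcome is drawn at that rate.
y = rng.binomial(1, base_rate[:, None], size=(n_clients, n_items))

# A purely 'marginal' model scores every item a client sees identically.
scores = np.tile(base_rate[:, None], (1, n_items))

# Pooled over all (client, item) pairs, the AUC looks respectable...
print("pooled AUC:", roc_auc_score(y.ravel(), scores.ravel()))

# ...but within any one client the ranking is uninformative: constant
# scores give an AUC of exactly 0.5.
per_client = [roc_auc_score(y[c], scores[c]) for c in range(n_clients)
              if 0 < y[c].sum() < n_items]  # AUC needs both classes
print("mean within-client AUC:", np.mean(per_client))
```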


2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?


Focus on learning the basic tools of applied statistics. It can be tempting to assume that more complicated means better, but you will be well-served by investing time in learning workhorse tools like basic inference, model selection and linear models with their modern extensions. It is very important to be practical. Start with simple things.

Learn enough computer science and software engineering to be able to get things done. Some tools and best practices from engineering, like careful version control, go a long way. Try to write clean, reusable code. Popular tools in R and Python are great for starting to work with data. Learn about convex optimization so you can fit your own models when you need to – it’s extremely useful to be able to cast statistical estimates as the solution to optimization problems.
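
A tiny illustration of that last point (my example, not Brad’s): even the humble sample median is the solution to an optimization problem, so a generic optimizer recovers it directly.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(0)
x = rng.lognormal(size=1001)

# The median minimizes the sum of absolute deviations, a convex
# objective, so a general-purpose scalar optimizer finds it.
result = minimize_scalar(lambda m: np.abs(x - m).sum())
print(result.x, np.median(x))  # the two agree closely
```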

Finally, try to get experience framing problems. Talk with colleagues about problems they are solving. What tools did they choose? Why? How did they measure success? Being comfortable with ambiguity and successfully framing problems is a great way to differentiate yourself. You will get better with experience – try to seek out opportunities.


3. What do you wish you knew earlier about being a data scientist?


I have always had trouble identifying as a data scientist – almost everything I do with data can be considered applied statistics or (very) basic software engineering. When starting my career I was worried that there must be something more to it – surely, there had to be some magic that I was missing. There’s not. There is no magic. A great majority of what an effective data scientist does comes back to the basic elements of looking at data, framing problems, and designing experiments. Very often the most important part is framing problems and choosing a reasonable model so that you can estimate its parameters or make inferences about them.


4. How do you respond when you hear the phrase ‘big data’?


I tend to lose interest. It’s a very over-used phrase. Perhaps more importantly I find it to be a poor proxy for problems that are interesting. It can be true that big data brings engineering challenges, but data science is generally made more interesting by having data with high information content rather than by sheer scale. Having lots of data does not necessarily mean that there are interesting questions to answer or that those answers will be important to your business or application. That said, there are some applications like computer vision where it can be important to have a very large amount of data.


5. What is the most exciting thing about your field?


While “big data” is overhyped, a positive side effect has been an increased awareness of the benefits of learning from data, especially in tech companies. The range of opportunities for data scientists today is very exciting. The abundance of opportunities makes it easier to be picky and to find the problems you are most excited to work on. An important aspect of this is to look in places you might not expect. I work at Stitch Fix, an online personal styling service for women. I never imagined working in women’s apparel, but due to the many interesting problems I get to work on it has been the most exciting work of my career.


6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?


As I mentioned previously, it can be helpful to start framing a problem by thinking about how you would measure success. This will often help you figure out what to focus on. You will also seldom go wrong by starting simple. Even if you eventually find that another approach is more effective a simple model can be a hugely helpful benchmark. This will also help you understand how well you can reasonably expect your ultimate approach to perform. In industry, it is not uncommon to find problems where (1) it is just not worth the effort to do more than something simple, or (2) no plausible method will do well enough to be considered successful. Of course, measuring these trade-offs depends on the context of your problem, but a quick pass with a simple model can often help you make an assessment.


7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?


It is usually better if you are not the first to evangelize the use of data. That said, data scientists will be most successful if they put themselves in situations where they have value to offer a business. Not all problems that are statistically interesting are important to a business. If you can deliver insights, products or predictions that have the potential to help the business then people will usually listen. Of course this is most effective when the data scientist clearly articulates the problem they are solving and what its impact will be.

The perceived importance of data science is also a critical aspect of choosing where to work – you should ask yourself if the company values what you will be working on and whether data science can really make it better. If this is the case then things will be much easier.


8. What is the most exciting thing you’ve been working on lately? Tell us a bit about it.


I lead the styling algorithms team at Stitch Fix. Among the problems we work on is making recommendations to our stylists, human experts who curate our recommendations for our clients. Making recommendations with humans in the loop is a fascinating problem because it introduces an extra layer of feedback – the selections made by our stylists. Combining this feedback with direct feedback from our clients to make better recommendations is an interesting and challenging problem.


9. What is the biggest challenge of leading a data science team?


Hiring and growing a team are constant challenges, not least because there is not much consensus around what data science even is. In my experience a successful data science team needs people with a variety of skills. Hiring people with a command of applied statistics fundamentals is a key element, but having enough engineering experience and domain knowledge can also be important. At Stitch Fix we are fortunate to partner with a very strong data platform team, and this enables us to handle the engineering work that comes with taking on ever more ambitious problems.

Interview with a Data Scientist: Alice Zheng

I recently caught up with Alice Zheng, a Director of Data Science at Dato, a company providing tooling to help you build scalable machine learning models easily. Alice is an expert on building scalable Machine Learning models and a keen advocate of encouraging women in Machine Learning and Computer Science. She has a PhD from UC Berkeley and spent some of her postdoc years at Microsoft Research in Redmond. She is currently based in Washington State in the US.

1. What project that you have worked on do you wish you could go back to, and do better?
Too many! The top of the list is probably my PhD thesis. I collaborated with folks in software engineering research and we proposed a new way of using statistics to debug software. They instrumented programs to spit out logs for each run that provide statistics on the state of various program variables. I came up with an algorithm to cluster the failed runs and the variables. The algorithm identifies variables that are most correlated with each subset of failures. Those variables, in turn, can take the programmer very close to the location of the bug in the code.
It was a really fun project. But I’m not happy with the way that I solved the problem. For one thing, the algorithm that I came up with had no theoretical guarantees. I did not appreciate theory when I was younger. But nowadays, I’m starting to feel bad about the lack of rigor in my own work. It’s too easy in machine learning to come up with something that seems to work, maybe even have an intuitive explanation for why it makes sense, and yet not be able to write down a mathematical formula for what the algorithm is actually doing.
Another thing that I wish I had learned earlier is to respect the data more. In machine learning research, the emphasis is on new algorithms and models. But solving real data science problems requires having the right data, developing the right features, and finally using the right model. Most of the time, new algorithms and methods are not needed. But a combination of data, features, and model is the key. I wish I’d realized this earlier and spent less time focusing on just one aspect of the whole pipeline.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Be curious. Go deep. And study the arts.
Being curious gives you breadth. Knowing about other fields pulls you out of a narrow mindset focused on just one area of study. Your work will be more inspired, because you are drawing upon diverse sources of information.
Going deep into a subject gives you depth and expertise, so that you can make the right choices when trying to solve a problem, and so that you might more adequately assess the pros and cons of each approach.
Why study the arts? Well, if I had my druthers, art, music, literature, mathematics, statistics, and computer science would be required courses for K-12. They offer completely different ways of understanding the world. They are complementary to each other. Knowing more than one way to see the world makes us more whole as human beings. Science _is_ an art form. Analytics is about problem solving, and it requires a lot of creativity and inspiration. It’s art in a different form.

3. What do you wish you knew earlier about being a data scientist?
Hmm, probably just what I said above – respect the data. Look at it in all different ways. Understand what it means. Data is the first-class citizen. Algorithms and models are just helpers. Also, tools are important. Finding and learning to use good tools will save a lot of time down the line.

4. How do you respond when you hear the phrase ‘big data’?
Cringe? Although these days I’ve become de-sensitized. :)
I think a common misconception about “big data” is that bigger means more useful: while the total amount of data may be big, the amount of _useful_ data is often very small in comparison. People might have a lot of data that has nothing to do with the questions they want to answer. After the initial stages of data cleaning and pruning, the data often becomes much, much smaller. Not big at all.

5. What is the most exciting thing about your field?
So much data is being collected these days, and machine learning is being used to analyze it and draw actionable insights. It is being used not just to understand static patterns but to predict things that have not yet happened: predicting what items someone is likely to buy, which customers are likely to churn, detecting financial fraud, finding anomalous patterns, finding relevant documents or images on the web. These applications are changing the way people do business, find information, entertain and socialize, and so much of it is powered by machine learning. So it has great practical use.
For me, an extra exciting part of it is to witness applied mathematics at work. Data presents different aspects of reality, and my job as a machine learning practitioner is to piece them together, using math. It is often treacherous and difficult. The saying goes “Lies, damned lies, and statistics.” It’s completely true; I often arrive at false conclusions and have to start over again. But it is so cool when I’m able to peel away the noise and get a glimpse of the underlying “truth.” When I’m getting nowhere, it’s frustrating. But when I get somewhere, it’s absolutely beautiful and gratifying.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Oh! I know the answer to this question: before embarking on a project, always think about “what will success look like? How would I be able to measure it?” This is a great lesson that I learned from mentors at Microsoft Research. It’s saved me from many a dead end. It’s easy to get excited about a new endeavor and all the cool things you’ll get to try out along the way. But if you don’t set a metric and a goal beforehand, you’ll never know when to stop, and eventually the project will peter out. If your goal IS to learn a new tool or try out a new method, then it’s fine to just explore. But with more serious work, it’s crucial to think about evaluation metrics up front.

7. You spent some time at other firms before Dato. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
I think this is a continuous learning experience. Every organization is different, and it’s incredible how much of a leader’s personality gets imprinted upon the whole organization.  I’m fascinated by the art and science behind creating successful organizations. Having been through a couple of very different companies makes me more aware of the differences between them. It’s very much like traveling to a different country: you realize that many of the things you took for granted do not actually need to be so. It makes me appreciate diversity. I also learn more about myself, about what works and what doesn’t work for me.
How to manage cultural challenges? I think the answer to that is not so different between work and life. No matter what the circumstance, we always have the freedom and the responsibility to choose who we want to be. How I work is a reflection of who I am. Being in a new environment can be challenging, but it can also be good. Challenge gets us out of our old patterns and demands that we grow into a new way of being. For me, it’s helpful to keep coming back to the knowledge of who I am, and who I want to be. When faced with a conflict, it’s important to both speak up and to listen. Speaking up (respectfully) affirms what is true for us. Listening is all about trying to see the other person’s perspective. It sounds easy but can be very difficult, especially in high stress situations where both sides hold to their own perspective. But as long as there’s communication, and with enough patience and skill, it’s possible to understand the other side. Once that happens, things are much easier to resolve.

8. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
I point to all the successful examples of data science today. With successful companies like Amazon, Google, Netflix, Uber, AirBnB, etc. leading the way, it’s not difficult to convince people that data science is useful. A lot of people are curious and need to learn more before they make the jump. Others may have already bought into it but just don’t have the resources to invest in it yet. The market is not short on demand. It is short on supply: data scientists, good tools, and knowledge. It’s a great time to be part of this ecosystem!