Talks and Workshops


I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my Mathematics Masters I regularly gave talks on technical topics, and before that I worked as a teacher at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional analyst!


PyCon Ireland in October – details TBA

Slides and Videos from Past Events

EuroSciPy 2015: I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended, more hands-on version of the above talk.

I spoke at PyData Berlin – the link is here.

The blurb for my PyData Berlin talk is below.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in Quantitative Finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to studying the problem of ‘rugby sports analytics’, particularly how to model the winning team in the recent Six Nations. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Berlin talk at the Data Science Meetup, on ‘Probabilistic Programming and Rugby Analytics’ – where I presented a case study and an introduction to Bayesian statistics to a technical audience. My case study was the problem of how to predict the winner of the Six Nations. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular blog post, which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great way of presenting this technical material.
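For the curious, here is a minimal sketch in the spirit of that model – PyMC3 syntax, with made-up fixtures and scores; the model in the blog post is richer:

import numpy as np
import pymc3 as pm

# Hypothetical fixtures: home/away team indices and points scored by the home side
home_team = np.array([0, 1, 2, 3, 4, 5])
away_team = np.array([5, 4, 3, 2, 1, 0])
home_points = np.array([23, 16, 19, 30, 13, 21])
n_teams = 6

with pm.Model():
    # Latent attacking and defensive strengths for each team
    atts = pm.Normal("atts", mu=0.0, sd=1.0, shape=n_teams)
    defs = pm.Normal("defs", mu=0.0, sd=1.0, shape=n_teams)
    home_adv = pm.Normal("home_adv", mu=0.0, sd=1.0)

    # Expected points are log-linear in the strengths; scoring is Poisson
    theta = pm.math.exp(2.5 + home_adv + atts[home_team] - defs[away_team])
    pm.Poisson("obs", mu=theta, observed=home_points)

    trace = pm.sample(1000, tune=1000)

# Posterior mean attack strengths give a ranking of the teams
print(trace["atts"].mean(axis=0))

Simulating future fixtures is then just a matter of drawing Poisson scores from the sampled strengths.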

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and tech accelerator. This was an introductory talk to a business audience on ‘Data Science and your business’. I talked about my experience at small and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. The aim of this talk was to explain what a ‘data product’ is, and to discuss some of the challenges of getting data science models into production code. I also talked about the tool choices I made in my own case study. It was well received – high level, with a great response from the audience – and I gave a version of it at PyCon in Florence in April 2015. Edit: those interested can see the video here; it was a really interesting talk to give, and the questions were fascinating.

In July 2014, when I was a freelance consultant in the Benelux, I gave a private five-minute talk on Data Science in the games industry. Here are the slides.

My mathematical research and talks as a Masters student are all here. I specialized in Statistics and Concentration of Measure. It was this research that got me interested in Machine Learning and Bayesian models.


My Masters thesis, ‘Concentration Inequalities and some applications to Statistical Learning Theory’, is an introduction to Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.
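For readers unfamiliar with the area, the flavour of result the thesis builds on is a concentration inequality such as Hoeffding’s: for independent random variables $X_1, \dots, X_n$ with each $X_i \in [a_i, b_i]$,

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]\right| \ge t\right) \le 2\exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).$$

Bounds of this kind, combined with VC theory’s control of the effective complexity of a model class, are what let you turn a model’s empirical error into a probabilistic guarantee on its generalization error.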

Interview with a Data Scientist: Erik Bernhardsson

As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson, who is well known in the world of ‘Big Data’ for his open source contributions, his leadership of teams at Spotify, and his talks at various conferences.

1. What project that you have worked on do you wish you could go back to, and do better?
Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time doing something that turns out to be nothing. But research is that way. Learning stuff is what matters, and kind of by definition you have to do stupid shit before you’ve learned it. Sorry for a super unclear answer :)
The main thing I did wrong for many years was I built all this cool stuff but never really made it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
Write a ton of code. Don’t watch TV :)
I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.
Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that motivation by focusing on it for 80% of your waking time.
I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in the industry don’t care about, and things that don’t necessarily correlate with a successful career. It’s easy to fall down a rabbit hole and become extremely good at, say, deep learning (or anything), but at a company that means you’re just some expert who will have a hard time getting impact beyond your field. Looking back on my own situation, I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to the last question).
3. What do you wish you knew earlier about being a data scientist?
I don’t consider myself a data scientist so not sure :)
There’s a lot of definitions floating around about what a data scientist does. I have had this theory for a long time but just ran into a blog post the other day:
I think it summarizes my own impression pretty well. There’s two camps, one is the “business insights” side, one is the “production ML engineer” side. I managed teams at Spotify on both sides. It’s very different.
If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think with ML you can generate these super interesting insights about a business but in my experience it’s very rare. Sometimes we had people coming in writing a master’s thesis about churn prediction and you can get a really high AUC but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.
For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s an “ML research team” and an “implementation team” with a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x longer and incentives just get misaligned. So I think anyone who wants to build cool ML algos should also learn backend and data engineering.
4. How do you respond when you hear the phrase ‘big data’?
Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of RAM for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There are a lot of those problems in the industry for sure. Unfortunately sampling doesn’t always work.
The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10,000 daily jobs and things rarely crash – especially if you use Luigi :)
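(For context: Luigi structures a pipeline as small tasks that declare their dependencies and outputs, so a failed job can be retried without rerunning everything upstream. A minimal sketch – the task names and paths here are hypothetical:

import datetime
import luigi

class FetchLogs(luigi.Task):
    """Stand-in for downloading one day of raw logs."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/{self.date}.log")

    def run(self):
        with self.output().open("w") as f:
            f.write("user=1 plays=song_a\nuser=2 plays=song_b\n")

class CountPlays(luigi.Task):
    """Aggregates the raw logs; only runs once FetchLogs has produced its output."""
    date = luigi.DateParameter()

    def requires(self):
        return FetchLogs(self.date)

    def output(self):
        return luigi.LocalTarget(f"data/counts/{self.date}.txt")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(f"{sum(1 for _ in fin)} plays\n")

if __name__ == "__main__":
    luigi.build([CountPlays(date=datetime.date(2015, 1, 1))], local_scheduler=True)

Because each task checks whether its output already exists, rerunning the pipeline only redoes the missing pieces.)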
But I’m sure there’s a fair amount of snake oil Hadoop consultants who convince innocent teams they need it.
The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.
5. What is the most exciting thing about your field?
Boring answer but I do think the progress in deep learning has been extremely exciting. Seems like every week there’s new cool applications.
I think even more useful is how tools and platforms are maturing. A few years ago every company wrote their own dashboards, A/B test infrastructure, log synchronization, workflow management, etc. It’s great that there’s more open source projects and that more useful tools are emerging.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?
Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.
It’s good enough when the opportunity cost outweighs the benefit :) I.e. the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog full of 100s of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer but prioritization is probably the hardest problem to solve and it really just boils down to having some rules of thumb.
How Erik describes himself: I like to work with smart people and deliver great software. After 5+ years at Spotify, I just left for an exciting new startup in NYC where I am leading the engineering team.

At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many of the large scale machine learning algorithms used to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, a workflow manager in Python with 3,000+ stars on Github – used by Foursquare, Quora, Stripe, Asana, etc.

When I was younger I participated in lots of programming competitions. My team was five times Nordic champions in programming (2003-2010) and I have an IOI gold medal (2003).

Interview with a Data Scientist: Rosaria Silipo

As part of my Interview with Data Scientists project I recently caught up with Rosaria – who is an active member of the Data Mining community.

Bio: Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing.

She is currently based in Zurich (Switzerland).

1. What project that you have worked on do you wish you could go back to, and do better?

There is no such thing as the perfect project! However close you get to perfection, at some point you need to stop, either because the time is up, or because the money is gone, or because you just need to ship a productive solution. I am sure I could go back to all my past projects and find something to improve in each of them!

This is actually one of the biggest issues in a data analytics project: when do we stop? Of course, you need to identify some basic deliverables in the project’s initial phase, without which the project is not satisfactorily completed.

But once you have passed these deliverable milestones, when do you stop?
What is the right compromise between perfection and resource investment?

In addition, every few years some new technology becomes available which could help re-engineer your old projects, for speed or accuracy or both. So even the most perfect project solution can surely be improved a few years later thanks to new technologies. This is, for example, the case with the new big data platforms. Most of my old projects would now benefit from a big data speed-up: faster training and deployment of old models, more complex data analytics models, and better optimization of model parameters.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Use your time to learn! Data Science is a relatively new discipline that combines old knowledge, such as statistics and machine learning, with newer wisdom, like big data platforms and parallel computation. Not many people know everything here, really! So take your time to learn what you do not know yet from the experts in that area.

Combining a few different pieces of data science knowledge probably already makes you unique in the data science landscape. The more pieces of different knowledge, the bigger your advantage in the data science ecosystem!

One way to get easy hands-on experience across a range of application fields is to explore the Kaggle challenges.

Kaggle has a number of interesting challenges up every month, and who knows, you might also win some money!

3. What do you wish you knew earlier about being a data scientist?

This answer is related to the previous one, since my advice to young data scientists sprouts from my earlier experience and failures. My early background is in machine learning. So, when I took my first steps in the data science world many years ago, I thought that knowledge of machine learning algorithms was all I needed. I wish! I had to learn that data science is the sum of many different skills, including data collection, data cleaning, and data transformation. The last of these, for example, is highly underestimated! In all data science projects I have seen (not only mine), the data processing part takes way more than 50% of the resources!

The sum also includes data visualization and data presentation. A brilliant solution is worth nothing if the executives and stakeholders cannot understand the results through a clear and compact representation! And so on. I guess I wish I had taken more time early on to learn from colleagues with a different set of skills than mine.

4. How do you respond when you hear the phrase ‘big data’?

Do you really need big data? Sometimes customers ask for a big data platform just because. Then, when you investigate more deeply, you realize that they really do not have, and do not want to have, such a big amount of data to take care of every day. A nice traditional DWH (Data Warehouse) solution is definitely enough for them.

Sometimes, though, a big data solution is really needed, or at least it will be needed in the near future.

5. What is the most exciting thing about your field?

Probably the variety of applications. The whole knowledge of data collection, data warehousing, data analytics, data visualization, and results inspection and presentation is transversal to a number of application fields. You would be surprised at how many different applications can be designed using a variation of the same data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?

I always propose a first pilot/investigation mini-project at the very beginning. This is for me to get a better idea of the application specs, of the data set, and, yes, of the customer too. This is a crucial phase, though a short one. During it I can take the measure of the project in terms of the time and resources needed, and the customer and I can study each other and adjust our expectations about input data and final results. This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to produce the requested results.

Once this part is successful and expectations have been adjusted on both sides, the real project can start.

7. You spent some time as a consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Ah … I am really not a very good example for dealing with stakeholders and executives and successfully managing cultural challenges! Usually I rely on external collaborators to handle this part for me, partly because of time constraints.

I see myself as a technical professional, with little time for talking and convincing. Unfortunately so, because this is a big part of every data analytics project.

However, when I have to deal with it myself, I let the facts speak for me: final or intermediate results of current and past projects. This is the easiest way to convince stakeholders that the project is worth the time and the money. Just in case, though, I always have at hand a set of slides on previous accomplishments to present to executives if and when needed.

8. Tell us about something cool you’ve been doing in Data Science lately.

My latest project was about anomaly detection in industry. I found it a very interesting problem to solve, where skills and expertise have to meet creativity. In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to let happen. What you have is a data set of records of the normal functioning of the machine, transactions, system, or whatever it is you are observing. The challenge then is to predict anomalies before they happen and without previous historical examples. That is where the creativity comes in. Traditional machine learning algorithms need a twist in their application to provide an adequate solution to this problem.
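One common way to operationalise this – not necessarily the twist Rosaria used – is one-class learning: fit a model on normal records only and flag anything that looks unlike them. A toy sketch with scikit-learn and invented sensor data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical readings (temperature, vibration) from normal machine operation
normal = rng.normal(loc=(50.0, 1.2), scale=(2.0, 0.1), size=(1000, 2))

# Train on normal data only; no labelled anomalies are required
clf = IsolationForest(contamination=0.01, random_state=0).fit(normal)

new_readings = np.array([[51.0, 1.25],   # close to the training data
                         [70.0, 0.4]])   # far from anything seen before
print(clf.predict(new_readings))  # 1 = normal, -1 = flagged as anomalous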

Interview with a Data Scientist: Erin Shellman

I recently caught up with Erin for an interview. Her interview is full of hard-earned advice, and her final answer on Data Governance is gold!
Erin writes some great posts on her blog, which I recommend. Erin is a programmer and statistician working as a research scientist at Amazon Web Services. Before that she was a Data Scientist in the Nordstrom Data Lab, where she primarily built product recommendations. She mostly codes in Scala, Python and R, but dabbles in Javascript to put data on the internet. Erin loves to teach and speak, and does both often through talks, as co-organizer of PyLadies-Seattle, and as an instructor at the University of Washington’s Professional and Continuing Education program.
1. What project that you have worked on do you wish you could go back to, and do better?
Often the goal of data science projects is to automate processes with data – I worked on a lot of projects at Nordstrom with that goal. I think we were pretty naive in those pursuits, often approaching the problems with low empathy and EQ (Emotional Quotient). We built tools expecting that the teams we were trying to automate would immediately see the value and jump to use them, but we didn’t spend a lot of time listening and trying to understand why some might be hesitant to adopt our tools. Eventually, I started training people and specifically asking them to send bug reports or feature requests. The trainings opened up a dialog about our plans and made the other teams more invested, because they could see when their bugs were fixed and their features implemented. I learned that doing the data work is only half (or less) of the challenge; the other half is advocating for your work in such a way that others are similarly compelled.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
If you’re in school right now, use this time to master a programming language (you have more time than you ever will again despite what you may believe). For data science, I’d recommend Python, R or Scala (and if you had to choose one, Python). You absolutely need to be able to produce high-quality code before you walk in the door because chances are you’ll be asked to code early in the interview process.
I also think you shouldn’t spend too much time “training” and learning in your free time; it’s nearly impossible to retain knowledge that way. Instead, spend all your time shoring up the essentials and work on getting a job immediately. You’ll learn so much more on the job than you could ever hope to on your own, plus you’ll be paid. Don’t wait for postings for junior data scientists (I don’t know that I’ve ever even seen one); contact employers you’re interested in working with directly and ask them to make that role for you. You should look for places where you know there’s a solid data team already, so you have plenty of people to learn from. Academics tend to have a sort of learned helplessness because they’re so often not in control of their work or careers. This is not the case in industry: if you want something, don’t wait for it to come to you (it won’t). Be an active participant in your future.
3. What do you wish you knew earlier about being a data scientist?
I wish I had spent more time in grad school learning computer science. Often DS (Data Science) jobs end up being almost the same as CS (Computer Science) jobs, and in my case I had to pick up a lot of CS skills on the job.
4. How do you respond when you hear the phrase ‘big data’?
Usually by rolling my eyes so far into the back of my head that they get stuck. I think the return on investment of Any Data is still higher than that of Big Data. Most shops who’re convinced that they need big data technology don’t make use of the data they have already, and adding more data to the pile won’t help the cause.
5. What is the most exciting thing about your field?
The most exciting thing is that I get to learn for a living. Every time I switch jobs or work on something new I have to learn a ton, different technologies and languages, different domains, and different businesses. I especially love that data science is often so close to the business. I love learning about what makes a business successful and providing knowledge to help businesses make better decisions.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?
When I’m approaching a new problem I focus really hard on the inputs and outputs, particularly the output. What exactly are you trying to produce, or trying to answer? This is often a question I pose to business stakeholders to encourage them to think critically about what they really want to know, how it will be applied, and how to formally articulate it. Basically what I encourage them to do is state a formal hypothesis and the observations required to test that hypothesis. Once we’ve all agreed on the output, what are the inputs? I try to make this as specific as possible, so no “customer data”-level descriptions. Tell me exactly what the inputs are, e.g. annual customer spend, age, and zip code. The more you can reason through the solution in terms of inputs and outputs before you set out to solve the problem, the less likely it is that you’re halfway to answering a question that was ill-posed (I promise, this is 90% of requests), or that you don’t have the data to support (this is probably another 5% of requests). It’s also a good way to prevent “stakeholder punting”, which is a phrase I made up just now to describe when stakeholders make half-baked requests and then leave them for you to sort out. Data science and research is highly collaborative, and the data scientist shouldn’t be the only one invested in the work.
Once the inputs and outputs are defined, I like to draw flowcharts of the path to completion, and it’s usually easier to start from the bottom. Here’s an example I created for the students in my data mining course. They were working on prediction of a continuous outcome with various regression methods. First we decided on a criterion for model selection, which in this case was the lowest root mean squared error. You can see that the input is a data file, and the output is whichever model had the best predictive accuracy as measured by the lowest RMSE (Root Mean Square Error). For me, diagramming your work like this makes your goal completely concrete.
[Flowchart: candidate regression models fitted to the input data file, compared on held-out data, with the lowest-RMSE model selected as the output.]
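The flowchart’s logic is easy to sketch in code: fit each candidate regression, score it on held-out data, and keep the model with the lowest RMSE. The data and model list below are stand-ins for the course exercise:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in for the input data file
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}
rmse = {}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    rmse[name] = np.sqrt(mean_squared_error(y_test, preds))

best = min(rmse, key=rmse.get)  # the output: the model with the lowest RMSE
print(rmse, "-> selected:", best)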
The other really great thing about framing problems this way is that it makes it very easy to estimate effort and communicate to others what is required to complete the project. For whatever reason, people often assume that while software engineers need 2 weeks to add a minor feature, data scientists need about 6 hours to do complete analyses and make beautiful visualizations. Communicating the amount of work required to complete projects to the requesters is crucial in data science, because most people just don’t know. It’s not something software engineers typically have to do, but providing guidance on the components of a data science project to your stakeholders will reduce your stress in the long run.
7. What does data governance or data quality mean to you as a data scientist?
Data governance is the collection of processes and protocols to which an organization conforms to ensure data accuracy and integrity. Most of the time I’m a data consumer, so I depend on a mature data infrastructure team to create the pipelines I use to collect and analyze data. When I was working on recommendations at Nordstrom, I was both a consumer and a provider. I provided data in the sense that the output of my recommendation algorithms was data consumed by the web team. Data governance in that context meant writing lots of unit tests to make sure the results of my computations produced correctly formatted entries. It also meant applying business rules, for example removing entries for products out of stock, or applying brand restrictions.
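As a flavour of what such tests look like, here is a hypothetical business-rule filter with a unit test over the output entries – the field names and rules are invented for illustration:

def apply_business_rules(recs, in_stock, restricted_brands):
    """Drop recommendations for out-of-stock products or restricted brands."""
    return [r for r in recs
            if r["sku"] in in_stock and r["brand"] not in restricted_brands]

def test_entries_are_well_formed_and_rules_applied():
    recs = [{"sku": "A1", "brand": "acme", "score": 0.9},
            {"sku": "B2", "brand": "forbidden_co", "score": 0.8},
            {"sku": "C3", "brand": "acme", "score": 0.7}]  # C3 is out of stock
    out = apply_business_rules(recs, in_stock={"A1", "B2"},
                               restricted_brands={"forbidden_co"})
    # Schema check: every entry handed to the web team has the expected fields
    assert all({"sku", "brand", "score"} <= set(r) for r in out)
    # Business rules: nothing restricted, nothing out of stock
    assert out == [{"sku": "A1", "brand": "acme", "score": 0.9}]

test_entries_are_well_formed_and_rules_applied()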

Interviews with Data Scientists: David J. Hand

As part of my Data Science interview series I recently reached out to David J. Hand.

David has an impressive biography and has contributed a lot to fraud detection and data mining. His answers are insightful and come from a statistical point of view. I feel that academics like David have a lot to teach us practicing data scientists.

1. What project that you have worked on do you wish you could go back to, and do better?

I think I always have this feeling about most of the things I have worked on – that, had I been able to spend more time on it, I could have done better. Unfortunately, there are so many things crying out for one’s attention that one has to do the best one can in the time available. Quality of projects probably also has a diminishing-returns aspect: spend another day/week/year on a project and you reduce the gap between its current quality and perfection by a half, which means you never achieve perfection.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I generally advise PhD students to find a project which interests them, which is solvable or on which significant headway can be made in the time they have available, and which other people (but not too many) care about. That last point means that others will be interested in the results you get, while the qualification means that there are not thousands of others working on the same problem (because then you would probably be pipped to the post).

3. What do you wish you knew earlier about being a statistician? What do you think industrial data scientists have to learn from this?

I think it is important that people recognise that statistics is not a branch of mathematics. Certainly statistics is a mathematical discipline, but so are engineering, physics, and surveying, and we don’t regard them as parts of mathematics. To be a competent professional statistician one needs to understand the mathematics underlying the tools, but one also needs to understand something about the area in which one is applying those tools. And then there are other aspects: it may be necessary, for example, to use a suboptimal method if this means that others can understand and buy in to what you have done. Industrial data scientists need to recognise that the fundamental aim of a data scientist is to solve a problem, and to do this one should adopt the best approach for the job, be it a significance test, a likelihood function, or a Bayesian analysis. Data scientists must be pragmatic, not dogmatic. But I’m sure that most practicing data scientists do recognise this.

4. How do you respond when you hear the phrase ‘big data’?

Probably a resigned sigh. ‘Big data’ is proclaimed as the answer to humanity’s problems. However, while it’s true that large data sets, a consequence of modern data capture technologies, do hold great promise for interesting and valuable advances, we should not fail to recognise that they also come with considerable technical challenges. The easiest of these lie in the data manipulation aspects of data science (the searching, sorting, and matching of large sets), while the toughest lie in the essentially statistical inferential aspects. The notion that one nowadays has ‘all’ of the data for any particular context is seldom true or relevant. And big data comes with all the data quality challenges of small data, along with new challenges of its own.

5. What is the most exciting thing about your field?

Where to begin! The eminent statistician John Tukey once said ‘the great thing about statistics is that you get to play in everyone’s back yard’, meaning that statisticians can work in medicine, physics, government, economics, finance, education, and so on. The point is that data are evidence, and to extract meaning, information, and knowledge from data you need statistics. The world truly is the statistician’s oyster.

6. Do you feel universities will have to adapt to ‘data science’? What do you think will have to be done in, say, mathematical education to keep up with these trends?

Yes, and you can see that this is happening, with many universities establishing data science courses. Data science is mostly statistics, but with a leavening of relevant parts of computer science – some knowledge of databases, search algorithms, matching methods, parallel processing, and so on.


Professor David J. Hand

Imperial College, London

Bio: David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 26 books. He has broad research interests in areas including classification, data mining, anomaly detection, and the foundations of statistics. His applications interests include psychology, physics, and the retail credit industry – he and his research group won the 2012 Credit Collections and Risk Award for Contributions to the Credit Industry. He was made OBE for services to research and innovation in 2013.

Interview with a Data Scientist: Peadar Coyle

Peadar Coyle is a Data Analytics professional based in Luxembourg. His intellectual background is in Mathematics and Physics, and he currently works for Vodafone in one of their Supply Chain teams.

He is passionate about data science and is the lead author of this project. He also contributes to Open Source projects and speaks at EuroSciPy, PyData and PyCon.

His expertise is largely in the statistical side of Data Science.

Several of his interviewees asked Peadar to share his own interview, so here it is – offered humbly.

1. What project that you have worked on do you wish you could go back to, and do better?

I agree that it is better to look forward rather than look backward. And my skills have frankly improved since I first started doing what we could call professional data analysis (which was probably just before starting my Masters a few years ago).

One project which springs to mind (no names) is one where there was a huge breakdown in communication and misaligned incentives. There needed to be more communication on that project, and it overran its initially allotted time. I also didn’t spend enough time up front communicating the risks and opportunities to the stakeholders.

The data was a lot messier than expected, and management had committed to delivering results in two weeks. This was impossible: the data cleaning and exploration phase alone took longer than that. Now I would focus on quicker wins. I also rushed to the ‘modelling’ phase without really understanding the data. I think terms like ‘understanding the data’ sound a bit academic to some stakeholders, but you need to clearly and articulately explain how important the data generation process is, and the uncertainty in that data.

Some of this comes with experience – now I focus on adding value as quickly as possible and keeping things simple. Back then I fell for the siren call of ‘do more analysis’ rather than thinking about how the analysis would be conveyed.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD but I have recently been giving advice to people in that situation.

My advice is to build a portfolio of work if possible, or at least to work through an online course on Machine Learning or something cool like that.

The PyData videos are a good place to start too. I’d also recommend taking outreach or communication skills courses if you can – many universities around the world offer them, and they’ll help you understand the needs of others.

I think, frankly, that the most important skill for a Data Scientist is the ‘tactical application of empathy’, and that is something that working in a team really helps you develop. One thing I feel my Masters was short on – as is common in Pure Mathematics – was experience of working in a team.

3. What do you wish you knew earlier about being a data scientist?

The importance of communication skills, and the need to add value every day. Also the fact that a budget or a project can be terminated at any moment.

Adding value every day means showing results and sharing them, talking to people about stuff. Share visualizations, and share results – a lot of data science is about relationships and empathy. In fact I think that the tactical application of empathy is the greatest skill of our times.

You need to get out there and speak to the domain specialist, and understand what they understand. I believe that the best algorithms incorporate human as well as machine intelligence.

4. How do you respond when you hear the phrase ‘big data’?

I too like the distinction between small, medium and big data. I don’t worry so much about the terminology, and I focus on understanding exactly what my stakeholder wants from it.

I think, though, that it is often a distraction. I did one proof of concept as a consultant that was an operational disaster. We didn’t have the resources to support a dev ops culture, nor did we have the capabilities to support a Hadoop cluster. Even worse, the problem could really have been solved more intelligently in RAM. But I got excited by the new tools without understanding what they were really for.

I think this is a challenge: part of maturing as an engineer/data scientist is appreciating the limits of tools and avoiding the hype. Most companies don’t need a cluster, and the mean size of a cluster will remain one for a long time. Don’t believe the salesmen – ask the experts in your community about what is needed.

In short: I do feel it is strongly misleading but it is certainly here to stay.

5. How did you end up being a data analyst? What is the most exciting thing about your field?

My academic and professional career has taken a bit of a weird path. I started at Bristol in a Physics and Philosophy program. It was a really exciting time, and I learned a lot (some of it non-academic). I went into that program because I wanted to learn everything. At various points – especially in 2009–2010 – the terminology of ‘data science’ began to pick up, so when I went to grad school in 2010 I was ‘aware’ of the discipline. I took a lot of financial maths classes at Luxembourg, just to keep that option open, yet in my heart I still wanted to be an academic.

After some soul searching I eventually realized that academic opportunities were going to be too difficult to get, and that I could earn more in industry. So I did a few industrial internships and, towards the end of my Masters, a six-month internship at a ‘small’ e-commerce company.

I learned a lot there, and it was where I realized I needed to work a lot harder on my software engineering skills. I’ve been working on them in my working life, through contributing to open source software, and through my various speaking engagements. I strongly recommend any wannabe data geeks to come to these events and share their own knowledge :)

The most exciting thing about my field relates to the first statement about physics and philosophy – we truly are drowning in data, and with the computational resources we now have, we really do have the ability to answer or simulate certain questions in a business context. The web is a microscope, and your ERP system tells you more about your business than you can actually imagine – I’m very excited to help companies exploit their data.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?

I like the OSEMIC framework (which I developed myself) and the CoNVO framework (which comes from Thinking with Data by Max Shron – I recommend his intro video and the book itself).

Let me explain: at the beginning of an ‘engagement’ I look for the Context, Need, Vision and Outcome of the project. Outcome means the delivery, and asking these questions in a conversation with stakeholders is a really good way to get to the heart of the ‘business problem’.

A lot of this after a few years in the business still feels like an art rather than a science.

I like explaining to people the Data Science process – obtain data, scrub data, explore, model, interpret and communicate.

I think a lot of people get these kinds of notions, and a lot of my conversations at work recently have been about data quality – and data quality really needs domain knowledge. It is amazing how easy it is to misinterpret a number, especially around things like unit conversions.

7. You spent some time as a consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

I would echo a lot of what I said above. One challenge is that some places aren’t ready for a data scientist, nor do they know how to use one. I would avoid such places and look for work elsewhere.

Some of this is a lack of vision, and one reason I do a lot of talks is ‘educated selling’: spreading the gospel of data-informed decision making, and showing how new tools such as the PyData stack and R are helping us extract more and more value from data.

I’ve also found that visualizations help a lot, humans react to stories and pictures more than to numbers.

My advice to new starters is to over-communicate and to learn some soft skills. The frameworks I mentioned help a bit in structuring and explaining a project to stakeholders. I also recommend reading this interview series – I learned a lot from it too :)

Interview with a Data Scientist: Ian Ozsvald

Ian Ozsvald is a Data Scientist based in London. He’s a friend and an inspiration to all us data geeks. He’s a co-organizer of PyData in London and speaks a lot on the data science circuit. He’s also very tall :)

I include a bio at the bottom.

1. What project that you have worked on do you wish you could go back to, and do better?
My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project, and the client provided representative data according to the specification that I’d laid out. I built a set of classifiers that performed as well as a human, and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn’t solve the task on this new, very messy data, so I couldn’t imagine how the machine would solve it. The client explained that they had wanted the first task to succeed, so they gave me the best data they could find; since we’d solved that problem, now I could work on the harder stuff. I had tried my best to explain the requirements of the derisking task, but I fear I didn’t explain deeply enough why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the needs of a derisking phase.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years experience in each industrial domain you’ll work in. None of this however is realistic. Instead focus on some areas that interest you and that pay well-enough and deepen your skills so that you’re valuable. Next go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high quality free tools. For me I speak, teach and keynote at conferences like PyDatas, PyCons, EuroSciPys and EuroPythons around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

3. What do you wish you knew earlier about being a data scientist?
I wish I’d known how much I’d regret not paying attention in my statistics and linear algebra classes! I also wish I’d appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually; they don’t work well from lists of numbers.
4. How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem, and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine, and probably represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures, and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop, and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python, and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
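As a small illustration of the sparse-in-place-of-dense point (all sizes made up): building a user-by-item matrix from its non-zero coordinates keeps memory proportional to the number of observations rather than the size of the full grid.

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_users, n_items, n_obs = 1_000_000, 50_000, 5_000_000

rows = rng.integers(0, n_users, n_obs)
cols = rng.integers(0, n_items, n_obs)
vals = np.ones(n_obs, dtype=np.float32)

# CSR stores only the non-zero entries plus two index arrays
mat = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

sparse_mb = (mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes) / 1e6
dense_mb = n_users * n_items * 4 / 1e6  # what a dense float32 array would need
print(f"sparse: ~{sparse_mb:.0f} MB vs dense: ~{dense_mb:.0f} MB")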

5. What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity, I strongly suspect that we can make this task machine-powered using some supervised approaches so less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough, but first you have to derisk it, and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this, but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value – this helps everyone stay confident when you hit the inevitable problems.

7. You spent some time as a consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you’re doing. Make bits of it easy to understand, explain why it is valuable, and people will buy into it. Don’t hide behind your models; instead, speak to domain experts, learn about their expertise, and use your models to back up and automate their judgement – you’ll want them on your side.
8. You have a cool startup – can you comment on how important it is, as a CEO, to make such a company data-driven or data-informed?

My consultancy helps companies to exploit their data, so we’re entirely data-driven! If a company has figured out that it has a lot of data and could steal a march on its competitors by exploiting it, that’s where we step in. Part of the reason I speak internationally is to help companies think about the value in their data, based on the projects we’ve worked on previously.


My name is Ian Ozsvald. I’m an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

I take on work in my Artificial Intelligence consultancy (Mor Consulting Ltd.) and I also author The Artificial Intelligence Cookbook – learn how to add clever algorithms to your software to make it smarter! One of my mobile products is SocialTies (built with RadicalRobot).

I co-founded ShowMeDo in 2005; it is all about tutorial screencasts that teach you programming – see About ShowMeDo for more info. This was my second company, and I’m rather proud to say that it is financially self-sufficient, growing, and full of very useful user-generated (and us-generated) content. 100,000 users and 1TB of data served per month say that we built something very useful indeed. In 5 years ShowMeDo has educated over 3 million people about open source tools.

I’m also co-founder of the £5 Apps Meetup, OpenCoffee Sussex and the BrightonDigital mail list (RIP).

Previously I worked as Senior Programmer at Algorithmix (now Corpora) and the MASA Group; these jobs came via my MSc in Artificial Intelligence at Sussex University. See my LinkedIn profile.

Interviews with Data Scientists: Vanessa Sabino

Time for another Interview with a Data Scientist.
I caught up with Vanessa Sabino, who is a lead data scientist on another of Shopify’s teams.
1. What project that you have worked on do you wish you could go back to, and do better?
Working as a practitioner in a company, as opposed to consulting, means I always have the option of going back and improving past projects, as long as the time spent on the task can be justified. There are always new ideas to try and new libraries being published, so as a team lead I try to balance the time spent on higher-priority tasks – which for my team currently means ETL work to improve our data warehouse – with exploratory analysis of our data sets and with creating and improving models that add value to our business users.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
My advice is to not underestimate the importance of communication skills, which range from listening, in order to understand exactly what the data means and the context in which it is used, to presenting your results in a way that demonstrates impact and resonates with your audience.
3. What do you wish you knew earlier about being a data scientist?
I wish I knew 20 years ago how to be a data scientist! When I was finishing high school and had to decide what to do at university, I had some interest in Computer Science, but I had no idea what a career in that area would be like. The World Wide Web was just starting, and living in Brazil, I had the impression that all software development companies were north of the Equator. So I decided to study Business, imagining I’d be able to spend my days using spreadsheets to optimize things. During the course I learned about data warehouses, business intelligence, statistics, data mining and decision science, but when it was over it was not clear how to get a job where I could apply this knowledge. I went to work at an IT consulting company, where I had the opportunity to improve my software development skills, but I missed working with numbers, so after two years I left to start a new undergrad degree in Applied Mathematics, followed by a Masters in Computer Science. Then I continued working as a software developer, now at web companies, and that’s when I started learning about the vast amount of online behavior data they were collecting and the techniques being used to leverage its potential. “Data scientist” is a new name for something that covers many different traditional roles, and a better understanding of the related terms would have allowed me to make this career move sooner.

4. How do you respond when you hear the phrase ‘big data’?
I prefer to work closer to data analysis than to data engineering, so in an ideal world I’d have a small data set with a level of detail just right to summarize everything that I can extract from that data. Whatever size the data is, if someone is calling it big data it probably means that the tool they are using to manipulate it is no longer meeting their expectations, and they are struggling with the technology in order to get their job done. I find it a little frustrating when you write correct code that should be able to transform a certain input into the desired output, but things don’t work as expected due to a lack of computing resources, which means you have to do extra work to get what you want. And the new solution only lasts until your data outgrows it again. But that’s just the way it is, and being at the boundary of what you can handle means you’ll be learning and growing in order to overcome the next challenges.

5. What is the most exciting thing about your field?
I’m excited about the opportunities to collaborate on a wide range of projects. Nowadays everyone wants to improve things with data-informed decisions, so you get to apply your skills to many areas, and you learn a lot in the process.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?
I always like to start with a simple proof of concept and iterate from there, using feedback from stakeholders to identify where the biggest gains are, so that I can pivot the project in the right direction. But the most important thing in this process is to constantly ask “why”, in particular when dealing with requests. This helps you validate your understanding of the problem and enables you to offer better alternatives that the business user might not be aware of when they make a request.
And for the bio:
Vanessa Sabino started her career as a systems analyst in 2000, and in 2010 she jumped at the opportunity to start working in Digital Analytics, which brought together her educational background in Business, Applied Mathematics, and Computer Science. She gained experience at Internet companies in Brazil before moving to Canada, where she is now a data analysis lead at Shopify, transforming data into Marketing insights.