Interviews with Data Scientists: David J. Hand


As part of my Data Science interview series, I recently reached out to David J. Hand.

David has an impressive biography and has contributed a lot to fraud detection and data mining. His answers are insightful and come from a statistical point of view. I feel that these academics have a lot to teach us practicing data scientists.

  1. What project that you have worked on do you wish you could go back to and do better?

I think I always have this feeling about most of the things I have worked on – that, had I been able to spend more time on it, I could have done better. Unfortunately, there are so many things crying out for one’s attention that one has to do the best one can in the time available. Quality of projects probably also has a diminishing returns aspect – spend another day/week/year on a project and you reduce the gap between its current quality and perfection by half, which means you never achieve perfection.

  2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I generally advise PhD students to find a project which interests them, which is solvable or on which significant headway can be made in the time they have available, and which other people (but not too many) care about. That last point means that others will be interested in the results you get, while the qualification means that there are not also thousands of others working on the problem (because that would mean you would probably be pipped to the post).

  3. What do you wish you knew earlier about being a statistician? What do you think industrial data scientists have to learn from this?

I think it is important that people recognise that statistics is not a branch of mathematics. Certainly statistics is a mathematical discipline, but so are engineering, physics, and surveying, and we don’t regard them as parts of mathematics. To be a competent professional statistician one needs to understand the mathematics underlying the tools, but one also needs to understand something about the area in which one is applying those tools. And then there are other aspects: it may be necessary, for example, to use a suboptimal method if this means that others can understand and buy in to what you have done. Industrial data scientists need to recognise that the fundamental aim of a data scientist is to solve a problem, and to do this one should adopt the best approach for the job, be it a significance test, a likelihood function, or a Bayesian analysis. Data scientists must be pragmatic, not dogmatic. But I’m sure that most practicing data scientists do recognise this.

  4. How do you respond when you hear the phrase ‘big data’?

Probably a resigned sigh. ‘Big data’ is proclaimed as the answer to humanity’s problems. However, while it’s true that large data sets, a consequence of modern data capture technologies, do hold great promise for interesting and valuable advances, we should not fail to recognise that they also come with considerable technical challenges. The easiest of these lie in the data manipulation aspects of data science (the searching, sorting, and matching of large sets), while the toughest lie in the essentially statistical inferential aspects. The notion that one nowadays has ‘all’ of the data for any particular context is seldom true or relevant. And big data come with the data quality challenges of small data, along with new challenges of their own.

  5. What is the most exciting thing about your field?

Where to begin! The eminent statistician John Tukey once said ‘the great thing about statistics is that you get to play in everyone’s back yard’, meaning that statisticians can work in medicine, physics, government, economics, finance, education, and so on. The point is that data are evidence, and to extract meaning, information, and knowledge from data you need statistics. The world truly is the statistician’s oyster.

  6. Do you feel universities will have to adapt to ‘data science’? What do you think will have to be done in say mathematical education to keep up with these trends?

Yes, and you can see that this is happening, with many universities establishing data science courses. Data science is mostly statistics, but with a leavening of relevant parts of computer science – some knowledge of databases, search algorithms, matching methods, parallel processing, and so on.


Professor David J. Hand

Imperial College, London

Bio: David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 26 books. He has broad research interests in areas including classification, data mining, anomaly detection, and the foundations of statistics. His applications interests include psychology, physics, and the retail credit industry – he and his research group won the 2012 Credit Collections and Risk Award for Contributions to the Credit Industry. He was made OBE for services to research and innovation in 2013.


Interview with a Data Scientist: Peadar Coyle


Peadar Coyle is a Data Analytics professional based in Luxembourg. His intellectual background is in Mathematics and Physics, and he currently works for Vodafone in one of their Supply Chain teams.

He is passionate about data science and the lead author of this project. He also contributes to Open Source projects and speaks at EuroSciPy, PyData and PyCon.

His expertise is largely in the statistical side of Data Science.

Peadar was asked by several of his interviewees to share his own interview, so he humbly obliges.

  1. What project that you have worked on do you wish you could go back to and do better?

I agree that it is better to look forward rather than look backward. And my skills have frankly improved since I first started doing what we could call professional data analysis (which was probably just before starting my Masters a few years ago).

One project that springs to mind (naming no names) is one where there was a huge breakdown in communication and misaligned incentives. There needed to be more communication on that project, and it overran its initially allotted time. I also didn’t spend enough time up front communicating the risks and opportunities to the stakeholders.

The data was a lot messier than expected, and management had committed to delivering results in two weeks. This was impossible; the data cleaning and exploration phase alone took longer than that. Now I would focus on quicker wins. I also rushed to the ‘modelling’ phase without really understanding the data. Terms like ‘understanding the data’ can sound a bit academic to some stakeholders, but you need to explain clearly and articulately how important the data generation process is, and the uncertainty in that data.

Some of this comes from experience – now I focus on adding value as quickly as possible and keeping things simple. There I fell to the siren call of ‘do more analysis’ rather than thinking about how the analysis is conveyed.

  2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD but I have recently been giving advice to people in that situation.

My advice is to build a portfolio of work if possible, or at least to work through an online course on Machine Learning or something similarly interesting.

The PyData videos are also a good starting point. If you can, I’d recommend taking outreach or communication skills courses – many universities around the world offer them, and they’ll help you understand the needs of others.

I frankly think that the most important skill for a data scientist is the ‘tactical application of empathy’, and that is something that working in a team really helps you develop. One thing I feel my Masters let me down on – as is common in Pure Mathematics – was a shortage of experience working in a team.

  3. What do you wish you knew earlier about being a data scientist?

The importance of communication skills, and the need to add value every day. Also the fact that a budget or a project can be terminated at any moment.

Adding value every day means showing results and sharing them, talking to people about stuff. Share visualizations, and share results – a lot of data science is about relationships and empathy. In fact I think that the tactical application of empathy is the greatest skill of our times.

You need to get out there and speak to the domain specialist, and understand what they understand. I believe that the best algorithms incorporate human as well as machine intelligence.

  4. How do you respond when you hear the phrase ‘big data’?

I too like the distinction between small, medium and big data. I don’t worry so much about the terminology; I focus on understanding exactly what my stakeholder wants from the data.

I think, though, that it is often a distraction. I did one proof of concept as a consultant that was an operational disaster. We didn’t have the resources to support a DevOps culture, nor did we have the capabilities to support a Hadoop cluster. Even worse, the problem could really have been solved more intelligently by keeping the data in RAM. But I got excited by the new tools without understanding what they were really for.

I think this is a challenge; part of maturing as an engineer/data scientist is appreciating the limits of tools and avoiding the hype. Most companies don’t need a cluster, and the mean size of a cluster will remain one for a long time. Don’t believe the salesmen; ask the experts in your community about what is needed.

In short: I do feel the term is strongly misleading, but it is certainly here to stay.

  5. How did you end up being a data analyst? What is the most exciting thing about your field?

My academic and professional career has taken a bit of a weird path. I started at Bristol in a Physics and Philosophy program. It was a really exciting time, and I learned a lot (some of it non-academic). I went into that program because I wanted to learn everything. At various points – especially in 2009–2010 – the terminology of ‘data science’ began to pick up, so when I went into grad school in 2010 I was ‘aware’ of the discipline. I took a lot of financial maths classes at Luxembourg, just to keep that option open, yet in my heart I still wanted to be an academic.

After some soul searching I eventually realized that academic opportunities were going to be too difficult to get, and that I could earn more in industry. So I did a few industrial internships, including, towards the end of my Masters, a six-month internship at a ‘small’ e-commerce company.

I learned a lot there, and it was where I realized I needed to work a lot harder on my software engineering skills. I’ve been working on them in my working life, through contributing to open source software and through my various speaking engagements. I strongly recommend that any wannabe data geeks come to these events and share their own knowledge 🙂

The most exciting thing about my field relates to my first point about physics and philosophy – we truly are drowning in data, and with the computational resources we now have, we really have the ability to answer, or simulate answers to, certain questions in a business context. The web is a microscope, and your ERP system tells you more about your business than you can actually imagine – I’m very excited to help companies exploit their data.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?

I like the OSEMIC framework (which I developed myself) and the CoNVO framework (which comes from Thinking with Data by Max Shron – I recommend his introductory video and the book itself).

Let me explain – at the beginning of an ‘engagement’ I look for the Context, Need, Vision and Outcome of the project. Outcome means the delivery, and asking these questions in conversation with stakeholders is a really good way to get to the heart of the ‘business problem’.

A lot of this after a few years in the business still feels like an art rather than a science.

I like explaining to people the Data Science process – obtain data, scrub data, explore, model, interpret and communicate.
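That obtain → scrub → explore → model → interpret/communicate flow can be sketched as a chain of stage functions. This is only a toy illustration of the shape of the process – the stand-in data and helper functions are invented, not taken from any real project:

```python
def obtain():
    # Stand-in for pulling raw records from a database or API.
    return [" 3.0", "4.5", "bad", "2.5 "]

def scrub(raw):
    # Drop records that don't parse; strip stray whitespace.
    clean = []
    for record in raw:
        try:
            clean.append(float(record.strip()))
        except ValueError:
            pass
    return clean

def explore(data):
    # Summary statistics guide the modelling choice.
    return {"n": len(data), "mean": sum(data) / len(data)}

def model(data):
    # A deliberately trivial 'model': always predict the mean.
    mean = sum(data) / len(data)
    return lambda _x: mean

def communicate(summary):
    # The step most easily skipped, and the one stakeholders see.
    return f"{summary['n']} usable records, mean {summary['mean']:.2f}"

data = scrub(obtain())
print(communicate(explore(data)))  # 3 usable records, mean 3.33
```

The point of writing it this way is that each stage has a single responsibility, so when the data turns out messier than expected it is the scrub stage, not the model, that grows.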

I think a lot of people get these kinds of notions, and a lot of my conversations at work recently have been about data quality – and data quality really needs domain knowledge. It is amazing how easy it is to misinterpret a number – especially around things like unit conversion.
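The unit-conversion trap is easy to sketch. A minimal, hypothetical guard is to normalise every value to one canonical unit and fail loudly on anything unrecognised – the factor table and function here are invented for illustration:

```python
# Conversion factors to a canonical unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "t": 1000.0}

def to_kg(value, unit):
    """Normalise a weight to kilograms, failing loudly on unknown units."""
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_KG[unit]

# Treating 500 g as 500 kg silently is exactly the misinterpretation risk.
print(to_kg(500, "g"))   # 0.5
print(to_kg(500, "kg"))  # 500.0
```

Raising on an unknown unit, rather than guessing, is the design choice that surfaces the domain-knowledge gap early instead of burying it in a wrong answer.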

  7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

I would echo a lot of the points above. One challenge is that some places aren’t ready for a data scientist, nor do they know how to use one. I would avoid such places and look for work elsewhere.

Some of this is a lack of vision, and one reason I do a lot of talks is to do ‘educated selling’ of the gospel of data-informed decision making, and of how new tools such as the PyData stack and R are helping us extract more and more value out of data.

I’ve also found that visualizations help a lot; humans react to stories and pictures more than to numbers.

My advice to new starters is to over-communicate and to learn some soft skills. The frameworks I mentioned help a bit in structuring and explaining a project to stakeholders. I also recommend reading this interview series – I learned a lot from it too 🙂

Interview with a Data Scientist: Ian Ozsvald


Ian Ozsvald is a Data Scientist based in London. He’s a friend and an inspiration to all us data geeks. He’s a co-organizer of PyData in London and speaks a lot on the data science circuit. He’s also very tall 🙂

I include a bio at the bottom.

1. What project that you have worked on do you wish you could go back to and do better?
My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project and the client provided representative data, according to the specification that I’d laid out. I built a set of classifiers that performed as well as a human and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn’t solve the task on this new, very messy data – I couldn’t imagine how the machine would solve it. The client explained that they wanted the first task to succeed so they gave me the best data they could find, and since we’d solved that problem, now I could work on the harder stuff. I tried my best to explain the requirements of the derisking project but I fear I didn’t give a deep enough explanation of why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the need for a derisking phase.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years’ experience in each industrial domain you’ll work in. None of this, however, is realistic. Instead focus on some areas that interest you and that pay well enough, and deepen your skills so that you’re valuable. Next go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high quality free tools. For my part, I speak, teach and keynote at conferences like PyDatas, PyCons, EuroSciPys and EuroPythons around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

3. What do you wish you knew earlier about being a data scientist?
I wish I’d known how much I’d regret not paying attention in statistics and linear algebra classes! I also wish I’d appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually, they don’t work well from lists of numbers.
4. How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem, and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine and can probably represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop, and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
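The sparse-array point is easy to demonstrate in pure Python. A minimal sketch (the data, sizes and helper functions are invented for illustration; a real project would reach for something like `scipy.sparse`):

```python
import sys

def to_sparse(dense):
    """Store only the non-zero entries of a vector as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(a, b):
    """Dot product of two sparse vectors, touching only shared non-zeros."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller one
    return sum(v * b[i] for i, v in a.items() if i in b)

# A mostly-zero vector: 1,000,000 entries, only 3 of them non-zero.
dense = [0] * 1_000_000
dense[10] = 2
dense[500] = 3
dense[999_999] = 5

sparse = to_sparse(dense)
print(len(sparse))  # only 3 entries kept
print(sys.getsizeof(sparse) < sys.getsizeof(dense))  # True: far smaller container
```

The same idea scales up: if only a small fraction of your "cluster-sized" matrix is non-zero, the sparse representation is what lets it fit in one machine's RAM.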

5. What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity, I strongly suspect that we can make this task machine-powered using some supervised approaches so less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough but first you have to derisk it and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value, this helps everyone stay confident when you hit the inevitable problems.

7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you’re doing. Make bits of it easy to understand, explain why it is valuable, and people will buy into it. Don’t hide behind your models; instead speak to domain experts, learn about their expertise and use your models to back up and automate their judgement – you’ll want them on your side.
8. You have a cool startup. Can you comment on how important it is, as a CEO, to make a company like that data-driven or data-informed?

My consultancy helps companies to exploit their data, so we’re entirely data-driven! If a company has figured out that it has a lot of data and could steal a march on its competitors by exploiting this data, that’s where we step in. A part of the reason I speak internationally is to help companies think about the value in their data, based on the projects we’ve worked on previously.


My name is Ian Ozsvald. I’m an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

I take on work in my Artificial Intelligence consultancy (Mor Consulting Ltd.) and I also author The Artificial Intelligence Cookbook – learn how to add clever algorithms to your software to make it smarter! One of my mobile products is SocialTies (built with RadicalRobot).

I co-founded ShowMeDo in 2005; it is all about tutorial screencasts that teach you programming – see About ShowMeDo for more info. This was my second company and I’m rather proud to say that it is financially self-sufficient, growing, and full of very useful user-generated (and us-generated) content. 100,000 users and 1TB of data served per month say that we built something very useful indeed. In 5 years ShowMeDo has educated over 3 million people about open source tools.

I’m also co-founder of the £5 Apps Meetup, OpenCoffee Sussex and the BrightonDigital mailing list (RIP).

Previously I’ve worked as Senior Programmer at Algorithmix (now Corpora) and the MASA Group, and these jobs came via my MSc in Artificial Intelligence at Sussex University.  See my LinkedIn profile.

Interviews with Data Scientists: Vanessa Sabino

Time for another Interview with a Data Scientist.
I caught up with Vanessa Sabino who is a lead data scientist in another one of Shopify’s teams. 
1. What project that you have worked on do you wish you could go back to and do better?
Working as a practitioner in a company, as opposed to consulting, means I always have the option of going back and improving past projects, as long as the time spent on this task can be justified. There are always new ideas to try and new libraries being published, so as a team lead I try to balance the time spent on higher priority tasks – which for my team currently is ETL work to improve our data warehouse – with exploratory analysis of our data sets and with creating and improving models that add value for our business users.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
My advice is to not underestimate the importance of communication skills, which goes from listening, in order to understand exactly what the data means and the context in which it is used, to presenting your results in a way that demonstrates impact and resonates with your audience.
3. What do you wish you knew earlier about being a data scientist?
I wish I knew 20 years ago how to be a data scientist! When I was finishing high school and had to decide what to do at university, I had some interest in Computer Science, but I had no idea what a career in that area would be like. The World Wide Web was just starting, and living in Brazil, I had the impression that all software development companies were north of the Equator. So I decided to study Business, imagining I’d be able to spend my days using spreadsheets to optimize things. During the course I learned about data warehouses, business intelligence, statistics, data mining and decision science, but when it was over it was not clear how to get a job where I could apply this knowledge. I went to work at an IT consulting company, where I had the opportunity to improve my software development skills, but I missed working with numbers, so after two years I left to start a new undergrad in Applied Mathematics, followed by a Masters in Computer Science. Then I continued working as a software developer, now at web companies, and that’s when I started learning about the vast amount of online behavior data they were collecting and the techniques being used to leverage its potential. “Data scientist” is a new name for something that covers many different traditional roles, and a better understanding of the related terms would have allowed me to make this career move sooner.

4. How do you respond when you hear the phrase ‘big data’?
I prefer to work closer to data analysis than to data engineering, so in an ideal world I’d have a small data set with a level of detail just right to summarize everything that I can extract from that data. Whatever size the data is, if someone is calling it big data it probably means that the tool they are using to manipulate it is no longer meeting certain expectations, and they are struggling with the technology in order to get their job done. I find it a little frustrating when you write correct code that should be able to transform a certain input to the desired output, but things don’t work as expected due to a lack of computing resources, which means you have to do extra work to get what you want. And the new solution only lasts until your data outgrows it again. But that’s just the way it is, and being at the boundary of what you can handle means you’ll be learning and growing in order to overcome the next challenges.

5. What is the most exciting thing about your field?
I’m excited about the opportunities to collaborate in a wide range of projects. Nowadays everyone wants to improve things with data informed decisions, so you get to apply your skills to many areas and you learn a lot in the process.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
I always like to start with simple proofs of concept and iterate from there, using feedback from stakeholders to identify where the biggest gains are so that I can pivot the project in the right direction. But the most important thing in this process is to constantly ask “why”, in particular when dealing with requests. This helps you validate the understanding of the problem and enables you to offer better alternatives that the business user might not be aware of when they make a request.
And for the bio:
Vanessa Sabino started her career as a system analyst in 2000, and in 2010 she jumped at the opportunity to start working with Digital Analytics, which brought together her educational background in Business, Applied Mathematics, and Computer Science. She gained experience from Internet companies in Brazil before moving to Canada, where she is now a data analysis lead for Shopify, transforming data into Marketing insights.

Interview with a Data Scientist: Shane Lynn

I caught up with Irish startup co-founder and ex-Analytics Manager Shane to discuss Data Science.
 Shane described himself the following way –
I’m co-founder of KillBiller, a company that helps mobile operators to gain new customers. We provide a mobile phone plan comparison service in Ireland that allows people to use their own call, text, and data usage information to find the best-value mobile tariff for their individual needs. In this position, I’m finding my way as a tech-startup founder, learning the actual ropes of creating a profitable business, and stretching my tech muscles on a complex and scalable Python backend on the Amazon cloud. It’s a blast.
I would like to add that his blog posts and contributions are really cool, and I’m glad to see contributions to the data science community that aren’t just from the West Coast of the USA.
1. What project that you have worked on do you wish you could go back to and do better?
Maybe every one?! I think that data science projects always have a bit of unfinished business. It’s a key part of the trade to be able to identify when enough is enough, and when extra time would actually lead to tangible results. Is 4 hours tuning a model worth an extra 0.01% in accuracy? Maybe in some cases, but not most. Unfortunately, I think that a huge number of real data science business cases leave you with a little “ooh, I could have tried…” or “oh, we might have optimised…”.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

“The more I learn, the more I realise how much I don’t know.” There seems to be a never ending list of new technologies and new techniques to get your head around. I would say to budding professionals that if you can get a solid understanding of basic key techniques in your repertoire to start with, you’ll do better than learning buzz words about the latest trends. While the headline-grabbing bleeding edge research will always seem to sparkle, the reality of data science in business is that people are still using proven techniques that work reliably and simply – think regression and k-means over deep-learning and natural language processing. Get the basics right first.
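To illustrate how far the basics go, here is a closed-form ordinary least squares fit in plain Python – exactly the kind of "proven technique" Shane means. The toy data is invented, and in practice you would reach for a library like scikit-learn or statsmodels rather than hand-rolling this:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the fitted line passes through the mean point.
    a = mean_y - b * mean_x
    return a, b

# Data lying exactly on y = 1 + 2x.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```

Twelve lines, no dependencies, fully interpretable coefficients – for a lot of business problems that beats a deep network you can't explain to the stakeholder.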

3. What do you wish you knew earlier about being a data scientist?
Data preparation. I know you see it written down everywhere, but there is no exaggeration at all in the phrase – you’ll spend 80% of your time preparing the data. I’m sure everyone says it, and should know it, but it’s a key part of the work and a very important step in the information discovery process.
4. How do you respond when you hear the phrase ‘big data’?
That depends on where it comes from. At a business conference, from a salesman – sometimes with rolling eyes. At a tech meetup in Dublin – maybe with some interest. I think that Big Data has been hyped to death, and the reality is that, for now, there are very few companies that actually require a large-scale Hadoop deployment. I’ve worked with some of the largest companies on data science projects, and to date I have been able to process the data required on a single machine. However, I’m aware that that is an Irish-specific viewpoint, where naturally our population and market size reduce the volume of data in many fields. Ultimately, though, I do think that Big Data is a function of the IT department; data scientists will simply leverage the tools to extract meaningful excerpts or subsets for analysis.
5. What is the most exciting thing about your field?
It’s ever-changing, ever-growing, and moving quickly. While it’s daunting sometimes to think of the speed of progress, it’s also extremely exciting to be involved in a world where new ideas, tools, and techniques are being spread on a weekly basis. There’s a huge amount of enthusiasm out there in the community and a plethora of new opportunities to be explored.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
I tend to start to tackle each problem after I’ve had a good look at the data behind it. Perhaps an extract, perhaps an MVP-type model – just enough to grasp the state of the data and the amount of cleansing required, and to identify potential problems and benefits. It’s extremely difficult to accurately estimate the outcome of a data science problem before you start working, so a few hours of exploration are very worthwhile. Scope is usually limited naturally by time and budget, and you can relatively quickly get to a point where negligible gains are being made for additional time investment.
7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
There’s a political landscape in every company that you’ll join. Take the time to learn the ropes and learn how your company deals with these items. I find that frequent and realistic updates on progress and expectations are key to managing the various parties. Don’t hide the dirty bits or the issues. And probably budget three times the time that you initially think for each task – there’s always hidden issues!
8. You have a cool startup. Can you comment on how important it is, as a CEO, to make a company like that data-driven or data-informed?
I’m working on KillBiller, an Irish startup that makes difficult decisions easy. KillBiller automatically audits your mobile phone usage and works out exactly what you would spend on every mobile network and tariff. We’ve saved almost 20,000 people money on their phone bills!
In our case, we’re all about data – processing people’s mobile usage, doing it securely, accurately, quickly, and presenting the results in a meaningful way. In addition, a data-driven approach to the startup world has its advantages – having a solid understanding of your marketing effectiveness, website traffic, user retention, and route to revenue allows us to make decisions backed by science over intuition.
More information about Shane can be found at his blog.

Interview with a Data Scientist: Ian Huston

Ian started out as a theoretical physicist, moved into data science a few years ago and is now part of the data science team at Pivotal Labs, the agile software consulting arm of Pivotal. Ian has worked on a variety of customer engagements at Pivotal including catastrophe risk modelling, fashion & consumer analytics, factory production quality and online marketing. Ian has been building analytical and numerical models for about 10 years and started out building high performance computing models of the earliest moments of the universe after the Big Bang.
1. What project have you worked on do you wish you could go back to, and do better?
First of all, thanks for the opportunity to be part of this interview series! I think if you are continually learning you always look back on past work with a view to what could have been done better. Having said that I don’t think there is one particular commercial project that I would pick out to redo, but maybe I don’t have enough perspective yet. I imagine most people who have done a PhD would probably like to redo some of the technical parts but were just relieved to get it finished at the time.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
When you are doing a PhD you have a very narrow focus and it can be hard to see where your skills and experience might be valuable outside academia. I would recommend trying to get a bit of an outside perspective, go to industry meetups and any ‘post-academia’ workshops that are available in your university.
It’s helpful to try to understand what someone hiring in industry is looking out for. For me, someone leaving academia doesn’t need to have full technical ability in the new area (e.g. machine learning) but should have made an effort to start down that learning path, and they should make it easy for me to see that. I’ve seen people leaving academia just submit the same academic CV to an industry data science role as they would use for a postdoc physics research position. I would suggest asking someone in the field you want to enter to critique your CV to avoid this kind of mistake.
3. What do you wish you knew earlier about being a data scientist?
I don’t think you can overstate how much of data science is really about working with people of all different technical levels and backgrounds. Coming from a theoretical physics background, which can be quite a solitary environment, I knew that data science and especially consulting would be very different. Every day I am reminded that my role is often more about managing relationships and understanding people’s needs than just writing code.
4. How do you respond when you hear the phrase ‘big data’?
I still cringe a little, but I understand that it is a useful short-hand for a change in behaviour and scale that some parts of the tech industry are still not ready for. I like the more recent categorisation into small-, medium- and big-data, as I think many companies really have medium data problems, where processing on a laptop in-memory is not feasible, but they don’t yet need a 10,000 core cluster. There is clearly a lot you can do before you start operating at the very largest scales of places like Facebook or Google. When you do start reaching those scales, however, the problems are very different and the ‘big data’ technologies like Hadoop and massively parallel processing databases really come into their own.
5. What is the most exciting thing about your field?
For me the most exciting thing is that we haven’t figured out all the ways predictive analytics and data science can help solve business problems. There are some well worn paths now, but each day new applications of machine learning and predictive algorithms are discovered, and new areas of industry become interested.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
At Pivotal Labs I have learned a lot from our software development team about how to iterate quickly to minimise the risk in a project. For me, open and clear communication is the key to managing expectations and making sure that the project is providing value. If you can show some value very quickly and then build on that iteratively, you can have a continual dialogue about progress and expectations will not easily get out of sync.
A lot of people in this field have a perfectionist streak, so knowing when to stop and what ‘good enough’ looks like is an important skill. Does the time and effort needed to eke out that next 1% in accuracy really provide enough value or is the current performance just as good given the way the model will be applied?
7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Cultural challenges can be difficult, and even the differences between European and American attitudes to data protection can lead to internal problems in an organisation. As a data scientist, you often get into the ‘ugly baby’ scenario, where you have to explain to a leadership team or organisation that their carefully collected data is not quite as nice as they thought, or that their idea to run their niche business based on real time Twitter feedback is not going to be possible with the signal-to-noise ratio that is present. I think empathy is a very important trait and the way we hire people tries to select for this. If you can see the situation from the other person’s viewpoint it helps enormously when trying to resolve difficult situations.
8. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
Some C-level execs really understand the value that data science can bring. The US has had a bit of a head start in this, and with successful projects under their belts they are ready to use data science more widely in their organisations. In Europe we are still in that learning phase I think, so making a success of that first project is important. Showing value early and often during a project can really help to drive understanding and appreciation of the possibilities that data science can provide.
A lot of people have now heard of data science and machine learning, and there are success stories in the mainstream and industry press. A few years ago this wasn’t the case and you had to spend a long time explaining at a relatively basic level how data science could be useful. You still have to do some of that, but it’s a bit easier and you can point to mainstream examples, which helps a lot. As a layperson it’s still very difficult to understand why one type of analysis is easy and another is very difficult. Randall Munroe captured this well in XKCD 1425, but with the progress in computer vision recently, even this example is nearly out of date!
I really enjoy the interview series so thank you for the opportunity to take part!

Interview with a Data Scientist: Cameron Davidson-Pilon

Cameron is an open source contributor, a pythonista and a data geek – he’s developed various cool libraries. His blog is worth a read, and I personally recommend his screencasts.
He’s got a strong mathematical background like myself, and he is currently Lead Data Analyst, a data science role, at Shopify. He’s possibly most famous in the Python community for his excellent Bayesian Methods for Hackers. I also had the honour of contributing to that project.
1. What project have you worked on do you wish you could go back to, and do better?
1. For sure, it was my projects during 2012 when I first started to enter Kaggle competitions. The two in particular I wish I could redo were the Twitter Psychopaths challenge and the US Census Return Rate challenge. In both challenges I made some serious high-level errors (but that’s the point of these challenges: to discover mistakes before they happen when it really matters!). I’ve detailed my mistakes in the US Census challenge in my latest PyData presentation, “Mistakes I’ve Made”. Basically I ignored population variance and replaced it with machine learning egotism. Oh, I also remembered another project I would really love to go back to. In 2011, when I was doing research into stochastic processes, I started my first Python library (if you could even call it that) called PyProcess. You can still see it here:
Notice that it is, embarrassingly, one large file filled with Python classes. The first iteration didn’t even use Numpy! I would love to go back and redo the entire thing, but two things hold me back: 1) it was a lot of work to test each stochastic process and make sure they were doing the right thing, and 2) I’m too far out of the field now.
(Editor note: I personally used PyProcess during some of my Financial Mathematics coursework and always meant to try to add to the project, but never did)
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
2. If you’re not already learning and using Python or Scala, do that. Similarly, if you’re not already learning some software engineering, do that. What are some examples of data science software engineering?
– writing (close to) professional-level code – thinking about proper abstractions, writing testable pieces, thinking about reusability
– having code reviewed, and reviewing code yourself
– writing tests
Why do I emphasize programming and software development so much? At a high level, data science is about using computers to do statistics for you. If you can’t properly use the former, then the most important tool in your toolbox is missing.
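As a toy illustration of “writing testable pieces” (my example, not Cameron’s): a small, pure function with a narrow job is easy to pin down with a test.

```python
def sessionize(timestamps, gap=1800):
    """Split sorted event timestamps (in seconds) into sessions,
    starting a new session when the gap between consecutive
    events exceeds `gap` seconds."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # continue the current session
        else:
            sessions.append([t])    # start a new session
    return sessions

def test_sessionize():
    assert sessionize([]) == []
    assert sessionize([0, 100, 5000]) == [[0, 100], [5000]]
    assert sessionize([0, 2000], gap=3000) == [[0, 2000]]
```

Because the function takes plain data and touches no database, the test runs anywhere, can be code-reviewed at a glance, and doubles as documentation of the intended behaviour.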
3. What do you wish you knew earlier about being a data scientist?
3. I wish I, and the rest of the field, knew more about data cleaning. This is an important part of the whole data story and is glossed over. Specifically, the ETL pipeline (extract–transform–load). What I used to do was use SQL for the T part, but this caused too many problems (untestable, unmaintainable, unscalable). Now that is done prior to me even using the data for anything remotely complicated. This saves me time later, and allows the entire team to scale and benefit from my work (yes, I am still writing ETLs – I expect all my team members to, too). The problem is, you can’t really teach ETLs until you have the data problem. Small companies (I mean really small companies) and tutorials online can assume the data is fine. Not until one is submerged in changing data does the ETL process start to make sense. So, though I wish I knew this earlier, I probably couldn’t have learned it anyway!
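A minimal sketch of what moving the T step out of SQL can look like (the table, field names, and rounding rule here are invented for illustration): the transform is a pure Python function that can be unit-tested in isolation, while the extract and load steps stay deliberately dumb.

```python
import csv
import io
import sqlite3

def transform(row):
    """Pure, unit-testable T step: clean one raw record."""
    return {"user_id": int(row["user_id"]),
            "amount": round(float(row["amount"]), 2)}

def run_etl(raw_csv, conn):
    """Extract from CSV text, transform in code, load into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spend (user_id INTEGER, amount REAL)")
    for row in csv.DictReader(io.StringIO(raw_csv)):        # extract
        clean = transform(row)                              # transform
        conn.execute("INSERT INTO spend VALUES (?, ?)",
                     (clean["user_id"], clean["amount"]))   # load
    conn.commit()
```

When the transform logic inevitably changes with the data, only the pure function and its tests need to move.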
4. How do you respond when you hear the phrase ‘big data’?
4. Sure, “Big Data” is a buzzword, but I think the issue with the name comes down to two camps: are you seeing “Big Data” as a solution (probably wrong) or as a problem (probably right)? For example, two common questions an organization might have are 1) find the number of unique visitors to our site in the past month, and 2) find the median of this dataset. If your data is simply too big for memory, which is a good definition of big data, then we can’t solve either of these problems naively. What is really interesting about big data as a problem is the abundance of cool new algorithms and data structures being invented to solve it. For example, HyperLogLog estimates the number of unique values in a set of data too big for memory. And t-digest estimates the percentiles of data too big for memory (and hence too big to be sorted).
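To make that concrete, here is a stripped-down, illustrative HyperLogLog in pure Python – a simplified sketch of the algorithm Cameron mentions, not a production implementation (real libraries use faster hashes and more careful bias corrections):

```python
import hashlib
import math

def hll_estimate(items, p=10):
    """Toy HyperLogLog: estimate the distinct count using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)          # low p bits choose a register
        rest = h >> p              # remaining bits feed the rank
        rank = 1                   # 1 + number of trailing zero bits
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)   # standard bias correction for large m
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:       # small-range: fall back to linear counting
        raw = m * math.log(m / zeros)
    return int(raw)
```

The point is the memory bound: whatever the stream size, the sketch holds only 2**p small registers, which is why it works on data too big for memory.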
5. What is the most exciting thing about your field?
5. I’ve already mentioned the interesting new algorithms for big data problems, so I won’t go over them again, but I do think they are very exciting. Another exciting thing is the new problems being discovered, and the solutions being found. For example, the recommendation problem – what to recommend to visitors of a site – is a new problem with massive impact, and it is being solved with data. I can’t imagine Fisher or Pearson ever asking the question “what should I recommend next to this user?”. In a similar vein, we *are* seeing the re-emergence of classical statistics. Classical techniques like survival analysis, clinical trials, and logistic regression are seeing a major comeback because new problems have been identified.
6. How do you go about framing a data problem? 
6. Honestly, I try to turn it into a binomial problem. I use the beta-binomial model as a large crutch far too often, but it’s a really good initial model of a problem. If I can turn the problem into a binomial problem, then I have lots of tools I can work with: Bayesian analysis, sample-size appropriate ranking techniques, Bayesian Bandits, etc. If I can’t turn it into a binomial problem, I go through the rest of my toolbox: survival analysis, lifetime value, Bayesian modeling, classification, association analysis, etc. If I still can’t find an appropriate solution, then I have to expand my scope (and often learn a new tool while doing that).
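A minimal sketch of the beta-binomial machinery Cameron describes, plus a Thompson-sampling step in the Bayesian Bandits spirit (the function names and the uniform Beta(1, 1) prior are my choices for illustration):

```python
import random

def posterior_params(successes, trials, a=1.0, b=1.0):
    """Beta(a, b) prior + binomial data -> Beta posterior (conjugacy)."""
    return a + successes, b + (trials - successes)

def posterior_mean(successes, trials, a=1.0, b=1.0):
    """Shrunken estimate of the underlying rate."""
    pa, pb = posterior_params(successes, trials, a, b)
    return pa / (pa + pb)

def thompson_pick(arms):
    """arms: list of (successes, trials) pairs. Draw one sample from
    each arm's posterior and play the arm with the highest draw."""
    draws = [random.betavariate(*posterior_params(s, n)) for s, n in arms]
    return max(range(len(arms)), key=lambda i: draws[i])
```

Ranking by the posterior mean rather than the raw success rate is the “sample-size appropriate ranking” idea: an item with 1 success in 1 trial (posterior mean ≈ 0.67) no longer outranks one with 90 in 100 (≈ 0.89).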