Why Code Review? Or why should I care as a data scientist?


The insightful Data Scientist Trey Causey talks about Software Development Skills for Data Scientists. I’m going to write about my views on Code Review, as a Data Scientist with a few years’ experience, including experience delivering Data Products at organizations of varying sizes. I’m not perfect and I’m still maturing as an Engineer.

A good, thorough introduction to Code Review comes from the excellent team at Lyst. I suggest that as follow-up reading!

The fundamental nugget is that ‘code reviews allow you to more effectively collaborate with your peers’, and a lot of new Engineers and Data Scientists don’t know how to do that. This is one reason why I wrote ‘soft skills for data scientists’. This article talks about a technical skill, but I consider this a kind of ‘technical communication’.

Here are some views on ‘why code review’ – I share them here as reference, largely to remind myself. I steal a lot of these from this video series.

  • Peer to peer quality engineering and training 

As a Data Science community that is still forming, and with us coming from various backgrounds, there’s a lot of invaluable knowledge to gain from others in the team. Don’t waste your chance at getting that 🙂

  • Catches bugs easily

We all write bugs when we write code, and a second pair of eyes often spots them quickly.

  • Keeps team members on the same page

      • Domain knowledge
        How do we share knowledge about our domain with others without sharing code?
      • Project style and architecture
        I’m a big believer in using structured projects like Cookiecutter Data Science and I’m sure there exist alternatives in other languages. Beforehand I had a messy workflow of hacked-together IPython notebooks and no idea what was what – refactoring code into modules is a good practice for a reason 🙂
      • Programming skills
        I learn a lot myself by reading other people’s code – a lot of the value of being part of an open source project like PyMC3 is that I learn a lot from reading people’s code 🙂
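
To make the ‘catches bugs easily’ point from the list above concrete, here is a minimal Python sketch of a classic bug that a reviewer tends to spot at a glance: a mutable default argument. The function names are hypothetical and only for illustration; they are not taken from any real project.

```python
def add_feature_buggy(name, features=[]):
    # Bug: the default list is created once, when the function is defined,
    # and is then shared by every call that omits `features`.
    features.append(name)
    return features


def add_feature_fixed(name, features=None):
    # Fix: build a fresh list on each call when none is supplied.
    if features is None:
        features = []
    features.append(name)
    return features


first = add_feature_buggy("age")
second = add_feature_buggy("income")  # shares the same list as `first`

clean_first = add_feature_fixed("age")
clean_second = add_feature_fixed("income")  # independent of `clean_first`
```

The buggy version passes a quick manual check with a single call, which is exactly why this kind of mistake is easier for a second reader to catch than for the author.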

Other good practices

  • PEP8 and Pylint (according to team standards)
  • Code review often, but by request of the author only

I think it’s a good idea (I think Roland Swingler mentioned this to me) not to obsess too much about style. Having a linter do that is better; otherwise code reviews can become overly critical and pedantic. This can stop people sharing code and leads to criticism that can shake Junior Engineers in particular, who need psychological safety. As I mature as an Engineer and a Data Scientist I’m aware of this more and more 🙂
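
As a rough illustration of why style checks are better left to automation, the toy sketch below flags two mechanical issues (over-long lines and trailing whitespace) using only the standard library. It is not how Pylint or any PEP8 checker is implemented, and the function name and the two rules chosen are my own assumptions; a real team would simply run its agreed linter.

```python
MAX_LINE_LENGTH = 79  # PEP 8's default limit; teams may agree on another value


def style_problems(source):
    """Return (line_number, message) pairs for two simple style issues."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append((number, "line too long"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


sample = "x = 1 \ny = 2\n"  # the first line ends with a stray space
print(style_problems(sample))
```

If a machine reports these before the review starts, the human reviewer can spend their attention on design and correctness instead.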

Keep code reviews small

  • < 20 minutes, < 100 lines is best
  • Large code reviews make suggestions harder and can lead to bikeshedding
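
One way to keep yourself honest about the ‘< 100 lines’ rule of thumb is to count the changed lines in a diff before asking for a review. The sketch below is a minimal example under my own assumptions: it expects unified diff text (such as the output of a diff tool), and the function names and the sample diff are hypothetical.

```python
REVIEW_LINE_LIMIT = 100  # the "< 100 lines" rule of thumb from the list above


def changed_lines(diff_text):
    """Count added and removed lines in a unified diff, skipping file headers."""
    count = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            # File header lines, not content changes. Limitation of this
            # sketch: a changed line whose own text starts with "++" or "--"
            # would be skipped too.
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count


def too_big_to_review(diff_text, limit=REVIEW_LINE_LIMIT):
    """Return True when the diff exceeds the agreed review size."""
    return changed_lines(diff_text) > limit


example_diff = """\
--- a/model.py
+++ b/model.py
@@ -1,3 +1,3 @@
 import math
-def score(x): return x
+def score(x):
+    return math.sqrt(x)
"""

print(changed_lines(example_diff), too_big_to_review(example_diff))
```

The limit is a team convention rather than a hard rule, so it is kept as a parameter.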

These are my own lessons so far and are based on experience writing code as a Data Scientist – I’d love to hear your views.


Interview with a Data Scientist: Alice Zheng

I recently caught up with Alice Zheng, a Director of Data Science at Dato. Alice is an expert on building scalable Machine Learning models and currently works for Dato (www.dato.com), a company providing tooling to help you build scalable machine learning models easily. She is also a keen advocate of encouraging women in Machine Learning and Computer Science. Alice has a PhD from UC Berkeley and spent some of her postdoctoral time at Microsoft Research in Redmond. She is currently based in Washington State in the US.

1. What project have you worked on that you wish you could go back to and do better?
Too many! The top of the list is probably my PhD thesis. I collaborated with folks in software engineering research and we proposed a new way of using statistics to debug software. They instrumented programs to spit out logs for each run that provide statistics on the state of various program variables. I came up with an algorithm to cluster the failed runs and the variables. The algorithm identifies variables that are most correlated with each subset of failures. Those variables, in turn, can take the programmer very close to the location of the bug in the code.
It was a really fun project. But I’m not happy with the way that I solved the problem. For one thing, the algorithm that I came up with had no theoretical guarantees. I did not appreciate theory when I was younger. But nowadays, I’m starting to feel bad about the lack of rigor in my own work. It’s too easy in machine learning to come up with something that seems to work, maybe even have an intuitive explanation for why it makes sense, and yet not be able to write down a mathematical formula for what the algorithm is actually doing.
Another thing that I wish I had learned earlier is to respect the data more. In machine learning research, the emphasis is on new algorithms and models. But solving real data science problems requires having the right data, developing the right features, and finally using the right model. Most of the time, new algorithms and methods are not needed. But a combination of data, features, and model is the key. I wish I’d realized this earlier and spent less time focusing on just one aspect of the whole pipeline.

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
Be curious. Go deep. And study the arts.
Being curious gives you breadth. Knowing about other fields pulls you out of a narrow mindset focused on just one area of study. Your work will be more inspired, because you are drawing upon diverse sources of information.
Going deep into a subject gives you depth and expertise, so that you can make the right choices when trying to solve a problem, and so that you might more adequately assess the pros and cons of each approach.
Why study the arts? Well, if I had my druthers, art, music, literature, mathematics, statistics, and computer science would be required courses for K12. They offer completely different ways of understanding the world. They are complementary to each other. Knowing more than one way to see the world makes us more whole as human beings. Science _is_ an art form. Analytics is about problem solving, and it requires a lot of creativity and inspiration. It’s art in a different form.

3. What do you wish you knew earlier about being a data scientist?
Hmm, probably just what I said above–respect the data. Look at it in all different ways. Understand what it means. Data is the first class citizen. Algorithms and models are just helpers. Also, tools are important. Finding and learning to use good tools will save a lot of time down the line.

4. How do you respond when you hear the phrase ‘big data’?
Cringe? Although these days I’ve become de-sensitized. 🙂
I think a common misconception about “big data” is that, while the total amount of data may be big, the amount of _useful_ data is very small in comparison. People might have a lot of data that has nothing to do with the questions they want to answer. After the initial stages of data cleaning and pruning, the data often becomes much much smaller. Not big at all.

5. What is the most exciting thing about your field?
So much data is being collected these days. Machine learning is being used to analyze them and draw actionable insights. It is being used to not just understand static patterns but to predict things that have not yet happened. Predicting what items someone is likely to buy or which customers are likely to churn, detecting financial fraud, finding anomalous patterns, finding relevant documents or images on the web. These applications are changing the way people do business, find information, entertain and socialize, and so much of it is powered by machine learning. So it has great practical use.
For me, an extra exciting part of it is to witness applied mathematics at work. Data presents different aspects of reality, and my job as a machine learning practitioner is to piece them together, using math. It is often treacherous and difficult. The saying goes “Lies, damned lies, and statistics.” It’s completely true; I often arrive at false conclusions and have to start over again. But it is so cool when I’m able to peel away the noise and get a glimpse of the underlying “truth.” When I’m getting nowhere, it’s frustrating. But when I get somewhere, it’s absolutely beautiful and gratifying.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Oh! I know the answer to this question: before embarking on a project, always think about “what will success look like? How would I be able to measure it?” This is a great lesson that I learned from mentors at Microsoft Research. It’s saved me from many a dead end. It’s easy to get excited about a new endeavor and all the cool things you’ll get to try out along the way. But if you don’t set a metric and a goal beforehand, you’ll never know when to stop, and eventually the project will peter out. If your goal IS to learn a new tool or try out a new method, then it’s fine to just explore. But with more serious work, it’s crucial to think about evaluation metrics up front.

7. You spent some time at other firms before Dato. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
I think this is a continuous learning experience. Every organization is different, and it’s incredible how much of a leader’s personality gets imprinted upon the whole organization.  I’m fascinated by the art and science behind creating successful organizations. Having been through a couple of very different companies makes me more aware of the differences between them. It’s very much like traveling to a different country: you realize that many of the things you took for granted do not actually need to be so. It makes me appreciate diversity. I also learn more about myself, about what works and what doesn’t work for me.
How to manage cultural challenges? I think the answer to that is not so different between work and life. No matter what the circumstance, we always have the freedom and the responsibility to choose who we want to be. How I work is a reflection of who I am. Being in a new environment can be challenging, but it can also be good. Challenge gets us out of our old patterns and demands that we grow into a new way of being. For me, it’s helpful to keep coming back to the knowledge of who I am, and who I want to be. When faced with a conflict, it’s important to both speak up and to listen. Speaking up (respectfully) affirms what is true for us. Listening is all about trying to see the other person’s perspective. It sounds easy but can be very difficult, especially in high stress situations where both sides hold to their own perspective. But as long as there’s communication, and with enough patience and skill, it’s possible to understand the other side. Once that happens, things are much easier to resolve.

8. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
I point to all the successful examples of data science today. With successful companies like Amazon, Google, Netflix, Uber, AirBnB, etc. leading the way, it’s not difficult to convince people that data science is useful. A lot of people are curious and need to learn more before they make the jump. Others may have already bought into it but just don’t have the resources to invest in it yet. The market is not short on demand. It is short on supply: data scientists, good tools, and knowledge. It’s a great time to be part of this ecosystem!

Interview with a Data Scientist: Maria Rosario Mestre


I recently caught up with Maria Rosario Mestre, who shared her personal views on Data Science. As with all these interviews, the views do not reflect her employer’s.


Biography – Maria: 

I completed a PhD in signal processing at Cambridge developing models of user behaviour using brain data. After the PhD I joined Skimlinks as a data scientist, where I model online user behaviour and work on much larger datasets. My main role is implementing large-scale machine learning models processing terabytes of data.


What project have you worked on that you wish you could go back to and do better?

I think that pretty much applies to any project you do as a data scientist. When you’re developing algorithms that become a service used by someone either internally or externally, I think it is best to use an iterative approach where you wait for some feedback from the client before doing any further improvements. I am a true believer in “lean data science”.


What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I guess it depends on who the advice is for. If it is for PhD students thinking about a career as a data scientist in industry, then I would strongly recommend that they get some experience working on real-world data at some point during the PhD. It is quite common in academia to work mainly on synthetic data. In addition to that, I would say it is important to keep a curious and open mind about the research carried out by other people, since it is very easy to only stay focused on your specific research project. For analytics professionals, I would say that learning how to code is quite useful, especially in a scripting language like Python. Knowing some classical statistics is also very helpful, if you want to learn how to apply a scientific approach to any type of data analysis.


What do you wish you knew earlier about being a data scientist?

There is not much I can think of, but maybe I wish I had spent more time using version control platforms, such as GitHub. During my PhD I had a very rudimentary version control method: copying my whole project into a different folder with today’s date. It was definitely not the best way of managing my project. In my current role we work on a shared codebase and we need to keep track of changes, so I had to start using GitHub. I wish I had taken more time to learn how to use it properly before diving into it, as it would have saved me a lot of time.


How do you respond when you hear the phrase ‘big data’?

I say that’s boring, now it’s all about “massive data”! Now seriously, I have experienced big data at Skimlinks, where we run daily jobs on terabytes of data using Spark. I think “big data” is a real thing, but people sometimes believe they have it when they don’t, or if they have it, then they think they need to do something about it, but don’t know what. I don’t think that you should approach “big data” as a solution in search of a problem. You should always think of the problem first that you’re trying to solve, see if your data scale qualifies as “big data”, and then finally start using big data tools once you have defined all these parameters. It is a waste of time and resources to start using these tools just because they are fashionable and you’re scared of missing out.


What is the most exciting thing about your field?

I find solving real problems exciting, and if these problems are hard, then it’s doubly exciting. As a data scientist, you have to solve hard problems all the time, mainly because real data is never like in the textbooks! It’s always biased, with missing columns or wrong values. Then, I also find it exciting to solve problems with large-scale data. It is very easy to use out-of-the-box Python libraries to run a machine learning algorithm, but what happens when you have to adapt that algorithm to run on 500 gigabytes? That’s when you need to start thinking creatively using the tools you already know to solve a new problem. You might even be the first person to solve such a problem!

In more general terms, I think that machine learning will have a huge impact on our daily lives. We have already started seeing the effects now that we are always connected and use increasingly intelligent apps, but I think this is only the beginning.


How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 

This is a great question, and one that I keep asking myself. As I said earlier, I believe in lean data science. What this means is that I believe you need to start with a very clear objective you are trying to achieve and use an iterative approach over it, always gathering feedback from the end user. If possible, the end goal should be stated in clear objective metrics, like increasing the accuracy of a classifier by 10%, or making better recommendations in 20% of the cases. You know it’s good enough when the end user is happy. I also believe that sometimes when you look at a problem from a lot of different angles and don’t seem to make a lot of progress, it is good to document all the attempts, leave it on the side, and get back to it later with a fresh pair of eyes.


How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?

As a data scientist, your role is not only to develop algorithms, but also to be an evangelist in your own company on the use of data science, and generally the scientific method. If you want to convince business people that data science is important, then the best you can do is talk business. You need to think of data science projects in terms of the value they can add to your business, either because they can increase conversion rates, or keep some customers happy, or make someone’s job in the company much easier… You can start by running small experiments and gather some results to show to the executives in your company. However, data science is not the solution to any problem, and sometimes a simple rule-based model could do the job just as well. It is important not to oversell what you can do, and be realistic about what you can offer.


What is the most exciting thing you’ve been working on lately? Tell us a bit about it.

Skimlinks is about to launch a new product in the coming weeks, and the data science team has been heavily involved in its making. I cannot say much about it unfortunately, but these are exciting times for the company. From a technical point of view, the last thing that I have done which was exciting was classifying 1.2 billion data points using Spark. I broke a personal record in terms of the size of the data involved.


What is the biggest challenge of building a data science team?

I would have to ask my manager, since I have never built a team myself. I have been involved in the hiring process though, and I think it is sometimes difficult to find the right combination of skills across the team. You want some people who have experience working with data, others who may be stronger in engineering. It is also important to manage people’s expectations about the role, since data scientists spend a lot of time doing data processing and setting up data pipelines before they can apply machine learning algorithms. It’s all part of the job!

Interview with a Data Scientist: Trey Causey

Trey Causey is a blogger with experience as a professional data scientist in sports analytics and e-commerce. He’s got some fantastic views about the state of the industry, and I was privileged to read this.
1. What project have you worked on that you wish you could go back to and do better?
The easy and honest answer would be to say all of them. More concretely, I’d love
to have had more time to work on my current project, the NYT 4th Down Bot before
going live. The mission of the bot is to show fans that there is an analytical
way to go about deciding what to do on 4th down (in American football), and that
the conventional wisdom is often too conservative. Doing this means you have to
really get the “obvious” calls correct as close to 100% of the time as possible,
but we all know how easy it is to wander down the path to overfitting in these
circumstances…
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?
Students should take as many methods classes as possible. They’re far more generalizable
than substantive classes in your discipline. Additionally, you’ll probably meet
students from other disciplines and that’s how constructive intellectual cross-fertilization
happens. Additionally, learn a little bit about software engineering (as distinct
from learning to code). You’ll never have as much time as you do right now for things
like learning new skills, languages, and methods.
For young professionals, seek out someone more senior than yourself, either at your
job or elsewhere, and try to learn from their experience. A word of warning, though,
it’s hard work and a big obligation to mentor someone, so don’t feel too bad if
you have a hard time finding someone willing to do this at first. Make it worth
their while and don’t treat it as your “right” that they spend their valuable
time on you. I wish this didn’t even have to be said.
3. What do you wish you knew earlier about being a data scientist?
 
It’s cliche to say it now, but how much of my time would be spent getting data,
cleaning data, fixing bugs, trying to get pieces of code to run across multiple
environments, etc. The “nuts and bolts” aspect takes up so much of your time but
it’s what you’re probably least prepared for coming out of school.
4. How do you respond when you hear the phrase ‘big data’?
Indifference.
5. What is the most exciting thing about your field?
Probably that it’s just beginning to even be ‘a field.’ I suspect in five years
or so, the generalist ‘data scientist’ may not exist as we see more differentiation
into ‘data engineer’ or ‘experimentalist’ and so on. I’m excited about the
prospect of data scientists moving out of tech and into more traditional
companies. We’ve only really scratched the surface of what’s possible or,
amazingly, not located in San Francisco.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
A difficult question along the lines of “how long is a piece of string?” I think
the key is to communicate early and often, define success metrics as much as
possible at the *beginning* of a project, not at the end of a project. I’ve found
that “spending too long” / navel-gazing is a trope that many like to level at data
scientists, especially former academics, but as often as not, it’s a result of
goalpost-moving and requirement-changing from management. It’s important to manage
up, aggressively setting expectations, especially if you’re the only data scientist
at your company.
7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?
Honestly, I don’t believe I’ve met any executives who were dubious about the
value of data or data science. The challenge is often either a) to temper
unrealistic expectations about what is possible in a given time frame (we data
scientists mostly have ourselves to blame for this) or b) to convince them to
stay the course when the data reveal something unpleasant or unwelcome.
8. What is the most exciting thing you’ve been working on lately? Tell us a bit about it.
I’m about to start a new position as the first data scientist at ChefSteps, which
I’m very excited about, but I can’t tell you about what I’ve been working on there
as I haven’t started yet. Otherwise, the 4th Down Bot has been a really fun
project to work on. The NYT Graphics team is the best in the business and is
full of extremely smart and innovative people. It’s been amazing to see the
thought and time that they put into projects.
9. What is the biggest challenge of leading a data science team?
I’ve written a lot about unrealistic expectations that all data scientists
be “unicorns” and be experts in every possible field, so for me the hardest
part of building a team is finding the right people with complementary skills
that can work together amicably and constructively. That’s not special to
data science, though.

Interview with a Data Scientist: Natalie Hockham

I was very happy to interview Natalie about her data science work, as she gave a really cool Machine Learning focused talk at PyData in London this year, which was full of insights into the challenges of doing Machine Learning with imbalanced data sets.
Natalie leads the data team at GoCardless, a London startup specialising in online direct debit. She cut her teeth as a PhD student working on biomedical control systems before moving into finance, and eventually fintech. She is particularly interested in signal processing and machine learning and is presently swotting up on data engineering concepts, some knowledge of which is a must in the field.

What project have you worked on that you wish you could go back to and do better?

Before I joined a startup, I was working as an analyst on the trading floor of one of the oil majors. I spent a lot of time building out models to predict futures timespreads based on our understanding of oil stocks around the world, amongst other things. The output was a simple binary indication of whether the timespreads were reasonably priced, so that we could speculate accordingly. I learned a lot about time series regression during this time but worked exclusively with Excel and eViews. Given how much I’ve learned about open source languages, code optimisation, and process automation since working at GoCardless, I’d love to go back in time and persuade the old me to embrace these sooner.

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
Don’t underestimate the software engineers out there! These guys and girls have been coding away in their spare time for years and it’s with their help that your models are going to make it into production. Get familiar with OOP as quickly as you can and make it your mission to learn from the backend and platform engineers so that you can work more independently.

What do you wish you knew earlier about being a data scientist?

It’s not all machine learning. I meet with some really smart candidates every week who are trying to make their entrance into the world of data science, and machine learning is never far from the front of their minds. The truth is machine learning is only a small part of what we do. When we do undertake projects that involve machine learning, we do so because they are beneficial to the company, not just because we have a personal interest in them. There is so much other work that needs to be done including statistical inference, data visualization, and API integrations. And all this fundamentally requires spending vast amounts of time cleaning data.


How do you respond when you hear the phrase ‘big data’?

I haven’t had much experience with ‘big data’ yet but it seems to have superseded ‘machine learning’ on the hype scale. It definitely sounds like an exciting field – we’re just some way off going down this route at GoCardless.

What is the most exciting thing about your field?
Working in data is a great way to learn about all aspects of a business, and the lack of engineering resource that characterizes most startups means that you are constantly developing your own skill set. Given how quickly the field is progressing, I can’t see myself reaching saturation in terms of what I can learn for a long time yet. That makes me really happy.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Our 3 co-founders all started out as management consultants and the importance of accurately defining a problem from the outset has been drilled into us. Prioritisation is key – we mainly undertake projects that will generate measurable benefits right now. Before we start a project, we check that the problem actually exists (you’d be surprised how many times we’ve avoided starting down the wrong path because someone has given us incorrect information). We then speak to the relevant stakeholders and try to get as much context as possible, agreeing a (usually quantitative) target to work towards. It’s usually easy enough to communicate to people what their expectations should be. Then the scoping starts within the data team and the build begins. It’s important to recognise that things may change over the course of a project so keeping everyone informed is essential. Our system isn’t perfect yet but we’re improving all the time.

How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
Luckily, our management team is very embracing of data in general. Our data team naturally seeks out opportunities to meet with other data professionals to validate the work we’re doing. We try hard to make our work as transparent as possible to the rest of the company by giving talks and making our data widely available, so that helps to instill trust. Minor clashes are inevitable every now and then, which can put projects on hold, but we often come back to them later when there is a more compelling reason to continue.

What is the most exciting thing you’ve been working on lately? Tell us a bit about GoCardless.
We’ve recently overhauled our fraud detection system, which meant working very closely with the backend engineers for a prolonged period of time – that was a lot of fun.
GoCardless is an online direct debit provider, founded in 2011. Since then, we’ve grown to 60+ employees, with a data team of 3. Our data is by no means ‘big’ but it can be complex and derives from a variety of sources. We’re currently looking to expand our team with the addition of a data engineer, who will help to bridge the gap between data and platform.

What is the biggest challenge of leading a data science team?

The biggest challenge has been making sure that everyone is working on something they find interesting most of the time. To avoid losing great people, they need to be developing all the time. Sometimes this means bringing forward projects to provide interest and raise morale. Moreover, there are so many developments in the field that it’s hard to keep track, but attending meetups and interacting with other professionals means that we are always seeking out opportunities to put into practice the new things that we have learned.

Interviews with Data Scientists: NLP for the win


Recently I decided to do some quick Data Analysis of my interviews with data scientists.

It seems natural when you collect a lot of data to explore it and do some data analysis on it.

You can access the code here.
The code isn’t in much depth but it is a simple example of how to use NLTK, and a few other libraries in Python to do some quick data analysis of ‘unstructured’ data.

First question:

What does a word cloud of the data look like?

Word cloud of my Corpus based on interviews published on Dataconomy


Here we can see that science, PhD, big etc. all pop up a lot – which is not surprising given the subject matter.

Then I leveraged NLTK to do some word frequency analysis. Firstly I removed stop words, and punctuation.
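
The linked code uses NLTK; the sketch below is not that code. It is a minimal, self-contained illustration of the same idea (strip punctuation, drop stop words, count what remains) using only the Python standard library. The tiny stop-word list, the function name, and the sample sentence are all hypothetical stand-ins; NLTK ships a much fuller stop-word list.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "i"}


def word_frequencies(text, top_n=5):
    """Return the top_n most common words after removing punctuation and stop words."""
    # Replace every ASCII punctuation character with a space, then lowercase.
    # Note: curly quotes and other non-ASCII punctuation are not handled here.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    kept = [token for token in tokens if token not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


sample = "Data science is the science of data. Respect the data!"
print(word_frequencies(sample, top_n=2))  # [('data', 3), ('science', 2)]
```

On a real corpus you would want proper tokenization and a complete stop-word list, which is where a library such as NLTK earns its keep.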

I got the following result. Unsurprisingly, the most common word was data, followed by science. However, the other words are of interest, since they indicate what professional data scientists talk about with regard to their work.

Source: All interviews published on Dataconomy by me until the end of last week – which was the end of September 2015.

[Figure: bar chart of the most frequent words in the interviews]

Interview with a Data Scientist: Erik Bernhardsson

As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson, who is famous in the world of ‘Big Data’ for his open source contributions, his leadership of teams at Spotify, and his talks at various conferences.

1. What project that you have worked on do you wish you could go back to and do better?
Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time doing something that turns out to be nothing. But research is that way. Learning stuff is what matters and kind of by definition you have to do stupid shit before you learn it. Sorry for a super unclear answer 🙂
The main thing I did wrong for many years was I built all this cool stuff but never really made it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.
2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
Write a ton of code. Don’t watch TV 🙂
I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.
Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that by focusing on something 80% of your time being awake.
I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in the industry don’t care about and things that don’t necessarily correlate with a successful career. It’s easy to fall down into a rabbit hole and become extremely good at say deep learning (or anything), but at a company that means you’re just some expert that will have a hard time getting impact beyond your field. Looking back on my own situation I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to last question).
3. What do you wish you knew earlier about being a data scientist?
I don’t consider myself a data scientist so not sure 🙂
There’s a lot of definitions floating around about what a data scientist does. I have had this theory for a long time but just ran into a blog post the other day: https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6
I think it summarizes my own impression pretty well. There’s two camps, one is the “business insights” side, one is the “production ML engineer” side. I managed teams at Spotify on both sides. It’s very different.
If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think with ML you can generate these super interesting insights about a business but in my experience it’s very rare. Sometimes we had people coming in writing a master’s thesis about churn prediction and you can get a really high AUC but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.
For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s an “ML research team” and an “implementation team” and there’s a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x larger and incentives just get misaligned. So I think for anyone who wants to build cool ML algos, they should also learn backend and data engineering.
4. How do you respond when you hear the phrase ‘big data’?
Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of ram for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There are a lot of those problems in the industry for sure. Unfortunately sampling doesn’t always work.
The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10000 daily jobs and things rarely crash – especially if you use Luigi 🙂
But I’m sure there’s a fair amount of snake oil Hadoop consultants who convince innocent teams they need it.
The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.
5. What is the most exciting thing about your field?
Boring answer but I do think the progress in deep learning has been extremely exciting. Seems like every week there’s new cool applications.
I think even more useful is how tools and platforms are maturing. A few years ago every company wrote their own dashboards, A/B test infrastructure, log synchronization, workflow management, etc. It’s great that there’s more open source projects and that more useful tools are emerging.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.
It’s good enough when the opportunity cost outweighs the benefit 🙂 I.e. the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog full of 100s of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer but prioritization is probably the hardest problem to solve and it really just boils down to having some rules of thumb.
How Erik describes himself: I like to work with smart people and deliver great software. After 5+ years at Spotify, I just left for an exciting new startup in NYC where I am leading the engineering team.

At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many large scale machine learning algorithms we use to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, which is a workflow manager in Python with 3,000+ stars on Github – used by Foursquare, Quora, Stripe, Asana, etc.

When I was younger I participated in lots of programming competitions. My team was five times Nordic champions in programming (2003-2010) and I have an IOI gold medal (2003).