Interview with a Data Scientist: Trey Causey

Trey Causey is a blogger with experience as a professional data scientist in sports analytics and e-commerce. He has some fantastic views about the state of the industry, and I was privileged to read his answers.
1. What project have you worked on that you wish you could go back to and do better?
The easy and honest answer would be to say all of them. More concretely, I’d love
to have had more time to work on my current project, the NYT 4th Down Bot, before
going live. The mission of the bot is to show fans that there is an analytical
way to go about deciding what to do on 4th down (in American football), and that
the conventional wisdom is often too conservative. Doing this means you have to
really get the “obvious” calls correct as close to 100% of the time as possible,
but we all know how easy it is to wander down the path to overfitting in these
circumstances…
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?
Students should take as many methods classes as possible. They’re far more generalizable
than substantive classes in your discipline. Additionally, you’ll probably meet
students from other disciplines and that’s how constructive intellectual cross-fertilization
happens. Additionally, learn a little bit about software engineering (as distinct
from learning to code). You’ll never have as much time as you do right now for things
like learning new skills, languages, and methods.
For young professionals, seek out someone more senior than yourself, either at your
job or elsewhere, and try to learn from their experience. A word of warning, though:
it’s hard work and a big obligation to mentor someone, so don’t feel too bad if
you have a hard time finding someone willing to do this at first. Make it worth
their while and don’t treat it as your “right” that they spend their valuable
time on you. I wish this didn’t even have to be said.
3. What do you wish you knew earlier about being a data scientist?
 
It’s cliche to say it now, but how much of my time would be spent getting data,
cleaning data, fixing bugs, trying to get pieces of code to run across multiple
environments, etc. The “nuts and bolts” aspect takes up so much of your time but
it’s what you’re probably least prepared for coming out of school.
4. How do you respond when you hear the phrase ‘big data’?
Indifference.
5. What is the most exciting thing about your field?
Probably that it’s just beginning to even be ‘a field.’ I suspect in five years
or so, the generalist ‘data scientist’ may not exist as we see more differentiation
into ‘data engineer’ or ‘experimentalist’ and so on. I’m excited about the
prospect of data scientists moving out of tech and into more traditional
companies. We’ve only really scratched the surface of what’s possible or,
amazingly, not located in San Francisco.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
A difficult question along the lines of “how long is a piece of string?” I think
the key is to communicate early and often, define success metrics as much as
possible at the *beginning* of a project, not at the end of a project. I’ve found
that “spending too long” / navel-gazing is a trope that many like to level at data
scientists, especially former academics, but as often as not, it’s a result of
goalpost-moving and requirement-changing from management. It’s important to manage
up, aggressively setting expectations, especially if you’re the only data scientist
at your company.
7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?
Honestly, I don’t believe I’ve met any executives who were dubious about the
value of data or data science. The challenge is often either a) to temper
unrealistic expectations about what is possible in a given time frame (we data
scientists mostly have ourselves to blame for this) or b) to convince them to
stay the course when the data reveal something unpleasant or unwelcome.
8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.
I’m about to start a new position as the first data scientist at ChefSteps, which
I’m very excited about, but I can’t tell you about what I’ve been working on there
as I haven’t started yet. Otherwise, the 4th Down Bot has been a really fun
project to work on. The NYT Graphics team is the best in the business and is
full of extremely smart and innovative people. It’s been amazing to see the
thought and time that they put into projects.
9. What is the biggest challenge of leading a data science team?
I’ve written a lot about unrealistic expectations that all data scientists
be “unicorns” and be experts in every possible field, so for me the hardest
part of building a team is finding the right people with complementary skills
that can work together amicably and constructively. That’s not special to
data science, though.

Interview with a Data Scientist: Natalie Hockham

I was very happy to interview Natalie about her data science work. She gave a really cool machine-learning-focused talk at PyData in London this year, which was full of insights into the challenges of doing machine learning with imbalanced data sets.
Natalie leads the data team at GoCardless, a London startup specialising in online direct debit. She cut her teeth as a PhD student working on biomedical control systems before moving into finance, and eventually fintech. She is particularly interested in signal processing and machine learning and is presently swotting up on data engineering concepts, some knowledge of which is a must in the field.

What project have you worked on that you wish you could go back to and do better?

Before I joined a startup, I was working as an analyst on the trading floor of one of the oil majors. I spent a lot of time building out models to predict futures timespreads based on our understanding of oil stocks around the world, amongst other things. The output was a simple binary indication of whether the timespreads were reasonably priced, so that we could speculate accordingly. I learned a lot about time series regression during this time but worked exclusively with Excel and eViews. Given how much I’ve learned about open source languages, code optimisation, and process automation since working at GoCardless, I’d love to go back in time and persuade the old me to embrace these sooner.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Don’t underestimate the software engineers out there! These guys and girls have been coding away in their spare time for years and it’s with their help that your models are going to make it into production. Get familiar with OOP as quickly as you can and make it your mission to learn from the backend and platform engineers so that you can work more independently.

What do you wish you knew earlier about being a data scientist?

It’s not all machine learning. I meet with some really smart candidates every week who are trying to make their entrance into the world of data science, and machine learning is never far from the front of their minds. The truth is machine learning is only a small part of what we do. When we do undertake projects that involve machine learning, we do so because they are beneficial to the company, not just because we have a personal interest in them. There is so much other work that needs to be done including statistical inference, data visualization, and API integrations. And all this fundamentally requires spending vast amounts of time cleaning data.


How do you respond when you hear the phrase ‘big data’?

I haven’t had much experience with ‘big data’ yet but it seems to have superseded ‘machine learning’ on the hype scale. It definitely sounds like an exciting field – we’re just some way off going down this route at GoCardless.

What is the most exciting thing about your field?
Working in data is a great way to learn about all aspects of a business, and the lack of engineering resource that characterizes most startups means that you are constantly developing your own skill set. Given how quickly the field is progressing, I can’t see myself reaching saturation in terms of what I can learn for a long time yet. That makes me really happy.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Our 3 co-founders all started out as management consultants and the importance of accurately defining a problem from the outset has been drilled into us. Prioritisation is key – we mainly undertake projects that will generate measurable benefits right now. Before we start a project, we check that the problem actually exists (you’d be surprised how many times we’ve avoided starting down the wrong path because someone has given us incorrect information). We then speak to the relevant stakeholders and try to get as much context as possible, agreeing a (usually quantitative) target to work towards. It’s usually easy enough to communicate to people what their expectations should be. Then the scoping starts within the data team and the build begins. It’s important to recognise that things may change over the course of a project so keeping everyone informed is essential. Our system isn’t perfect yet but we’re improving all the time.

How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
Luckily, our management team is very embracing of data in general. Our data team naturally seeks out opportunities to meet with other data professionals to validate the work we’re doing. We try hard to make our work as transparent as possible to the rest of the company by giving talks and making our data widely available, so that helps to instill trust. Minor clashes are inevitable every now and then, which can put projects on hold, but we often come back to them later when there is a more compelling reason to continue.

What is the most exciting thing you’ve been working on lately and tell us a bit about GoCardless.
We’ve recently overhauled our fraud detection system, which meant working very closely with the backend engineers for a prolonged period of time – that was a lot of fun.
GoCardless is an online direct debit provider, founded in 2011. Since then, we’ve grown to 60+ employees, with a data team of 3. Our data is by no means ‘big’ but it can be complex and derives from a variety of sources. We’re currently looking to expand our team with the addition of a data engineer, who will help to bridge the gap between data and platform.

What is the biggest challenge of leading a data science team?

The biggest challenge has been making sure that everyone is working on something they find interesting most of the time. To avoid losing great people, they need to be developing all the time. Sometimes this means bringing forward projects to provide interest and raise morale. Moreover, there are so many developments in the field that it’s hard to keep track, but attending meetups and interacting with other professionals means that we are always seeking out opportunities to put into practice the new things that we have learned.

Interview with a Data Scientist: Thomas Wiecki


I interviewed Thomas Wiecki recently – Thomas is Data Science Lead at Quantopian Inc which is a crowd-sourced hedge fund and algotrading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year – which I found so fascinating that I decided to learn some PyMC3 🙂

1. What project have you worked on that you wish you could go back to and do better?
While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon (https://code.google.com/p/pynopticon/wiki/Introduction), even though it never gained any traction. I spent a lot of time developing a streamed data piping mechanism that was pretty general and flexible. This was in anticipation of the large size of data sets. In retrospect, though, it was overkill and I should have spent less time coming up with the best solution and instead spent time improving usability! Resources are limited and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Spend time learning the basics. This will make more advanced concepts much easier to understand, as they are merely extensions of core principles and integrate much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive in deep. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So if possible, I would advise people to study these things while still in school as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and have a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s “Doing Bayesian Data Analysis”; for fundamentals, “The Elements of Statistical Learning” is a classic).
3. What do you wish you knew earlier about being a data scientist?
How important non-technical skills are. Communication is key, but so are understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that you communicate with an expert audience. This certainly will not be the case in industry, where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.
As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.
4. How do you respond when you hear the phrase ‘big data’?
As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.
Then there’s the more technical interpretation, where it means that data increases in size and some data sets do not fit into RAM anymore. I’m still undecided as to whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like Hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of times, as a data scientist, I think you can get by by sub-sampling the data (Andreas Müller has a great talk on how to do ML on large data sets: https://www.youtube.com/watch?v=l43VIw5xhTg).
Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. However, with “big data”, the advanced inference algorithms fail to scale so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt: https://www.youtube.com/watch?v=pHsuIaPbNbY
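To make the sub-sampling idea above concrete, here is a minimal sketch of reservoir sampling, one common way to draw a fixed-size uniform sample from a stream that is too large to hold in RAM. It is an illustration only and is not taken from either of the linked talks; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper (classic "Algorithm R"): the stream is read once
    and only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1),
            # replacing a uniformly chosen existing slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 values from a generator that is never stored as a list.
sample = reservoir_sample((x * x for x in range(100_000)), k=5, seed=0)
print(sample)
```

Any analysis that tolerates sampling error can then run on `sample` instead of the full stream; whether that is acceptable depends on the problem.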
5. What is the most exciting thing about your field?
The fast pace at which the field is moving. It seems like every week there is another cool tool announced. Personally, I’m very excited about the blaze ecosystem, including dask, which has a very elegant approach to distributed analytics that relies on existing functionality in well-established packages like pandas instead of trying to reinvent the wheel. Data visualization is also coming along quite nicely; the current frontier seems to be interactive web-based plots and dashboards, as worked on by bokeh, plotly and pyxley.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
I try to keep the loop between data analysis and communication to consumers very tight. This also extends to any software for performing analyses, which I try to place into the hands of others even if it’s not perfect yet. That way there is little chance of veering too far off track, and there is a clearer sense of how usable something is. I suppose it’s borrowing from the agile approach and applying it to data science.

Interviews with Data Scientists: NLP for the win


Recently I decided to do some quick Data Analysis of my interviews with data scientists.

It seems natural when you collect a lot of data to explore it and do some data analysis on it.

You can access the code here.
The code isn’t in much depth but it is a simple example of how to use NLTK, and a few other libraries in Python to do some quick data analysis of ‘unstructured’ data.

First question:

What does a word cloud of the data look like?

Word cloud of my Corpus based on interviews published on Dataconomy

Above we can see that science, PhD, big, etc. all pop up a lot, which is not surprising given the subject matter.

Then I leveraged NLTK to do some word frequency analysis. Firstly I removed stop words, and punctuation.

I got the following result. Unsurprisingly, the most common word was data, followed by science. The other words are of more interest, since they indicate what professional data scientists talk about with regard to their work.

Source: All interviews published on Dataconomy by me until the end of last week – which was the end of September 2015.

(Bar chart of the most frequent words in the corpus)
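The word frequency step described above (strip punctuation, drop stop words, count what is left) can be sketched roughly as follows. This is not the code linked from the post: the original used NLTK, while this version sticks to the Python standard library so that it runs without downloading any corpora. The tiny `STOP_WORDS` set, the `word_frequencies` function and the sample text are all made up for illustration; NLTK ships a much fuller stop-word list.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "is", "in", "it",
    "that", "i", "what", "do", "you", "about",
}


def word_frequencies(text, top_n=5):
    """Return the top_n most common non-stop-words as (word, count) pairs."""
    # Lower-case the text and keep only runs of letters and apostrophes,
    # which drops punctuation and digits.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove stop words before counting.
    kept = [token for token in tokens if token not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


sample_text = (
    "Data science is not all machine learning. "
    "Cleaning data takes time, and data science is a team effort."
)
print(word_frequencies(sample_text, top_n=2))
# [('data', 3), ('science', 2)]
```

The resulting pairs can be passed straight to a plotting library to draw a bar chart like the one above.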

Interview with a Data Scientist: Erik Bernhardsson


As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson who is famous in the world of ‘Big Data’ for his open source contributions, his leading of teams at Spotify, and his various talks at various conferences.

1. What project have you worked on that you wish you could go back to and do better?
Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time doing something that turns out to be nothing. But research is that way. Learning stuff is what matters and kind of by definition you have to do stupid shit before you’ve learned it. Sorry for a super unclear answer 🙂
The main thing I did wrong for many years was I built all this cool stuff but never really made it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Write a ton of code. Don’t watch TV 🙂
I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.
Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that by focusing on something 80% of your time being awake.
I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in the industry don’t care about and things that don’t necessarily correlate with a successful career. It’s easy to fall down into a rabbit hole and become extremely good at say deep learning (or anything), but at a company that means you’re just some expert that will have a hard time getting impact beyond your field. Looking back on my own situation I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to last question).
3. What do you wish you knew earlier about being a data scientist?
I don’t consider myself a data scientist so not sure 🙂
There’s a lot of definitions floating around about what a data scientist does. I have had this theory for a long time but just ran into a blog post the other day: https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6
I think it summarizes my own impression pretty well. There’s two camps, one is the “business insights” side, one is the “production ML engineer” side. I managed teams at Spotify on both sides. It’s very different.
If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think with ML you can generate these super interesting insights about a business but in my experience it’s very rare. Sometimes we had people coming in writing a master’s thesis about churn prediction and you can get a really high AUC but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.
For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s an “ML research team” and an “implementation team” and there’s a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x larger and incentives just get misaligned. So I think for anyone who wants to build cool ML algos, they should also learn backend and data engineering.
4. How do you respond when you hear the phrase ‘big data’?
Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of ram for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There is a lot of those problems in the industry for sure. Unfortunately sampling doesn’t always work.
The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10000 daily jobs and things rarely crash – especially if you use Luigi 🙂
But I’m sure there’s a fair amount of snake oil Hadoop consultants who convince innocent teams they need it.
The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.
5. What is the most exciting thing about your field?
Boring answer but I do think the progress in deep learning has been extremely exciting. Seems like every week there’s new cool applications.
I think even more useful is how tools and platforms are maturing. A few years ago every company wrote their own dashboards, A/B test infrastructure, log synchronization, workflow management, etc. It’s great that there’s more open source projects and that more useful tools are emerging.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.
It’s good enough when the opportunity cost outweighs the benefit 🙂 I.e. the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog full of 100s of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer but prioritization is probably the hardest problem to solve and it really just boils down to having some rules of thumb.
How Erik describes himself: I like to work with smart people and deliver great software. After 5+ years at Spotify, I just left for a new, exciting startup in NYC where I am leading the engineering team.

At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many large scale machine learning algorithms we use to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, which is a workflow manager in Python with 3,000+ stars on Github – used by Foursquare, Quora, Stripe, Asana, etc.

When I was younger I participated in lots of programming competitions. My team was five times Nordic champions in programming (2003-2010) and I have an IOI gold medal (2003).

Interview with a Data Scientist: Rosaria Silipo


As part of my Interview with Data Scientists project I recently caught up with Rosaria – who is an active member of the Data Mining community.

Bio: Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing.

She is currently based in Zurich (Switzerland).

  1. What project have you worked on that you wish you could go back to and do better?

There is no such thing as the perfect project! However close you get to perfection, at some point you need to stop, either because time has run out, or because the money has run out, or because you just need to have a solution in production. I am sure I could go back to any of my past projects and find something to improve in each of them!

This is actually one of the biggest issues in data analytics projects: when do we stop? Of course, you need to identify some basic deliverables in the project’s initial phase, without which the project cannot be considered satisfactorily completed.

But once you have passed these deliverable milestones, when do you stop?
What is the right compromise between perfection and resource investment?

In addition, every few years some new technology becomes available that could help re-engineer your old projects, for speed or accuracy or both. So even the most perfect project solution can surely be improved after a few years thanks to new technologies. This is the case, for example, with the new big data platforms. Most of my old projects would now benefit from the speed-up a big data platform offers. This could help to speed up the training and deployment of old models, to create more complex data analytics models, and to optimize model parameters better.

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Use your time to learn! Data Science is a relatively new discipline that combines old knowledge, such as statistics and machine learning, with newer wisdom, like big data platforms and parallel computation. Not many people know everything here, really! So, take your time to learn what you do not know yet from the experts in that area.

Combining a few different pieces of data science knowledge probably makes you unique already in the data science landscape. The more different pieces of knowledge you have, the bigger the advantage for you in the data science ecosystem!

One way to get easy hands-on experience in a range of different application fields is to explore the Kaggle challenges.

Kaggle has a number of interesting challenges up every month and, who knows, you might also win some money!

  3. What do you wish you knew earlier about being a data scientist?

This answer is related to the previous one, since my advice to young data scientists stems from my earlier experience and failures. My early background is in machine learning. So, when I took my first steps in the data science world many years ago, I thought that knowledge of machine learning algorithms was all I needed. I wish! I had to learn that data science is the sum of many different skills, including data collection and data cleaning and transformation. The latter, for example, is highly underestimated! In all data science projects I have seen (not only mine), the data processing part takes way more than 50% of the resources used!

It also includes data visualization and data presentation. A brilliant solution is worth nothing if the executives and stakeholders cannot understand the results through a clear and compact representation! And so on. I guess I wish I had taken more time early on to learn from colleagues with a different set of skills from mine.

  4. How do you respond when you hear the phrase ‘big data’?

Do you really need big data? Sometimes customers ask for a big data platform just because. Then when you investigate deeper you realize that they really do not have and do not want to have such a big amount of data to take care of every day. A nice traditional DWH (Data Warehouse) solution is definitely enough for them.

Sometimes, though, a big data solution is really needed, or at least it will be needed.

  5. What is the most exciting thing about your field?

Probably, the variety of applications. The whole knowledge of data collection, data warehousing, data analytics, data visualization, results inspection and presentation is transversal to a number of application fields. You would be surprised at how many different applications can be designed using a variation of the same data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I always propose a first pilot/investigation mini-project at the very beginning. This is for me to get a better idea of the application specs, of the data set, and yes, also of the customer. This is a crucial phase, though short. During this part, in fact, I can take the measure of the project in terms of the time and resources needed, and the customer and I can study each other and adjust our expectations about input data and final results. This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to produce the requested results.

Once this part is successful and expectations have been adjusted on both sides, the real project can start.

7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Ah … I am really not a very good example of dealing with stakeholders and executives and successfully managing cultural challenges! Usually, I rely on external collaborators to handle this part for me, also because of time constraints.

I see myself as a technical professional, with little time for talking and convincing. That is unfortunate, because this is a big part of every data analytics project.

However, when I have to deal with it myself, I let the facts speak for me: final or intermediate results of current and past projects. This is the easiest way to convince stakeholders that the project is worth the time and the money. Just in case, though, I always have at hand a set of slides with previous accomplishments to present to executives if and when needed.

8. Tell us about something cool you’ve been doing in Data Science lately.

My latest project was about anomaly detection in industry. I found it a very interesting problem to solve, where skills and expertise have to meet creativity. In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to allow. What you have is a data set of records of normal functioning of the machine, transactions, system, or whatever it is you are observing. The challenge then is to predict anomalies before they happen and without previous historical examples. That is where the creativity comes in. Traditional machine learning algorithms need a twist in application to provide an adequate solution for this problem.
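
One simple way to work with only normal records can be sketched as follows. This is an illustration under stated assumptions, not the approach used in the project described above: it learns a mean and standard deviation from normal readings and flags new readings that fall too far away. The sensor values, the function names, and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def fit_normal_profile(normal_readings):
    """Learn the mean and standard deviation from normal-operation data only."""
    return mean(normal_readings), stdev(normal_readings)


def is_anomalous(reading, profile, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(reading - mu) > threshold * sigma


# Hypothetical sensor readings recorded while the machine was working normally.
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
profile = fit_normal_profile(normal_readings)

# New readings are judged only against the normal profile;
# no labelled anomalies were needed to build it.
new_readings = [20.0, 20.4, 23.5]
flags = [is_anomalous(r, profile) for r in new_readings]
print(flags)
```

Real systems would likely need something richer than a single threshold, but the structure is the same: model normal behaviour, then measure how far new observations depart from it.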

Interview with a Data Scientist: Erin Shellman

I recently caught up with Erin for an interview. Her interview is full of nice pieces of hard-earned advice and her final answer on Data Governance is gold!
Erin writes some great posts on her blog, which I recommend. Erin is a programmer + statistician working as a research scientist at Amazon Web Services. Before that she was a Data Scientist in the Nordstrom Data Lab, where she primarily built product recommendations for Nordstrom.com. She mostly codes in Scala, Python and R, but dabbles in JavaScript to put data on the internet. Erin loves to teach and speak, and does both often through talks, as co-organizer of PyLadies-Seattle, and as an instructor at the University of Washington’s Professional and Continuing Education program.
 
1. What project have you worked on do you wish you could go back to, and do better?
Often the goal of data science projects is to automate processes with data; I worked on a lot of projects at Nordstrom with that goal. I think we were pretty naive in those pursuits, often approaching the problems with low empathy and EQ (Emotional Quotient). We built tools, expecting that the teams we were trying to automate would immediately see the value and jump to use them, but we didn’t spend a lot of time listening and trying to understand why some might be hesitant to adopt our tools. Eventually, I started training people and specifically asking them to send bug reports or feature requests. The trainings opened up dialog about our plans and made the other teams more invested, because they could see when their bugs were fixed and their features implemented. I learned that doing the data work is only half (or less) of the challenge; the other half is advocating for your work in such a way that others are similarly compelled.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
If you’re in school right now, use this time to master a programming language (you have more time than you ever will again despite what you may believe). For data science, I’d recommend Python, R or Scala (and if you had to choose one, Python). You absolutely need to be able to produce high-quality code before you walk in the door because chances are you’ll be asked to code early in the interview process.
I also think you shouldn’t spend too much time “training” and learning in your free time; it’s nearly impossible to retain knowledge that way. Instead, spend all your time shoring up the essentials and work on getting a job immediately. You’ll learn so much more on the job than you could ever hope to on your own, plus you’ll be paid. Don’t wait for postings for junior data scientists (I don’t know that I’ve ever even seen one). Contact employers you’re interested in working with directly and ask them to make that role for you. You should look for places where you know there’s a solid data team already so you have plenty of people to learn from. Academics tend to have a sort of learned helplessness because they’re so often not in control of their work or careers. This is not the case in industry. If you want something, don’t wait for it to come to you (it won’t). Be an active participant in your future.
3. What do you wish you knew earlier about being a data scientist?
I wish I had spent more time in grad school learning computer science. Often DS (Data Science) jobs end up being almost the same as CS (Computer Science) jobs, and in my case I had to pick up a lot of CS skills on the job.
4. How do you respond when you hear the phrase ‘big data’?
Usually by rolling my eyes so far into the back of my head that they get stuck. I think the return on investment of Any Data is still higher than that of Big Data. Most shops who’re convinced that they need big data technology don’t make use of the data they have already, and adding more data to the pile won’t help the cause.
5. What is the most exciting thing about your field?
The most exciting thing is that I get to learn for a living. Every time I switch jobs or work on something new I have to learn a ton: different technologies and languages, different domains, and different businesses. I especially love that data science is often so close to the business. I love learning about what makes a business successful and providing knowledge to help businesses make better decisions.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
When I’m approaching a new problem I focus really hard on the inputs and outputs, particularly the output. What exactly are you trying to produce, or trying to answer? This is often a question I pose to business stakeholders to encourage them to think critically about what they really want to know, how it will be applied, and how to formally articulate it. Basically what I encourage them to do is state a formal hypothesis and the observations required to test that hypothesis. Once we’ve all agreed on the output, what are the inputs? I try to make this as specific as possible, so no “customer data”-level descriptions. Tell me exactly what the inputs are, e.g. annual customer spend, age, and zip code. The more you can reason through the solution in terms of inputs and outputs before you set out to solve the problem, the less likely it will be that you’re halfway to answering a question that was ill-posed (I promise, this is 90% of requests), or that you don’t have data to support (this is probably another 5% of requests). It’s also a good way to prevent “stakeholder punting”, which is a phrase I made up just now to describe when stakeholders make half-baked requests and then leave them for you to sort out. Data science and research are highly collaborative, and the data scientist shouldn’t be the only one invested in the work.
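To make the inputs-and-outputs framing concrete, here is a minimal sketch of writing a request down as a specification and checking it against the data that actually exists. The class name, the example hypothesis, the output description, and the column names are all hypothetical and chosen only for illustration; only the three example inputs come from the answer above.

```python
from dataclasses import dataclass


@dataclass
class ProblemSpec:
    """A request written down as a hypothesis, an output, and exact inputs."""
    hypothesis: str
    output: str
    inputs: list

    def missing_inputs(self, available_columns):
        """Return the required inputs that the available data does not cover."""
        return [name for name in self.inputs if name not in available_columns]


# Hypothetical request, stated with specific inputs rather than "customer data".
spec = ProblemSpec(
    hypothesis="Customers with higher annual spend respond more often to a promotion",
    output="response rate per annual-spend bracket",
    inputs=["annual_customer_spend", "age", "zip_code"],
)

# Hypothetical columns that the data team can actually provide.
available = {"annual_customer_spend", "zip_code", "signup_date"}
missing = spec.missing_inputs(available)
print(missing)
```

A non-empty result is a signal to go back to the stakeholder before any modeling starts.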
Once the inputs and outputs are defined, I like to draw flowcharts of the path to completion, and it’s usually easier to start from the bottom. Here’s an example I created for the students in my data mining course. They were working on prediction of a continuous outcome with various regression methods. First we decided on a criterion for model selection, which in this case was the model with the lowest root mean squared error. You can see that the input is a data file, and the output is whichever model had the best predictive accuracy as measured by the lowest RMSE (Root Mean Square Error). For me, diagramming your work like this makes your goal completely concrete.
[Figure: flowchart of the modeling process, from the input data file to the model with the lowest RMSE]
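The selection step described above can be sketched in a few lines. This is an illustration, not the course material itself: the data points, the train/test split, and the two candidate models (a mean baseline and a least-squares line) are assumptions made for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Hypothetical data standing in for the input data file.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
test_x, test_y = [7, 8], [14.0, 16.2]

# Fit every candidate on the training data and score it on held-out data.
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {}
for name, fit in candidates.items():
    model = fit(train_x, train_y)
    scores[name] = rmse(test_y, [model(x) for x in test_x])

# The output of the whole process: the model with the lowest RMSE.
best_model = min(scores, key=scores.get)
print(best_model, scores)
```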
The other really great thing about framing problems this way is that it makes it very easy to estimate effort and communicate to others what is required to complete the projects. For whatever reason, people often assume that while software engineers need 2 weeks to add a minor feature, data scientists need about 6 hours to do complete analyses and make beautiful visualizations. Communicating the amount of work required to complete projects to the requesters is crucial in data science, because most people just don’t know. It’s not something software engineers typically have to do, but providing guidance on the components of a data science project to your stakeholders will reduce your stress in the long run.
7. What does data governance or data quality mean to you as a data scientist?
Data governance is the collection of processes and protocols to which an organization conforms to ensure data accuracy and integrity. Most of the time I’m a data consumer, so I depend on a mature data infrastructure team to create the pipelines I use to collect and analyze data. When I was working on recommendations at Nordstrom, I was both a consumer and a provider. I provided data in the sense that the output of my recommendation algorithms was data consumed by the web team. Data governance in that context meant writing lots of unit tests to make sure my computations produced correctly formatted entries. It also meant applying business rules, for example, removing entries for products out of stock, or applying brand restrictions.
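As a rough illustration of what such checks might look like, the sketch below filters recommendations by stock and brand rules and tests the format of each entry. The field names, the score range, and the sample records are hypothetical; nothing here reflects the actual Nordstrom schema or code. The tests are plain assert-based functions of the kind a runner such as pytest would pick up.

```python
def apply_business_rules(recommendations, in_stock, restricted_brands):
    """Drop out-of-stock products and products from restricted brands."""
    return [
        rec
        for rec in recommendations
        if rec["product_id"] in in_stock and rec["brand"] not in restricted_brands
    ]


def is_well_formed(rec):
    """Check that an entry has the expected fields, types, and score range."""
    return (
        isinstance(rec.get("product_id"), str)
        and isinstance(rec.get("brand"), str)
        and isinstance(rec.get("score"), float)
        and 0.0 <= rec["score"] <= 1.0
    )


# Hypothetical output of a recommendation algorithm.
RAW = [
    {"product_id": "p1", "brand": "A", "score": 0.91},
    {"product_id": "p2", "brand": "B", "score": 0.85},
    {"product_id": "p3", "brand": "A", "score": 0.40},
]


def test_entries_are_well_formed():
    assert all(is_well_formed(rec) for rec in RAW)


def test_business_rules_are_applied():
    # p2 is from a restricted brand and p3 is out of stock, so only p1 remains.
    kept = apply_business_rules(RAW, in_stock={"p1", "p2"}, restricted_brands={"B"})
    assert [rec["product_id"] for rec in kept] == ["p1"]
```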