I recently caught up with Erin for an interview. Her interview is full of nice pieces of hard-earned advice and her final answer on Data Governance is gold!
1. What project that you have worked on do you wish you could go back to and do better?
Often the goal of data science projects is to automate processes with data, and I worked on a lot of projects at Nordstrom with that goal. I think we were pretty naive in those pursuits, often approaching the problems with low empathy and EQ (Emotional Quotient). We built tools, expecting that the teams we were trying to automate would immediately see the value and jump to use them, but we didn’t spend a lot of time listening and trying to understand why some might be hesitant to adopt our tools. Eventually, I started training people and specifically asking them to send bug reports or feature requests. The trainings opened up dialog about our plans and made the other teams more invested, because they could see when their bugs were fixed and their features implemented. I learned that doing the data work is only half (or less) of the challenge; the other half is advocating for your work in such a way that others are similarly compelled.
2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?
If you’re in school right now, use this time to master a programming language (you have more time than you ever will again despite what you may believe). For data science, I’d recommend Python, R or Scala (and if you had to choose one, Python). You absolutely need to be able to produce high-quality code before you walk in the door because chances are you’ll be asked to code early in the interview process.
I also think you shouldn’t spend too much time “training” and learning in your free time; it’s nearly impossible to retain knowledge that way. Instead, spend all your time shoring up the essentials and work on getting a job immediately. You’ll learn so much more on the job than you could ever hope to on your own, plus you’ll be paid. Don’t wait for postings for junior data scientists (I don’t know that I’ve ever even seen one); contact employers you’re interested in working with directly and ask them to make that role for you. You should look for places where you know there’s a solid data team already so you have plenty of people to learn from. Academics tend to have a sort of learned helplessness because they’re so often not in control of their work or careers. This is not the case in industry: if you want something, don’t wait for it to come to you (it won’t). Be an active participant in your future.
3. What do you wish you knew earlier about being a data scientist?
I wish I had spent more time in grad school learning computer science. Often DS (Data Science) jobs end up being almost the same as CS (Computer Science) jobs, and in my case I had to pick up a lot of CS skills on the job.
4. How do you respond when you hear the phrase ‘big data’?
Usually by rolling my eyes so far into the back of my head that they get stuck. I think the return on investment of Any Data is still higher than that of Big Data. Most shops that are convinced they need big data technology don’t make use of the data they already have, and adding more data to the pile won’t help the cause.
5. What is the most exciting thing about your field?
The most exciting thing is that I get to learn for a living. Every time I switch jobs or work on something new I have to learn a ton: different technologies and languages, different domains, and different businesses. I especially love that data science is often so close to the business. I love learning about what makes a business successful and providing knowledge to help businesses make better decisions.
6. How do you go about framing a data problem? In particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
When I’m approaching a new problem I focus really hard on the inputs and outputs, particularly the output. What exactly are you trying to produce, or trying to answer? This is often a question I pose to business stakeholders to encourage them to think critically about what they really want to know, how it will be applied, and how to formally articulate it. Basically what I encourage them to do is state a formal hypothesis and the observations required to test that hypothesis. Once we’ve all agreed on the output, what are the inputs? I try to make this as specific as possible, so no “customer data”-level descriptions. Tell me exactly what the inputs are, e.g. annual customer spend, age, and zip code. The more you can reason through the solution in terms of inputs and outputs before you set out to solve the problem, the less likely it will be that you’re halfway to answering a question that was ill-posed (I promise, this is 90% of requests), or that you don’t have data to support (this is probably another 5% of requests). It’s also a good way to prevent “stakeholder punting”, which is a phrase I made up just now to describe when stakeholders make half-baked requests and then leave them for you to sort out. Data science and research are highly collaborative, and the data scientist shouldn’t be the only one invested in the work.
Once the inputs and outputs are defined, I like to draw flowcharts of the path to completion, and it’s usually easier to start from the bottom. Here’s an example I created for the students in my data mining course. They were working on prediction of a continuous outcome with various regression methods. First we decided on a criterion for model selection, which in this case was the lowest root mean squared error (RMSE). You can see that the input is a data file, and the output is whichever model had the best predictive accuracy as measured by the lowest RMSE. For me, diagramming your work like this makes your goal completely concrete.
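To make the flow concrete, here is a minimal sketch of that input-to-output path in Python. It is not the diagram or code from the course: the two candidate models (a mean baseline and a one-variable least-squares line), the function names, and the toy numbers are all assumptions chosen for illustration. The only part taken from the description above is the selection rule, which keeps whichever model has the lowest RMSE on held-out data.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Ordinary least squares with a single predictor (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def select_model(
    train: Tuple[Sequence[float], Sequence[float]],
    holdout: Tuple[Sequence[float], Sequence[float]],
    fitters: Dict[str, Fitter],
) -> Tuple[str, Dict[str, float]]:
    """Fit every candidate on train, score on holdout, return the lowest-RMSE name."""
    train_x, train_y = train
    hold_x, hold_y = holdout
    scores = {}
    for name, fit in fitters.items():
        model = fit(train_x, train_y)
        scores[name] = rmse(hold_y, [model(x) for x in hold_x])
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (made up for illustration); in practice this would come from the data file.
train = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0])
holdout = ([7, 8], [14.1, 15.9])

best, scores = select_model(train, holdout, {"mean": fit_mean, "line": fit_line})
print(best, scores)
```

Writing the selection step as a single function with named inputs and a named output mirrors the flowchart idea: anyone reading it can see what goes in, what comes out, and which criterion decides the winner.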
The other really great thing about framing problems this way is that it makes it very easy to estimate effort and communicate to others what is required to complete the projects. For whatever reason, people often assume that while software engineers need 2 weeks to add a minor feature, data scientists need about 6 hours to do complete analyses and make beautiful visualizations. Communicating the amount of work required to complete projects to the requesters is crucial in data science, because most people just don’t know. It’s not something software engineers typically have to do, but providing guidance on the components of a data science project to your stakeholders will reduce your stress in the long run.
7. What does data governance or data quality mean to you as a data scientist?
Data governance is the collection of processes and protocols to which an organization conforms to ensure data accuracy and integrity. Most of the time I’m a data consumer, so I depend on a mature data infrastructure team to create the pipelines I use to collect and analyze data. When I was working on recommendations at Nordstrom, I was both a consumer and a provider. I provided data in the sense that the output of my recommendation algorithms was data consumed by the web team. Data governance in that context meant writing lots of unit tests to make sure my computations produced correctly formatted entries. It also meant applying business rules, for example, removing entries for products out of stock, or applying brand restrictions.
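As a rough illustration of what those checks might look like, here is a small Python sketch. The entry schema (`product_id`, `brand`, `score`, `in_stock`), the function names, and the brand names are hypothetical and are not Nordstrom’s actual format or rules. The sketch only shows the two ideas mentioned above: validating that each entry is correctly formatted, and filtering by business rules before handing data to a consumer.

```python
from typing import Dict, List, Set

# Hypothetical schema for one recommendation entry: field name -> expected type.
REQUIRED_FIELDS = {"product_id": str, "brand": str, "score": float, "in_stock": bool}


def is_well_formed(entry: Dict) -> bool:
    """True if every required field is present and has the expected type."""
    return all(
        isinstance(entry.get(field), expected)
        for field, expected in REQUIRED_FIELDS.items()
    )


def apply_business_rules(entries: List[Dict], restricted_brands: Set[str]) -> List[Dict]:
    """Drop out-of-stock products and products from restricted brands."""
    return [
        e for e in entries
        if e["in_stock"] and e["brand"] not in restricted_brands
    ]


def prepare_recommendations(entries: List[Dict], restricted_brands: Set[str]) -> List[Dict]:
    """Validate formatting first, then apply business rules."""
    malformed = [e for e in entries if not is_well_formed(e)]
    if malformed:
        raise ValueError(f"{len(malformed)} malformed recommendation entries")
    return apply_business_rules(entries, restricted_brands)


# Made-up sample output from a recommendation step.
raw = [
    {"product_id": "A1", "brand": "BrandX", "score": 0.91, "in_stock": True},
    {"product_id": "B2", "brand": "BrandY", "score": 0.85, "in_stock": False},
    {"product_id": "C3", "brand": "BrandZ", "score": 0.80, "in_stock": True},
]

cleaned = prepare_recommendations(raw, restricted_brands={"BrandZ"})
print([e["product_id"] for e in cleaned])
```

Unit tests for a pipeline like this would typically assert that well-formed input passes through, that malformed input is rejected loudly rather than silently dropped, and that each business rule removes exactly the entries it should.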