This is part of my ongoing series of interviews with Data Scientists. Thomas Levi is a Data Scientist at Plenty of Fish (POF), an online dating website based in Vancouver. Thomas has a background in theoretical physics and at one point worked on string theory. Recently a lot of his work has involved topic models and other cool algorithms.
I present here a lightly edited version of his interview.
- What project have you worked on that you wish you could go back to and do better?
You know, that’s a really hard question. The one that comes to mind is a study I did on user churn. The idea was to look at the first few hours or day of a user’s experience on the site and see if I could predict whether they would still be active at various windows. The goal here was twofold: first, to make a reasonably good predictive model, and second, to identify the key actions users take that lead to them becoming engaged or deleting their account, so that we could improve the user experience. The initial study I did was actually pretty solid. I had to do a lot of work to collect, clean and process the data (the sets were very large, so parallel querying, wrangling and de-normalization came into play) and then build some relatively simple models. Those models worked well and offered some insights. The study sort of stopped there, though, as other priorities took over, and the thinking was that I could always go back to it. Then we switched over our data pipe, and a large chunk of that data was inadvertently lost, so the studies can’t be repeated or revisited. I wish I could go back and either save the sets, make sure we didn’t lose them, or do more initial work to advance the study. I still hope to get back to that someday, but it will have to wait until our new data pipe is fully in place.
- What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?
Learn statistics. It’s the single best piece of advice I can offer. So many people equate Data Science with Machine Learning, and while that’s a large part of it, stats is just as important, if not more so. A lot of people don’t realize that machine learning is basically a set of computational techniques for model fitting in statistics. A solid background in statistics really informs choices about model building, and it opens up whole other fields like experiment design and testing. When I interview or speak to people, I’m often surprised by how many can tell me about deep learning, but not basic regression or how to run a split test properly.
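To make the split test point concrete, here is a minimal sketch of one common way to compare two arms of a test: a two-sided, two-proportion z-test on conversion rates. This is an illustration only and not something described in the interview; the function name, the argument names and the example counts are hypothetical, and a proper test would also involve choosing the sample size and stopping rule in advance.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between two arms.

    conv_a, conv_b: number of converting users in each arm (hypothetical inputs)
    n_a, n_b: total number of users in each arm
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The normal approximation used here assumes reasonably large samples in both arms; with small counts an exact test would be the safer choice.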
- What do you wish you knew earlier about being a data scientist?
See above: I wish I had known more stats when I first started out. The other thing I’m still learning myself is communication. Coming from an academic background, I had a lot of practice giving talks and teaching. Nearly all of that, however, was to other specialists, or at least to others with a similar background. In my current role, I interact a lot more with people who aren’t PhD-level scientists, or aren’t even technical. Learning to communicate with them and still get my points across is an ongoing challenge. I wish I had had a bit more practice with that sort of thing earlier on.
- How do you respond when you hear the phrase ‘big data’?
Honestly? I shudder. That phrase has become such a buzzword it’s pretty much lost all meaning. People throw it around just about everywhere at this point. The other bit that makes me shudder is when people tell me all about the size of their dataset, or how many nodes are in their cluster. Working with a very large amount of data can be exciting, but only insofar as the data itself is. I find there’s a growing culture of people who think the best way to solve a problem is to add more data and more features, which falls into the trap of overfitting and overly complicated models. There’s a reason things like sampling theory and feature selection exist, and it’s important to question if you’re using a “big data” set because you really need it for the problem at hand, or because you want to say you used one. That said, there are some problems and algorithms that require truly huge amounts of input, and some problems where aggregating or summarizing requires processing a very large amount of raw data, and then it’s definitely appropriate.
I suppose I should actually define the term as I see it. To me, “big data” is any data size where the processing, storing and querying of the data becomes a difficult problem unto itself. I like that definition because it’s operational: it defines when I need to change the way I think about and approach a problem. I also like it because it scales. Today a data set of a particular size might be a “big data” set, but in a few years it won’t be, and something else will. My definition will still hold.
- What is the most exciting thing about your field?
What excites me is applying all of this machinery to actual real world problems. To me, it’s always about the insights and applications to our actual human experience. At POF that comes down to seeing how people interact, match up, etc. It’s most exciting to me when those insights bump up against our assumed standard lore. Moving beyond POF, I see the same sort of approach in a lot of other really interesting areas, whether it be baseball stats, political polling or healthcare. There are a lot of really interesting questions about the human condition that we can start to address with Data Science.
- How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
I think it comes down to the problem itself, and by that I mean what the business needs are. There have been times where something that just worked decently was needed on a very short time frame, and other times where designing the best system was the deciding factor (e.g. when I built automatic scam and fraud detection, which took about six months). For every problem or task at hand, I usually try to scope the requirements and the desired performance constraints and then start iterating. Sometimes the simplest model is accurate enough, and spending a large chunk of time for a small marginal gain really isn’t worth it. The other issue is whether this is a model for a report or something that has to run in production, like scam detection. Designing for production adds a host of other requirements, such as performance, specific architectures and much more stringent error handling, which greatly increase the time spent. It’s also important to remember that there’s nothing preventing you from going back to a problem. If the original model isn’t quite up to snuff anymore, or someone wants more accurate predictions, you just go back. To that end, well-documented, version-controlled code and notebooks are essential.