I recently met Radim at a Python conference in Florence.
You can visit his website at http://radimrehurek.com.
Radim has a PhD in Computer Science and helped design the excellent Gensim library.
So I sent him an email and he answered these questions. I lightly edited his answers.
A key piece of information to add is that Radim has been an independent consultant for a number of years. Radim has over 10 years of experience in industry. He’s trained and mentored others for a number of years in machine learning and data processing, and his experience includes content targeting, game dev, digital libraries and search engines.
So he’s well qualified to comment on Data Science – especially given his experience running his own consultancy and specialising in text analytics, NLP and search.
1. Which project that you have worked on do you wish you could go back to and do better?
We all learn constantly. But I have no failures gnawing at my conscience, no.
Or, to go a bit “meta”: the mental knobs and levers that decide where to push harder and where to let go, they have changed over time, yes. I’d spend more focus on understanding the global business perspective now, and less on technicalities, optimizations. I guess that’s a natural progression.
2. What advice do you have for younger analytics professionals, and in particular for PhD students in the sciences?
I wouldn’t presume to give out advice. Everybody has different goals and priorities in life.
Or, to quote a classic, “Try and be nice to people, avoid eating fat, read a good book every now and then, get some walking in, and try and live together in peace and harmony with people of all creeds and nations.”
Plus, I’d add that presentation and sales skills matter (more than you think). Some cultures are innately better at it than others; I prepared an advanced infographic for you, illustrating the painful difference:
3. What do you wish you knew earlier about being a data scientist?
How to value the initial problem cracking and scoping (the “business analysis”) properly.
You know, that stuff that happens before you write any code or design algorithms or whatever concrete work. Before the contract is even signed (I used to think).
When a new client came and wanted a quote for a project, I’d think really hard about the problem, research around, come up with a viable solution. (And generally, when people come to consultants for help, it’s not because the problem is well scoped and easy to solve.)
In retrospect, this is completely insane. That’s the most valuable part of consulting! But I thought that was expected, that I should already know this stuff (I’m an expensive consultant, right?!).
By the time I created a proposal, the problem was practically solved — broken into actionable steps, with reasonable time estimates and all. The client could just have a chuckle, go “good bye and thank you very much”, and hand my proposal out as specs to their own developers.
I’ve stopped doing that.
Silver lining: I got good at estimating all kinds of projects, in sundry domains and verticals 🙂
4. How do you respond when you hear the phrase ‘big data’?
Depends who says it and why. No predetermined generic reaction.
My view on hype in general: my anti-bullshit radar is notoriously biased. I’m very conservative. I only opened a Twitter account a year and a half ago! (come and say hi btw)
But we are in the middle of a data revolution, no doubt about it. So as long as you don’t capitalize Big Data, I’m good. It’s my job as a consultant to manage clients’ expectations of fads and choose the right technology.
5. What is the most exciting thing about your field?
Building stuff that makes a difference in the Real World™.
I left academia because it was too academic, industry employment because it was too menial and personally inert… Now I live and consult on the ethereal intersection of both.
On a related note, in my experience, a well-tuned system (search, recommendation, entity detection, query correction, classification, whatever) beats an application of the “latest exciting research paper” any day of the week. Beating a baseline of a system that has been tuned by domain experts and battle-hardened by years of experience is damn hard. Spectacular math formulas and the latest tech are cute and good PR, but the devil is always in the details.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
With honest communication (doh).
I learned to ask clients for sample data upfront, right at the beginning. Mock data is fine. Forces the client to concretize what they have and want, often with surprising (to them) results.
Also, clarify up front that problem analysis (framing the right questions, understanding their business domain, its constraints) is a paid part of the process. See above on “business analysis”. Contrary to popular opinion, the actual machine learning algorithm is a tiny, tiny component of a successful data mining project. Anybody can whip out a Naive Bayes classifier in a few hours from scratch (and NB-level stuff is all that many projects realistically need, despite what they read on the latest TechCrunch or Hacker News).
The “what’s good enough” & “how the data flows” & “how components integrate and update” and other tricky questions usually come out of the analysis, iterating over solutions and communicating with the client, fairly naturally.
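To make the remark about writing a Naive Bayes classifier from scratch concrete, here is a minimal sketch of what such a thing might look like. It is not Radim's code: the class name, the whitespace tokenizer, the add-one smoothing and the toy training sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example only.
texts = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project status meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("buy cheap pills"))
print(model.predict("monday project meeting"))
```

The algorithm fits in a few dozen lines, which is the point of the answer above: the scoping, the data flow and the integration questions are where the real effort goes.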