Before I joined a startup, I was working as an analyst on the trading floor of one of the oil majors. I spent a lot of time building out models to predict futures timespreads based on our understanding of oil stocks around the world, amongst other things. The output was a simple binary indication of whether the timespreads were reasonably priced, so that we could speculate accordingly. I learned a lot about time series regression during this time but worked exclusively with Excel and eViews. Given how much I’ve learned about open source languages, code optimisation, and process automation since working at GoCardless, I’d love to go back in time and persuade the old me to embrace these sooner.
Don’t underestimate the software engineers out there! These guys and girls have been coding away in their spare time for years and it’s with their help that your models are going to make it into production. Get familiar with OOP as quickly as you can and make it your mission to learn from the backend and platform engineers so that you can work more independently.
What do you wish you knew earlier about being a data scientist?
It’s not all machine learning. I meet with some really smart candidates every week who are trying to make their entrance into the world of data science and machine learning is never far from the front of their minds. The truth is machine learning is only a small part of what we do. When we do undertake projects that involve machine learning, we do so because they are beneficial to the company, not just because we have a personal interest in them. There is so much other work that needs to be done including statistical inference, data visualization, and API integrations. And all this fundamentally requires spending vast amounts of time cleaning data.
How do you respond when you hear the phrase ‘big data’?
The biggest challenge has been making sure that everyone is working on something they find interesting most of the time. To avoid losing great people, they need to be developing all the time. Sometimes this means bringing forward projects to provide interest and raise morale. Moreover, there are so many developments in the field that it's hard to keep track, but attending meetups and interacting with other professionals means that we are always seeking out opportunities to put into practice the new things that we have learned.
I interviewed Thomas Wiecki recently – Thomas is Data Science Lead at Quantopian Inc, a crowd-sourced hedge fund and algorithmic-trading platform. Thomas is a cool guy and gave a great talk in Luxembourg last year – which I found so fascinating that I decided to learn some PyMC3 🙂
1. What project that you have worked on do you wish you could go back to and do better?
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
3. What do you wish you knew earlier about being a data scientist?
4. How do you respond when you hear the phrase ‘big data’?
5. What is the most exciting thing about your field?
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Recently I decided to do some quick Data Analysis of my interviews with data scientists.
It seems natural when you collect a lot of data to explore it and do some data analysis on it.
You can access the code here.
The code isn’t very in-depth, but it is a simple example of how to use NLTK and a few other Python libraries to do some quick analysis of ‘unstructured’ data.
What does a word cloud of the data look like?
We can see above that ‘science’, ‘PhD’, ‘big’, etc. all pop up a lot – which is not surprising given the subject matter.
Then I leveraged NLTK to do some word frequency analysis. First, I removed stop words and punctuation.
I got the following result – unsurprisingly, the most common word was ‘data’, followed by ‘science’. The other words are of interest too, since they indicate what professional data scientists talk about in regard to their work.
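For reference, that kind of pipeline can be sketched in a few lines. The version below is a stdlib-only approximation, not the original analysis code: a hand-rolled stop-word list stands in for NLTK's stopwords corpus, and a regex stands in for `nltk.word_tokenize`.

```python
import re
from collections import Counter

# A tiny hand-rolled stop-word list, standing in for NLTK's stopwords corpus.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that",
              "it", "on", "for", "with", "as", "do", "you", "i", "we"}

def word_frequencies(text, top_n=5):
    """Lowercase, tokenize, drop stop words (punctuation is discarded by the
    regex itself), and count what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    words = [t for t in tokens if t not in STOP_WORDS]
    return Counter(words).most_common(top_n)
```

With NLTK itself, you would instead tokenize with `nltk.word_tokenize(text)` and filter against `nltk.corpus.stopwords.words('english')`.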
Source: All interviews published on Dataconomy by me until the end of last week – which was the end of September 2015.
As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson, who is well known in the world of ‘Big Data’ for his open source contributions, his leadership of teams at Spotify, and his talks at various conferences.
At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many of the large-scale machine learning algorithms used to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, a workflow manager in Python with 3,000+ stars on Github – used by Foursquare, Quora, Stripe, Asana, etc.
When I was younger I participated in lots of programming competitions. My team was five times Nordic champions in programming (2003-2010) and I have an IOI gold medal (2003).
As part of my Interview with Data Scientists project I recently caught up with Rosaria – who is an active member of the Data Mining community.
Bio: Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing.
She is currently based in Zurich (Switzerland).
- What project that you have worked on do you wish you could go back to and do better?
There is no such thing as the perfect project! However close you get to perfection, at some point you need to stop, either because the time is up, or because the money has run out, or because you just need a production-ready solution. I am sure I could go back to all my past projects and find something to improve in each of them!
This is actually one of the biggest issues in a data analytics project: when do we stop? Of course, you need to identify some basic deliverables in the project’s initial phase, without which the project is not satisfactorily completed.
But once you have passed these deliverable milestones, when do you stop?
What is the right compromise between perfection and resource investment?
In addition, every few years some new technology becomes available which could help re-engineer your old projects, for speed or accuracy or both. So even the most perfect project solution can surely be improved after a few years thanks to new technologies. This is, for example, the case with the new big data platforms. Most of my old projects would now benefit from a big-data-based speed-up. This could help to speed up the training and deployment of old models, to create more complex data analytics models, and to optimize model parameters better.
- What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Use your time to learn! Data Science is a relatively new discipline that combines old knowledge, such as statistics and machine learning, with newer wisdom, like big data platforms and parallel computation. Not many people know everything here, really! So, take your time to learn what you do not know yet from the experts in each area.
Combining a few different pieces of data science knowledge probably already makes you unique in the data science landscape. The more pieces of knowledge you combine, the bigger your advantage in the data science ecosystem!
One easy way to get hands-on experience across a range of application fields is to explore the Kaggle challenges.
Kaggle puts up a number of interesting challenges every month, and who knows – you might even win some money!
- What do you wish you knew earlier about being a data scientist?
This answer is related to the previous one, since my advice to young data scientists sprouts from my earlier experience and failures. My early background is in machine learning. So, when I took my first steps into the data science world many years ago, I thought that knowledge of machine learning algorithms was all I needed. I wish! I had to learn that data science is the sum of many different skills, including data collection and data cleaning and transformation. The latter, for example, is highly underestimated! In all the data science projects I have seen (not only mine), the data processing part takes way more than 50% of the resources!
The same goes for data visualization and data presentation. A brilliant solution is worth nothing if the executives and stakeholders cannot understand the results through a clear and compact representation! And so on. I guess I wish I had taken more time early on to learn from colleagues with a different set of skills than mine.
- How do you respond when you hear the phrase ‘big data’?
Do you really need big data? Sometimes customers ask for a big data platform just because. Then, when you investigate more deeply, you realize that they really do not have – and do not want to have – such a big amount of data to take care of every day. A nice traditional DWH (Data Warehouse) solution is definitely enough for them.
Sometimes, though, a big data solution really is needed – or at least it soon will be.
- What is the most exciting thing about your field?
Probably, the variety of applications. The whole body of knowledge around data collection, data warehousing, data analytics, data visualization, and results inspection and presentation carries over to a number of application fields. You would be surprised at how many different applications can be designed using a variation of the same data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.
- How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
I always propose a first pilot/investigation mini-project at the very beginning. This is for me to get a better idea of the application specs, of the data set, and, yes, also of the customer. This is a crucial phase, though a short one. During it, I can take the measure of the project in terms of the time and resources needed, and the customer and I can study each other and adjust our expectations about input data and final results. This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to produce the requested results.
Once this part is successful and expectations have been adjusted on both sides, the real project can start.
- You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges and deal with stakeholders and executives? What advice do you have for new starters about this?
Ah … I am really not a very good example when it comes to dealing with stakeholders and executives and successfully managing cultural challenges! Usually, I rely on external collaborators to handle this part for me, partly because of time constraints.
I see myself as a technical professional, with little time for talking and convincing. Unfortunately so, because this is a big part of every data analytics project.
However, when I have to deal with it myself, I let the facts speak for me: the final or intermediate results of current and past projects. This is the easiest way to convince stakeholders that the project is worth the time and the money. Just in case, though, I always have at hand a set of slides with previous accomplishments to present to executives if and when needed.
- Tell us about something cool you’ve been doing in Data Science lately.
My latest project was about anomaly detection in industry. I found it a very interesting problem to solve, one where skills and expertise have to meet creativity. In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to let happen. What you have is a data set of records of the normal functioning of the machine, transactions, system, or whatever it is you are observing. The challenge then is to predict anomalies before they happen, without previous historical examples. That is where the creativity comes in. Traditional machine learning algorithms need a twist in their application to provide an adequate solution for this problem.
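One common twist of this kind is to model only the "normal" behaviour and flag anything that deviates from it. The sketch below is a deliberately simple illustration of that idea – a per-feature z-score profile fitted on normal-only records – and not the actual method used in the project described above:

```python
import math

def fit_normal_profile(records):
    """Learn a per-feature mean and standard deviation from normal-only records."""
    n = len(records)
    dims = len(records[0])
    means = [sum(r[d] for r in records) / n for d in range(dims)]
    # A zero std (constant feature) falls back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in records) / n) or 1.0
            for d in range(dims)]
    return means, stds

def is_anomaly(record, profile, threshold=3.0):
    """Flag a record if any feature deviates by more than `threshold` sigmas
    from the profile learned on normal data."""
    means, stds = profile
    return any(abs(x - m) / s > threshold
               for x, m, s in zip(record, means, stds))
```

In practice you would reach for something more robust, such as a one-class SVM or an isolation forest, but the training data would still consist of normal records only – that is the twist.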