Hilary Mason, one of the shining lights of the world of data science, tweeted recently:
‘Data people: What is the very first thing you do when you get your hands on a new data set?’
‘What I do when I get a new dataset’, a recent article on the Simply Statistics blog, is a response to this.
I’ve been thinking about my own data science process. My academic background is in Physics and Mathematics, so I am influenced by those disciplines. This is a personal blog post, just to document my own Data Science Process.
0) Try to understand the data set: I must admit there have been projects where I’ve forgotten to do this. I’ve been so eager to apply ‘Cool algorithm X’ or ‘Cool model Y’ to a problem that I’ve forgotten that Exploratory Data Analysis, which always strikes me as low tech, is extremely valuable. Now I love this period: I plot graphs, I look for trends, I clean out outliers. I try to figure out what sort of distribution the data follows. I try to figure out anything interesting or novel. This often highlights duplicates, or possible duplicates. And I find myself coming back to this period.
So I have three heuristics to follow which I find work:
- Talk to the Business Analysts or domain experts: Never be so arrogant as a data scientist that you think that you can ignore an experienced Business Analyst. When I was at Amazon I learned countless things from chatting to Supply Chain Analysts or Financial Analysts. Some of these included business logic, or ‘what does that acronym mean’, and some involved an explanation of a process that was new to me. Remember that data in lots of companies is designed for operations, not data analysis.
- Import the data into a database: Building a database doesn’t take too long, and often you are given data in some sort of ugly csv or log format. Don’t underestimate the value of running a few groupby and count statements to see if the data is corrupt.
- Examine the heads and tails: In R this is sapply(df, class); head(df); tail(df). You can do similar things in Python with Pandas. This generally gives you a chance to look at outliers or NAs.
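As a rough sketch of the last two heuristics in Python: the snippet below builds a small made-up table (the column names and values are hypothetical, not from any real project), looks at column types, heads and tails with Pandas, and then loads it into an in-memory SQLite database to run a quick group by and count. It is one possible way to do this, assuming Pandas is installed.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for an "ugly csv"; all names and values are made up.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "warehouse": ["DUB", "DUB", "DUB", "LUX", None],
    "units": [10, 5, 5, 7, 1200],
})

# Column types, first rows and last rows
# (roughly sapply(df, class); head(df); tail(df) in R).
print(df.dtypes)
print(df.head(3))
print(df.tail(3))

# Count missing values per column.
na_counts = df.isna().sum()
print(na_counts)

# Load the table into an in-memory SQLite database and run a simple
# GROUP BY / COUNT to surface order_ids that appear more than once.
conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False)
dupes = pd.read_sql_query(
    "SELECT order_id, COUNT(*) AS n "
    "FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    conn,
)
conn.close()
print(dupes)
```

For a real file you would swap the toy DataFrame for something like pd.read_csv on your own path; the checks themselves stay the same.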
1) Learn about what kind of elephant you are trying to eat.
One of my old mentors at Amazon told me that all problems at Amazon are huge elephants. At a dynamic and exciting company like that, he was right. So we used to talk about ‘eating the elephant, piece by piece’. I would add a corollary: you need to know what kind of ‘elephant’ you are dealing with. For example, what data sources are you dealing with? What is the business process? Are there quirks to this business process that you’ll need a domain-specific expert to help you understand?
Here chatting to business analysts helps as well. Unless I know the domain really well, I find this period takes some time. Sometimes you’ll have stakeholders who are impatient about this period. But you must be frank with them that this period is needed and that it is valuable for the later analysis. And if you aren’t allowed to do this, you can just update your LinkedIn profile and find another job :)
I consider the outlier checks and reading column headings to be part of this as well.
You can learn a lot just by documenting what the column headers are.
The final part of this step is to generate a question about the data. Write this down in a document, with the steps you’re taking. It doesn’t have to be scientific, but this is a good way to keep you focused on the business questions.
2) Clean the data
I don’t necessarily enjoy this part. It is painful but necessary, and involves a lot of tedious and sometimes annoying work.
- Look for encoding issues
- Test boring hypotheses. For example, if you have telecommunication data, check that num_of_foreign_calls <= total_number_of_calls
- Sanity checks
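Here is a minimal sketch of what these checks might look like in Python with Pandas. The table, the customer_id column and the decodes_as_utf8 helper are all made up for illustration; only the two call-count column names come from the example above.

```python
import pandas as pd

# Hypothetical telecom data; customer_id and all values are invented.
calls = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "num_of_foreign_calls": [3, 12, 0],
    "total_number_of_calls": [40, 10, 25],
})

# Boring hypothesis: foreign calls can never exceed total calls.
violations = calls[
    calls["num_of_foreign_calls"] > calls["total_number_of_calls"]
]
print(violations)

# Sanity check: call counts should never be negative.
count_cols = ["num_of_foreign_calls", "total_number_of_calls"]
negative = calls[(calls[count_cols] < 0).any(axis=1)]
print(negative)


def decodes_as_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if the bytes are valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encoding check: the same text saved as Latin-1 is not valid UTF-8.
print(decodes_as_utf8("café".encode("utf-8")))
print(decodes_as_utf8("café".encode("latin-1")))
```

Rows that land in violations or negative are the ones worth taking back to a domain expert before any modelling.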
3) Plot the data
Now I plot the data and try to understand it. Ggplot and Matplotlib are your friends here – until they aren’t. I am currently investing time in learning about the visualization libraries in Python and R. I suggest you do the same. Yet histograms and simple plots work best here. The purpose of plotting data is to lead to insight. The sexy stuff can come later.
Scatter plots, density plots, histograms, etc.
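As a small sketch of the ‘simple plots first’ idea, the snippet below draws a histogram and a scatter plot with Matplotlib. The data is randomly generated purely for illustration, and the variable names are my own assumptions.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Synthetic data for illustration only.
random.seed(0)
x = [random.gauss(50, 10) for _ in range(500)]
y = [2.0 * xi + random.gauss(0, 15) for xi in x]

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: what sort of distribution does x follow?
ax_hist.hist(x, bins=30)
ax_hist.set_title("Histogram of x")
ax_hist.set_xlabel("x")
ax_hist.set_ylabel("count")

# Scatter plot: is there an obvious relationship between x and y?
ax_scatter.scatter(x, y, s=8, alpha=0.5)
ax_scatter.set_title("y against x")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

fig.tight_layout()
# fig.savefig("eda_plots.png")  # or plt.show() in an interactive session
```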
Here is an example of a scatter plot from a logistic regression model.
4) Answer the simple hypothesis from step 1, such as ‘is there a correlation between X and revenue?’
Now you should know enough about the data to do an analysis. It could be a simple machine learning model or a regression model.
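To make this concrete, here is a hedged sketch of about the simplest analysis you could run for a ‘correlation between X and revenue’ question: a Pearson correlation and a straight-line fit with NumPy. The data is synthetic and the true slope and noise level are assumptions I picked for the example, so treat it as a template, not a result.

```python
import numpy as np

# Synthetic data: revenue is assumed to depend linearly on x plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=200)
revenue = 3.0 * x + 50.0 + rng.normal(0, 20, size=200)

# Pearson correlation between x and revenue.
r = np.corrcoef(x, revenue)[0, 1]

# Least-squares straight line: revenue ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, revenue, deg=1)

print(f"Pearson r = {r:.2f}")
print(f"revenue ≈ {slope:.2f} * x + {intercept:.2f}")
```

If the relationship looks non-linear in your plots from step 3, a straight line is the wrong tool and a different model is worth trying.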