Data Scientist is the hot new job title

All of us involved in Tech know that ‘Data Scientist’ is the hot new job title.
I recently saw an Infodeck from Martin Fowler on this subject.

I reproduce part of it here:

“Data Scientist” will soon be the most over-hyped job title in our industry. Lots of people will attach it to their résumé in the hopes of better positions, but despite the hype, there is a genuine skill set:

  • The ability to explore questions and formulate them as hypotheses that can be tested with statistics.
  • Business knowledge, consulting, and collaboration skills.
  • An understanding of machine-learning techniques.
  • Enough programming ability to implement the various models they are working with.

Although most data scientists will be comfortable using specialized tools, all this is much more than knowing how to use R. Understanding when to use models is usually more important than being able to run them, as is knowing how to avoid probabilistic illusions and overfitting.

Data Science as a Process

Hilary Mason, one of the shining lights of the data science world, tweeted recently:

‘Data people: What is the very first thing you do when you get your hands on a new data set?’ 

‘What I do when I get a new dataset’, a recent article on the Simple Statistics blog, is a response to this.

I’ve been thinking about my own data science process. My academic background is in Physics and Mathematics, so I am influenced by those disciplines. This is a personal blog post, just documenting my own data science process.

0) Try to understand the data set: I must admit there have been projects where I’ve forgotten this step. I’ve been so eager to apply ‘cool algorithm X’ or ‘cool model Y’ to a problem that I’ve forgotten that Exploratory Data Analysis – which always strikes me as low-tech – is extremely valuable. Now I love this period: I plot graphs, I look for trends, I clean out outliers. I try to figure out what sort of distribution the data follows. I try to figure out anything interesting or novel. This often highlights duplicates, or possible duplicates. And I find myself coming back to this period.

So I have three heuristics to follow which I find work well:

  • Talk to the Business Analysts or domain experts: Never be so arrogant as a data scientist that you think you can ignore an experienced Business Analyst. When I was at Amazon I learned countless things from chatting to Supply Chain Analysts or Financial Analysts. Some of these were business logic, some were ‘what does that acronym mean?’, and some were explanations of a process that was new to me. Remember that data in lots of companies is designed for operations, not data analysis.
  • Import the data into a database: Building a database doesn’t take too long, and often you are given data in some sort of ugly CSV or log format. You shouldn’t underestimate the value of running a few GROUP BY and COUNT statements to see if the data is corrupt.
  • Examine the heads and tails: In R this is sapply(df, class), head(df) and tail(df). You can do similar things with Pandas in Python – see the sketch after this list. This generally gives you a chance to look at outliers or NAs.
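
As a minimal Pandas sketch of the last two heuristics (the file name and column names are hypothetical, but the calls are standard Pandas):

    import pandas as pd

    # Load the raw file – 'orders.csv' is a hypothetical example.
    df = pd.read_csv('orders.csv')

    # The Python analogues of sapply(df, class), head(df) and tail(df):
    print(df.dtypes)
    print(df.head())
    print(df.tail())

    # A quick 'group by and count' style corruption check,
    # analogous to a SQL GROUP BY ... COUNT(*):
    print(df['status'].value_counts(dropna=False))

    # Count missing values (NAs) per column:
    print(df.isna().sum())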

1) Learn about what kind of elephant you are trying to eat.

One of my old mentors at Amazon told me that all problems at Amazon are huge elephants. At a dynamic and exciting company like that, he was right. So we used to talk about ‘eating the elephant, piece by piece’. I would add a corollary: you need to know what kind of ‘elephant’ you are dealing with. For example, what data sources are you dealing with, and what is the business process? Are there quirks to this business process that you’ll need a domain-specific expert to help you understand?
Here chatting to business analysts helps as well. Unless I know the domain really well, I find this period takes some time. Sometimes you’ll have stakeholders who are impatient about it, but you must be frank with them that this period is needed and that it pays off in the later analysis. And if you aren’t allowed to do this, you can just update your LinkedIn profile and find another job :)

I consider the outlier checks and reading the column headings to be part of this step as well.

You can learn a lot just by documenting what the column headers are.

The final part of this step is to generate a question about the data. Write this down in a document, along with the steps you’re taking. It doesn’t have to be scientific, but this is a good way to keep yourself focused on the business questions.

2) Clean the data 

I don’t necessarily enjoy this part. It is painful but necessary, and involves a lot of tedious and sometimes annoying work.

  • Look for encoding issues
  • Test boring hypotheses: for example, if you have telecommunications data, check that num_of_foreign_calls <= total_number_of_calls (a sketch of this check follows the list)
  • Sanity checks
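
A minimal sketch of such a sanity check in Pandas (the file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv('telecoms.csv')

    # The 'boring hypothesis': foreign calls can never exceed total calls.
    bad_rows = df[df['num_of_foreign_calls'] > df['total_number_of_calls']]

    # Fail loudly if the invariant is violated, reporting how many rows broke it.
    assert bad_rows.empty, f'{len(bad_rows)} rows violate the calls invariant'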

3) Plot the data

Now I plot the data and try to understand it. ggplot2 and Matplotlib are your friends here – until they aren’t. I am currently investing time in learning about the visualization libraries in Python and R, and I suggest you do the same. Still, histograms and simple plots work best here. The purpose of plotting data is to lead to insight. The sexy stuff can come later.

Scatter plots, density plots, histograms, and so on.
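
As a minimal Matplotlib sketch (the data file and column names are hypothetical):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv('data.csv')

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    # A histogram shows the distribution of a single variable.
    axes[0].hist(df['revenue'], bins=30)
    axes[0].set_title('Distribution of revenue')

    # A scatter plot shows the relationship between two variables.
    axes[1].scatter(df['ad_spend'], df['revenue'], alpha=0.5)
    axes[1].set_title('Revenue vs. ad spend')

    plt.tight_layout()
    plt.show()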

Here is an example of a scatter plot from a logistic regression model.

[Figure: Logistic Regression]

4) Answer the simple hypothesis from step 1

In step 1 we came up with a hypothesis such as ‘there is a correlation between X and revenue’. By now you should know enough about the data to answer it with an analysis. It could be a simple machine learning model or a regression model.
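
A minimal sketch of answering such a correlation hypothesis (the file and column names are hypothetical; scipy.stats.pearsonr is a standard call):

    import pandas as pd
    from scipy import stats

    df = pd.read_csv('data.csv')

    # Test the hypothesis: 'there is a correlation between X and revenue'.
    r, p_value = stats.pearsonr(df['X'], df['revenue'])
    print(f'Pearson r = {r:.3f}, p-value = {p_value:.4f}')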


An interview with a data artisan


J.D. Long is the current AVP of Risk Management at RenaissanceRe and has a 15-year history of working as an analytics professional.

I recently sent him some interview questions to see what he would say.

Good questions Peadar. Here’s a really fast attempt at answers: 
1. What project that you have worked on do you wish you could go back to and do better?

1) I’ve been asked this question before: http://www.cerebralmastication.com/2011/04/the-best-interview-question-ive-ever-been-asked/

Longer answer: Interestingly, what I find myself thinking about when asked this question is not analytics projects where I wish I could redo the analysis, but rather instances where I felt I did good analysis but did a bad job explaining the implications to those who needed the info. Which brings me to #2… 


2. What advice do you have to younger analytics professionals? 

2) Learn technical skills and enjoy learning new things, naturally. But, 1) always plot your data to visualize relationships, and 2) remember at the end of the analysis you have to tell a story. Humans are hard-wired to remember stories, not numbers. Throw away your slide deck pages with a table of p-values and instead put up a picture of someone’s face and tell their story. Or possibly show a graph that illustrates the story. But don’t forget to tell the story.

3. What do you wish you knew earlier about being a data artisan? 

3) Inside of a firm, cost savings of $1mm seems like it should be the same as generating income of $1mm. It’s not. As an analyst you can kick and whine and gripe about that reality, or you can live with it. One rational reason for the inequality is that income is often more reproducible than cost savings. However, the real reason is psychological. Once a cost savings happens it’s the new expectation. So there’s no ‘credit’ for future years. Income is a little different in that people who can produce $1mm in income every year are valued every year. That’s one of the reasons I listed “be a profit center” in the post John referenced. There are many more reasons, but that alone is a good one. 


4. How do you respond when you hear the phrase ‘big data’? 

4) I immediately think, “buzz word alert”. The phrase is almost meaningless. I try to listen to what comes next to see if I’m interested. 

5. What is the most exciting thing about your field?

5) Everybody loves a good “ah-ha!” moment. Analytics is full of those. I think most of us get a little endorphin drop when we learn or discover something. I’ve always been very open about what I like about my job. I like being surrounded by interesting people, working on interesting problems, and being well compensated. What’s not to love!

Cheers, 

P.S. The post J.D. Long mentioned is http://www.johndcook.com/blog/2011/11/21/career-advice-regarding-tools/

Data Science and Soft Skills

I once did an internship under Andrew Fogg at Import.io.

I learned a lot about data science during that period, but one of the hardest lessons was the importance of soft skills and project management in any data science project.

John Foreman, another idol of mine, talked a bit about this in his book about data.

So although I am not a super-experienced data scientist, I am going to talk about what I have learned so far from the data science projects I have been involved in.

Sometimes it is a development project

Sometimes you will encounter data science projects which actually need data engineering or software engineering. I think it is OK for data scientists to do some scripting and maybe hack together some web applications, but that is a bit different from what a software engineering team should do.

Data Science is not software engineering

For reasons I have not quite understood, some project-management practices from software engineering work in data science projects and some do not. In my experience the notion of running it as an agile project seems to work, yet daily scrum meetings can sometimes be too much. Also, too much interaction with business partners can derail analytics projects.

Gantt charts or burndown charts work to some degree

I have successfully used these in data science projects. They communicate to non-technical stakeholders that progress is being made – stakeholders who often lack the mental model to understand the work itself.

Solving a problem as stated, without further exploration, is not a good idea

Sometimes you are given a data science project and a suggested technique, and you try as an analyst to solve that problem as stated. This generally backfires. Interaction with the business helps here, along with lots of questions, so you sufficiently understand their motivations.

Deadlines are lies

I have never ever done an analytics project that worked out the way I expected. One reason is that some things are what I call ‘linear tasks’ and some things are ‘non-linear tasks’. Applying a basket-analysis algorithm, for example, can be a linear task, but only if you have the right data set prepared and are familiar with the programming language and tools being used.

So be very firm and explicit with your stakeholders about what is linear and what is not.

Of course, if you are in an environment that does not allow you to control your own deadlines and has unrealistic expectations of good-quality analytics work, then it is probably a sign the universe is telling you to clean up your LinkedIn profile.

I will explore more of these concepts in the future. 


On the NHS

[Image: a succinct observation from a Dr about the NHS]

This is a succinct observation about the NHS – and how taxes should work. I suspect some of the proposed reforms (such as taking 10 pounds from every person) come from our collective apathy towards tax. Taxes certainly are needed for the NHS, but we seem not to want to hear that. We also need more empowerment of workers.

How can big data help my supply chain?

When people speak of data science they often talk about Hadoop, Cassandra and similar technologies. Yet I am a data scientist and I don’t use such technologies.

In fact I think that Operations Research, cleaning data, and using Python, R and MySQL can be sufficient for a lot of logistics.

Supply Chain performance programs should have three objectives:

  • (1) improve the efficiency of decision-making processes throughout supply chain operations by synchronizing production plans, inventory control and transportation policies;
  • (2) integrate the core business processes of planning and budgeting with supply chain operations;
  • (3) achieve system-wide cost reductions through global optimization. Optimizing each component of the supply chain does not usually imply total optimization, and off-the-shelf solutions do not fully guarantee either smooth integration or global cost optimization (Chen et al. 2006). A toy sketch of this point follows the list.
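
As a toy illustration of why optimizing each component separately is not the same as global optimization (all numbers here are hypothetical; scipy.optimize.linprog is a standard solver):

    from scipy.optimize import linprog

    # Two plants must together ship 100 units.
    # Plant A: production cost 4, transport cost 2 (total 6 per unit).
    # Plant B: production cost 3, transport cost 5 (total 8 per unit).
    c = [6, 8]                    # combined cost per unit from each plant
    A_eq, b_eq = [[1, 1]], [100]  # total shipped must equal demand
    bounds = [(0, 80), (0, 80)]   # capacity limit of 80 units per plant

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(res.x, res.fun)         # global optimum: 80 from A, 20 from B, cost 640

    # A production manager optimizing only production cost would favour
    # plant B (3 < 4), giving 80*8 + 20*6 = 760 in combined cost –
    # worse than the system-wide optimum of 640.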

Applying a holistic or total-systems management approach to managing the entire flow of materials, services and information in fulfilling customers’ expectations sometimes conflicts with the objectives of managers who focus on optimizing their internal operations.

Key performance indicators used to track departmental performance do not guarantee system-wide performance optimization and in many cases conflict with one another. Arbitrations or trade-offs are in most cases subject to personal views (biases), interorganizational politics or short-term gains, lacking an integrated framework.

Having the different components of a supply chain system perform independently may lead to inefficiencies, with systems and processes fighting each other.

My favorite part of data science is producing data products that help people make business decisions. I like to have my hands on KPIs, integrating analytics and users’ needs into new product design. Building data products is one of the most exciting parts of data science, and I truly believe that data makes the world go around. In a series of future posts I’ll talk a bit about big data in logistics.