# One weird tip to improve the success of Data Science projects

I was recently speaking to some data science friends on Slack, and we were discussing projects and war stories. Something that came across was that ‘data science’ projects aren’t always successful.

Source: pixabay

Somewhere in this discussion a lightbulb went on in my head about some of the problems we have when embarking on data science projects. There’s a certain amount of cargo cult data science, and collectively we as a community of business people, technologists and executives don’t think deeply enough about the risks and opportunities of projects.

So I had my lightbulb moment and now I share it with everyone.

The one weird trick is to write down risks before embarking on a project.

Here are some questions you should ask before you start a project – preferably gather all data.

• What happens if we don’t do this project? What is the worst-case scenario?
• What legal, ethical or reputational risks are involved if we successfully deliver results with this project?
• What engineering risks are there in the project? Is it possible this could turn into a 2-year engineering project as opposed to a quick win?
• What data risks are there? What kinds of data do we have, and what are we not sure we have? What risks are there in terms of privacy, law and ethics?

I’ve found that gathering stakeholders around helps a lot with this: you hear different perspectives, and it can help you figure out what the key risks in your project are. For instance, I’ve found in the past that ‘lack of data’ killed certain projects. It’s good to clarify that before you spend 3 months on a project.

Try this out and let me know how it works for you! Share your stories with me at myfullname[at]google[dot]com.

# Interviews with Data Scientists: NLP for the win

Recently I decided to do some quick Data Analysis of my interviews with data scientists.

It seems natural when you collect a lot of data to explore it and do some data analysis on it.

You can access the code here.
The code doesn’t go into much depth, but it is a simple example of how to use NLTK and a few other Python libraries to do some quick data analysis of ‘unstructured’ data.

First question:

What does a word cloud of the data look like?

Word cloud of my Corpus based on interviews published on Dataconomy

We can see above that science, PhD, big, etc. all pop up a lot, which is not surprising given the subject matter.

Then I used NLTK to do some word frequency analysis. First I removed stop words and punctuation.

I got the following result. Unsurprisingly, the most common word was data, followed by science; however, the other words are of interest, since they indicate what professional data scientists talk about with regard to their work.

Source: All interviews published on Dataconomy by me until the end of last week – which was the end of September 2015.
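
The stop-word removal and frequency count described above can be sketched roughly as follows. This is not the code linked in the post: it uses only the Python standard library so that it runs without downloading any corpora, and the tiny `STOP_WORDS` set, the `word_frequencies` helper and the sample sentence are all made up for illustration. With NLTK one would probably swap in its English stop-word list and `nltk.FreqDist`, which is a subclass of `collections.Counter`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# NLTK ships a much fuller English list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "i", "that", "with"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count words after lower-casing and dropping punctuation and stop words."""
    # [a-z]+ keeps only runs of letters, which discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample text standing in for the interview corpus.
sample = "Data science is the science of data. I work with data, and I love it!"
freqs = word_frequencies(sample)
print(freqs.most_common(2))
```

On a real corpus you would read the interview texts from disk and pass the concatenated string to `word_frequencies`.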

# An interview with a data artisan

J.D.Long is the current AVP Risk Management at RenaissanceRe and has a 15-year history of working as an analytics professional.

I sent him some interview questions recently to see what he would say.

1. What project that you have worked on do you wish you could go back to and do better?

Longer answer: Interestingly, what I find myself thinking about when asked this question is not analytics projects where I wish I could redo the analysis, but rather instances where I felt I did good analysis but did a bad job explaining the implications to those who needed the info. Which brings me to #2…

2. What advice do you have to younger analytics professionals?

2) Learn technical skills and enjoy learning new things, naturally. But, 1) always plot your data to visualize relationships and 2) remember at the end of the analysis you have to tell a story. Humans are hard wired to remember stories and not numbers. Throw away your slide deck pages with a table of p values and instead put a picture of someone’s face and tell their story. Or possibly show a graph that illustrates the story. But don’t forget to tell the story.

3. What do you wish you knew earlier about being a data artisan?

3) Inside of a firm, cost savings of \$1mm seems like it should be the same as generating income of \$1mm. It’s not. As an analyst you can kick and whine and gripe about that reality, or you can live with it. One rational reason for the inequality is that income is often more reproducible than cost savings. However, the real reason is psychological. Once a cost savings happens it’s the new expectation. So there’s no ‘credit’ for future years. Income is a little different in that people who can produce \$1mm in income every year are valued every year. That’s one of the reasons I listed “be a profit center” in the post John referenced. There are many more reasons, but that alone is a good one.

4. How do you respond when you hear the phrase ‘big data’?

4) I immediately think, “buzz word alert”. The phrase is almost meaningless. I try to listen to what comes next to see if I’m interested.

5) Everybody loves a good “ah-ha!” moment. Analytics is full of those. I think most of us get a little endorphin drop when we learn or discover something. I’ve always been very open about what I like about my job. I like being surrounded by interesting people, working on interesting problems, and being well compensated. What’s not to love!

Cheers,

P.S. The post J.D.Long mentioned is http://www.johndcook.com/blog/2011/11/21/career-advice-regarding-tools/

# Information Retrieval

Attention conservation notice: 680 words about Information Retrieval, and highly unoriginal.

The following is very much inspired by a course by Cosma Shalizi but I felt it was worth rewriting to get to grips with the concepts. This is the first of what is hopefully a series of posts on ‘Information Retrieval’, and applications of Mathematics to ‘Data Mining’.

1. Textual features

I’d like to introduce how features are extracted from text; we shall treat these features as proxies for actual meanings. One classic representation we can use is bag-of-words (BoW). In this representation we list all of the distinct words in the document together with how often each one appears. This is easy to calculate from the text.

Vectors. One way we could try to code up the bag-of-words is to use vectors. Let each component of the vector correspond to a different word in the total lexicon of our document collection, in a fixed, standardised order. The value of the component would be the number of times the word appears, possibly including zero. We use this vector bag-of-words representation of documents for two big reasons:

• There is a huge pre-existing technology for vectors: people have worked out, in excruciating detail, how to compare them, compose them, simplify them, etc. Why not exploit that, rather than coming up with stuff from scratch?
• In practice, it’s proved to work pretty well.

We can store data from a corpus in the form of a matrix. Each row corresponds to a distinct case (or instance, unit, subject, …), here a document, and each column to a distinct feature. Conventionally, the number of cases is n and the number of features is p. It is no coincidence that this is the same format as the data matrix X in linear regression.
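
To make the bag-of-words matrix concrete, here is a minimal sketch in plain Python. The whitespace tokenizer, the helper names `tokenize` and `document_term_matrix`, and the two toy documents are assumptions made for illustration; a real pipeline would need more careful tokenisation.

```python
from collections import Counter


def tokenize(text):
    """Very crude tokenizer (an assumption): lower-case and split on whitespace."""
    return text.lower().split()


def document_term_matrix(documents):
    """Return (lexicon, matrix) with one row per document and one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Fixed, standardised order: sort the distinct words alphabetically.
    lexicon = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[doc_counts[word] for word in lexicon] for doc_counts in counts]
    return lexicon, matrix


# Toy corpus, made up for this example: n = 2 documents.
docs = ["the cat sat on the mat", "the dog sat"]
lexicon, X = document_term_matrix(docs)

print(lexicon)
for row in X:
    print(row)
```

Here n is `len(X)` and p is `len(lexicon)`, so every document becomes a vector of the same length regardless of how many words it contains.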

2. Measuring Similarity

Right now, we are interested in saying which documents are similar to each other because we want to do a search by content. But measuring similarity (or equivalently measuring dissimilarity or distance) is fundamental to data mining. Most of what we will do will rely on having a sensible way of saying how similar to each other different objects are, or how close they are in some geometric setting. Getting the right measure of closeness will have a huge impact on our results.

This is where representing the data as vectors comes in so handy. We already know a nice way of saying how far apart two vectors are, the ordinary or Euclidean distance, which we can calculate with the Pythagorean formula:

$\displaystyle \|\bar{x} - \bar{y}\| = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}$

where ${x_i}$, ${y_i}$ are the ${i^{th}}$ components of ${\bar{x}}$ and ${\bar{y}}$. Remember that for bag-of-words vectors each distinct word, that is each entry in the lexicon, is a component or a feature. We can also use our Linear Algebra skills to define the Euclidean norm, or Euclidean length, of any vector. This is ${\|\bar{x}\| = \sqrt{\sum_{i=1}^{p}x^{2}_{i}}}$, so the distance between two vectors is the norm of their difference ${\bar{x} - \bar{y}}$. Equivalently, the norm of a vector is the distance from it to the origin, ${\bar{0}}$. One can also look up a topology textbook for other metrics, such as the taxicab metric.
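
As a quick illustration of these formulas, the sketch below computes the Euclidean norm and distance, plus the taxicab metric for comparison. The function names and the two example vectors are made up for this example; the vectors are just small word-count vectors of equal length.

```python
import math


def euclidean_norm(x):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))


def euclidean_distance(x, y):
    """Norm of the difference vector x - y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return euclidean_norm([xi - yi for xi, yi in zip(x, y)])


def taxicab_distance(x, y):
    """Sum of absolute component-wise differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Hypothetical word-count vectors over a six-word lexicon.
x = [1, 0, 1, 1, 1, 2]
y = [0, 1, 0, 0, 1, 1]

print(euclidean_distance(x, y))  # square root of 5
print(taxicab_distance(x, y))
print(euclidean_norm(x) == euclidean_distance(x, [0] * len(x)))
```

The last line checks the remark that the norm of a vector is its distance to the origin.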

2.1. Normalisation

Just looking at the Euclidean distances between document vectors doesn’t work, at least if the documents are at all different in size. Instead, we need to normalise by document size, so that we can fairly compare short texts with long ones. There are (at least) two ways of doing this.

Document length normalisation: divide the word counts by the total number of words in the document. In symbols,

$\displaystyle \bar{x} \mapsto \frac{\bar{x}}{\sum_{i=1}^{p} x_{i}}\ \ \ \ \ (1)$

Notice that all the entries in the normalised vector are non-negative fractions, which sum to 1. The i-th component is thus the probability that if we pick a word out of the bag at random, it’s the i-th entry in the lexicon.
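
Equation (1) can be sketched in a few lines of Python. The function name `length_normalise` and the example vector are assumptions for illustration; the point is that a document and a copy of it repeated twice map to essentially the same normalised vector.

```python
def length_normalise(x):
    """Divide each word count by the total number of words in the document."""
    total = sum(x)
    if total == 0:
        raise ValueError("cannot normalise an empty document")
    return [xi / total for xi in x]


# Hypothetical word-count vector, and the same document repeated twice.
short_doc = [1, 0, 1, 1, 1, 2]
long_doc = [2 * count for count in short_doc]

p_short = length_normalise(short_doc)
p_long = length_normalise(long_doc)

print(p_short)
print(sum(p_short))
print(max(abs(a - b) for a, b in zip(p_short, p_long)))
```

The entries of `p_short` are non-negative and sum to 1 up to floating-point rounding, so they can be read as the probabilities described above.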

Cosine ‘distance’ is actually a similarity measure, not a distance:

$\displaystyle d_{\cos}(\bar{x}, \bar{y}) = \frac{\sum_{i} x_{i}y_{i}}{\|\bar{x}\|\|\bar{y}\|} \ \ \ \ \ (2)$

It’s the cosine of the angle between the vectors ${\bar{x}}$ and ${\bar{y}}$.
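
A minimal sketch of equation (2) in Python follows. The function name `cosine_similarity` and the example vectors are made up for illustration. Note that scaling a vector does not change the result, which is why this measure copes with documents of different lengths.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical word-count vectors over a six-word lexicon.
x = [1, 0, 1, 1, 1, 2]
y = [0, 1, 0, 0, 1, 1]

print(cosine_similarity(x, y))                     # 3 / sqrt(24)
print(cosine_similarity(x, [2 * xi for xi in x]))  # same direction, close to 1
print(cosine_similarity([1, 0], [0, 1]))           # no shared words
```

For word-count vectors all components are non-negative, so the value lies between 0 (no words in common) and 1 (identical word proportions).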