A Bayesian Hierarchical model for the Six Nations.

I’m a data scientist and a massive rugby fan. I recently built a Bayesian model, based on some papers I found on Bayesian models in soccer. The basic idea is to simulate the results of the six teams based on historical data, with the model taking home advantage into account. I suggest you read the notebook linked below, and I’ll stick the code on GitHub soon.
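To give a flavour of the simulation step, here is a minimal sketch, not the notebook's actual code. It assumes a Poisson log-linear scoring model of the kind common in the soccer literature, with a home-advantage term plus per-team attack and defence effects. All parameter values are placeholders (every team is set equal), the names `INTERCEPT`, `HOME_ADVANTAGE`, `ATTACK` and `DEFENCE` are hypothetical, and the table rules (2 points for a win, 1 for a draw, ties broken on points difference, home side drawn at random) are simplifying assumptions.

```python
import math
import random
from collections import Counter
from itertools import combinations

TEAMS = ["England", "France", "Ireland", "Italy", "Scotland", "Wales"]

# Placeholder values on the log scale, NOT estimates. In a fitted model these
# would be posterior draws learned from historical results.
INTERCEPT = 3.0        # baseline log points scored per match
HOME_ADVANTAGE = 0.2   # boost to the home side's log scoring rate
ATTACK = {team: 0.0 for team in TEAMS}   # all teams equal in this sketch
DEFENCE = {team: 0.0 for team in TEAMS}  # negative would mean conceding less


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_match(rng, home, away):
    """Return (home_points, away_points) for one simulated match."""
    home_rate = math.exp(INTERCEPT + HOME_ADVANTAGE + ATTACK[home] + DEFENCE[away])
    away_rate = math.exp(INTERCEPT + ATTACK[away] + DEFENCE[home])
    return sample_poisson(rng, home_rate), sample_poisson(rng, away_rate)


def simulate_tournament(rng):
    """Play each pairing once and return the name of the table winner."""
    table, diff = Counter(), Counter()
    for pair in combinations(TEAMS, 2):
        home, away = rng.sample(pair, 2)  # home side drawn at random
        home_pts, away_pts = simulate_match(rng, home, away)
        diff[home] += home_pts - away_pts
        diff[away] += away_pts - home_pts
        if home_pts > away_pts:
            table[home] += 2
        elif away_pts > home_pts:
            table[away] += 2
        else:
            table[home] += 1
            table[away] += 1
    # Remaining ties fall to the first team in TEAMS order (a simplification).
    return max(TEAMS, key=lambda team: (table[team], diff[team]))


def title_probabilities(n_sims=1000, seed=0):
    """Estimate each team's share of simulated tournament wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in TEAMS}


if __name__ == "__main__":
    for team, prob in title_probabilities().items():
        print(f"{team}: {prob:.2f}")
```

With equal placeholder strengths every team should win roughly one tournament in six; in a real analysis you would plug in posterior samples for each parameter and repeat the simulation per sample.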

The sad fact, as an Irish fan, is that England win the Six Nations in the majority of simulations from this model.

Enjoy! I’ll probably write up a tutorial about this in the future; I found it a useful exercise.

You can see the IPython Notebook here

Interview with a Data Expert – Kevin Hillstrom

This interview is with Kevin Hillstrom, whom I’ve found illuminating since I started following him on Twitter. He’s an analyst who stepped up the corporate ladder a bit and now helps companies with their data strategy and with understanding their data better. I emailed him a few weeks ago with these interview questions and I’ve lightly edited his answers.

What I liked about this interview was that Kevin focused on the soft skills – I feel we as a data science community speak too much about the technical skills.

What is the biggest misunderstanding in “big data” and “data science”?

  • To me, it is the “we’re going to save the world with data” mentality. I like the optimism, that’s good! I do not like the hype.


Describe the three most underrated skills of a good analyst and how does an analyst learn them?

  • The first underrated skill is selling. An analyst must learn how to sell ideas. My boss sent me to Dale Carnegie training, a course for sales people. The skills I learned in that class are invaluable.
  • The second underrated skill is accuracy. I work with too many analysts who do all of the “big data” stuff, but then run incorrect queries and, as a result, lose credibility with those they are analyzing data for.
  • The third underrated skill is business knowledge. So many analysts put their heart and soul into analyzing stuff. They could put some of their heart and soul into understanding how their business behaves. Knowledge of the business really influences how one approaches analyzing issues.

How do you clearly explain the context of a data problem to a skeptical stakeholder?

  • To me, this is where knowledge of the business is really important. So many of my mistakes happened when I cared about the data and the analysis, and did not care enough about the business. I once worked for a retail business that only had twenty-four months of data. That was a big problem, given that the company had been in business for fifty years. Nobody, and I mean nobody, cared. I explained repeatedly how I was unable to perform the work I wanted to perform. Nobody cared. When I shifted my message to what I was able to do for a competing retailer who had ten years of data, then people cared. They cared because their business was not competitive with a business they all knew. Then folks wanted to compete, and we were able to build a new database with many years of data.

What is the best question you’ve ever been asked in your professional career?

  • A high level Vice President once listened to a presentation, and then said to me, “Who cares?” The executive went on to say that I was only sharing trivia. He told me that unless I had facts and information that he could act upon, he didn’t want me to share anything. This is a good lesson. Too often, we share information because we were able to unearth an interesting nugget in the database. But if the information is “nice to know”, it doesn’t help anybody. It is better to share a simple fact that causes people to change than to share interesting facts that nobody can use to improve the business.

What is the best thing – in terms of career acceleration – you’ve ever been told in your professional career?

  • Ask to be promoted to your next job. I had a boss who, in the 9th year of my career, asked me what I wanted to do next. So I told my boss – the job was outside of my area of experience, to be honest, and the job was a major promotion. I described why I wanted the job, I described how I would do the job differently, and I described my vision for how I would make the company more profitable. Within twelve months, I was promoted into the job. My goodness, were people upset! But it was a major lesson. When somebody asks you what you want to do next in your career, be ready to offer a credible answer. Maybe more important, be ready to share your answer even if nobody asks you the question! Tell people what your next job looks like, tell people your vision for that job, tell people how the company benefits, and then do work that proves you are ready for an audacious promotion!

About Kevin: Kevin is President of MineThatData, a consultancy that helps CEOs understand the complex relationship between Customers, Advertising, Products, Brands, and Channels. Kevin supports a diverse set of clients, including internet startups, thirty million dollar catalog merchants, international brands, and billion dollar multichannel retailers. Kevin is frequently quoted in the mainstream media, including the New York Times, Boston Globe, and Forbes Magazine.

Prior to founding MineThatData, Kevin held various roles at leading multichannel brands, including Vice President of Database Marketing at Nordstrom, Director of Circulation at Eddie Bauer, and Manager of Analytical Services at Lands’ End.

Data Science tools and processes

I’ve recently been experimenting with some Data Science tools and methodologies.

The first link is “Data Products how do we get there”, which discusses what methodologies people in the data science world use. I personally use one not mentioned there called OSEMN – Obtain data, Scrub data, Explore data, Model data, iNterpret results. Still, the link is interesting. I’ve used CRISP-DM in a project as well; I found CRISP-DM suited a more report-based and process-based culture, whereas OSEMN allowed you to work in a more agile environment.

One of the challenges I find is finding the right tools to disseminate your ideas. So recently I’ve been learning how to use Flask and Jinja2 (for emails and automated reports), but I also came across an easier solution, runipy, which can be used for report automation as well. It integrates well into my IPython reporting workflow, and together with a cron job it could be very powerful, say if you need to regularly produce a report for a metrics deck or something similar. An advantage of this sort of workflow is that it is reproducible and debuggable.
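As a rough sketch of what such a scheduled report could look like, the script below wraps the runipy command line in a small Python function. It assumes runipy is installed and that its `--html` option writes an HTML copy of the executed notebook, as described in its README; the notebook name, output directory, script path and cron schedule are all hypothetical.

```python
import datetime
import subprocess
from pathlib import Path


def build_runipy_command(notebook, output_dir, run_date):
    """Build the runipy call that executes a notebook and saves a dated HTML report."""
    stem = Path(notebook).stem
    html_path = Path(output_dir) / f"{stem}-{run_date.isoformat()}.html"
    return ["runipy", str(notebook), "--html", str(html_path)]


def run_report(notebook, output_dir):
    """Execute the notebook today; raises CalledProcessError if runipy fails."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = build_runipy_command(notebook, output_dir, datetime.date.today())
    subprocess.run(command, check=True)


# Hypothetical usage, e.g. saved as run_report.py with this line added at the end:
#     run_report("metrics.ipynb", "reports")
# and a hypothetical crontab entry to run it every Monday at 07:00:
#     0 7 * * 1 /usr/bin/python3 /path/to/run_report.py
```

Because the notebook itself is the report definition, a failed run can be debugged by opening the same notebook interactively, which is where the reproducibility comes from.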

Python is getting much better tooling for these reporting challenges, a sign that the Python stack keeps improving. Unfortunately we’re not quite at Shiny or knitr level, but we’re getting there.

Stories in Data Analytics

As a Data Analytics professional I made the same mistake everyone else from a strong STEM background makes when first meeting non-analysts in a work environment.

I talked about R^2 values. And then I lost my audience….

I am reminded about this because a more experienced analyst tweeted this recently.

From @minethedata:

1. Early in my career, I created beautiful statistical models. Then I’d present my models to non-analytics staff. Those folks were bored.

Sexism in Tech conferences

Writing about sexism in tech conferences is hard. Especially as a young white male. I can only speak anecdotally, but most women in the Tech industry I speak to mention moments of subtle sexism or sometimes out-and-out harassment. As a member of the tech community I’m completely behind any promotion of minorities in the industry, and feel that more can be done. It is interesting that most men I speak to in the industry don’t notice any problem.

Two articles spring to mind:

http://womeninastronomy.blogspot.com/2014/11/its-not-about-that-damn-shirt.html

This was written about STEM but I feel the same rules apply to the Tech community (especially since I personally straddle both communities).

It’s “not a big deal” when someone tells you he came to your talk because you’re attractive.
It’s “not a big deal” when a coworker comments on your appearance.
It’s “not a big deal” when someone makes a “joke” at work demeaning women.
It’s “not a big deal” when you are asked in a job interview if you have or are planning to have kids.
It’s “not a big deal” that you were warned about what professor to avoid basically as soon as you got to school.
It’s “not a big deal” that that same professor was untouchable by the administration because he was too famous.
It’s “not a big deal” when someone assumes you are your own secretary on the phone.
It’s “not a big deal” when someone calls you “Miss” and your male colleague “Doctor.”
It’s “not a big deal” when going to parties at a conference comes with warnings of which of your fellow scientists are dangerous.
It’s “not a big deal” when your boss, adviser, or senior colleague asks you out.

All of this stuff IS a big deal. One of the things I hear about the tech industry – partly because of the passive aggression that hackers sometimes adopt – is that as a community we need to grow up and become more professional AND inclusive. I agree wholeheartedly with this and applaud the conferences that encourage more female participation and more female speakers. Diversity is a good thing and I think it makes us smarter :).

The other link I saw was http://adainitiative.org/2012/08/defcon-why-conference-harassment-matters/ about DEFCON, a famous security conference. I found the following paragraph to be very powerful.

When you say, “Women shouldn’t go to DEFCON if they don’t like it,” you are saying that women shouldn’t have all of the opportunities that come with attending DEFCON: jobs, education, networking, book contracts, speaking opportunities – or else should be willing to undergo sexual harassment and assault to get access to them. Is that really what you believe?

I am glad things are getting better but there are still a number of actions that we can all take. I think this is a subproblem of the larger problem that Pete Warden commented about. I consider his article to be self-recommending: http://petewarden.com/2014/10/05/why-nerd-culture-must-die/

Comments are welcome. The articles I linked to contain some excellent resources on how to come up with and enforce policies regarding harassment, which is a legal issue. Lots of us like to avoid legal issues like this, but an advantage of policies and ‘processes’ is that they are transparent and fair. Some of us consider these things to be too formal, but as I get older I see that some of these ‘formalities’ that we have in corporations and other organizations are useful and save a lot of hassle.

Book Review: Analytics in a Big Data World

So this is a quick review of a book that ended up in my mailbox a few months ago.

Firstly the good: this is a good academic introduction to a variety of techniques, all in one reference book. I particularly liked the discussion of process mining and survival analysis, as I feel these techniques are often neglected in discussions of data science. I know that the author of the Lifelines library, Cameron Davidson-Pilon, has done some screencasts on Data Origami about this technique and its applications to, say, customer churn modelling, but this is the first time I saw it in a book aimed at Data Scientists.
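For readers who have not met survival analysis, here is a minimal sketch of the Kaplan-Meier estimator applied to churn. It is neither taken from the book nor the Lifelines API: it is a plain-Python illustration, and the customer durations and churn flags are made up. A customer who is still active when observation ends is treated as censored, so they count towards the at-risk pool but not as a churn event.

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] at each time a churn was observed.

    durations: how long each customer was observed (e.g. in months).
    churned:   True if the customer churned at that time, False if censored.
    """
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        time = events[i][0]
        churn_count = 0
        leaving = 0
        # Group all customers whose observation ends at the same time.
        while i < len(events) and events[i][0] == time:
            churn_count += events[i][1]
            leaving += 1
            i += 1
        if churn_count:
            survival *= 1 - churn_count / at_risk
            curve.append((time, survival))
        at_risk -= leaving  # churned and censored customers both leave the pool
    return curve


if __name__ == "__main__":
    # Made-up example: seven customers, tenure in months.
    months = [2, 3, 3, 5, 8, 8, 12]
    has_churned = [True, True, False, True, False, True, False]
    for time, prob in kaplan_meier(months, has_churned):
        print(f"month {time}: {prob:.3f} still subscribed")
```

The point of the estimator is that censored customers are not thrown away or counted as churners, which is what a naive churn rate tends to get wrong.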

I believe that Bart is an expert in risk modelling so there is a lot of discussion of financial services applications – this is fine and a good addition to the literature on data science, since a lot of the literature is focused on machine learning applications for social networking websites or the e-commerce sector. This last point may be due to the fact that Bart is based in Europe as opposed to the Bay Area.

An interesting addition to the data science literature is his applications chapter, which includes Business Process Analytics. As someone who has worked on some Business Process Mining, I’ve not seen many references to this field in the literature, and certainly none in book form, so this is a worthy addition.

The bad: The print of the book is terrible and the paper a colour that makes reading it extremely difficult. I also felt that the typeface for the mathematical equations was hard to read. This may not be Dr Baesens’ fault. I felt that some of the material was not new to me, but this is fine: I’ve probably got more experience in this sector than the target audience, which seems to be Masters or PhD students in STEM subjects who are about to finish and are considering a career in Data Science.

I would also like more discussion on how to present your ideas to clients but I guess this is for a separate book or a book on ‘Creating Data Products’.

Nevertheless I would recommend the book to any MSc or PhD students interested in a career in data science and any analysts like myself who want a good reference for Survival Analysis and Process Mining. I think those chapters and subchapters make this a worthy addition to my own library. I think also that the discussion of risk modelling and customer churn modelling is excellent, as it takes a bottom-up approach from the mathematical models and data processing to how a model could be produced and evaluated. Together with, say, a good Coursera course this could be an excellent preparation for interviews for Data Science roles.

Disclaimer: Dr Bart Baesens sent me a copy of this book for review but I have no stake in its success.

Business Analytics versus Data Science

When you work in the IT industry you often realize that a big challenge of getting to grips with the industry is learning the memes and buzzwords. One of the difficulties I came across recently was understanding the difference between Business Analytics and Data Science.

I won’t define data science here – primarily because it has no formal definition. Yet I will describe some of the differences that came up when I reached out to some luminaries on Social Media. 

I asked the question “What is the difference between Business Analytics and Data Science?” These are some of the answers I got.

Andrew Clegg (@andrew_clegg): “more fundamentally: testing hypotheses in controlled experiments (a/b etc) rather than just making changes and reporting results”

reporting vs predicting/recommending/explaining maybe?

Andrew Clegg is a Director of Data Science at Pearson Education, and has academic experience in Natural Language Processing and a strong Engineering background. He clearly still thinks somewhat like a scientist, and this is obvious from his own blog and his own presentations. I think sometimes though people bash ‘reporting’ because it has become such a bane of our professional existence. I live in the Financial Centre that is Luxembourg, and regularly hear my friends complain about the ‘reports’ they have to produce for regulatory compliance. This can be very descriptive work – but I still think that a good ‘data product’ can be in report form. 
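To make the “testing hypotheses in controlled experiments” point concrete, here is a minimal sketch of one common way to analyse an A/B test: a two-sided, two-proportion z-test. This is my illustration rather than anything Andrew described, and the conversion counts in the usage example are hypothetical.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns (z, p_value), using the pooled-proportion standard error.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 1000 visitors in each arm.
    z, p = two_proportion_z_test(100, 1000, 130, 1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The difference from plain reporting is that the comparison is set up before the change ships, so the result says something about the change itself rather than about whatever else happened that week.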

 

context :-) age and industry of practitioner, start-up vs. enterprise, etc. most of the time it’s a semantic difference 

John loves to cut through the bullshit – he has experience as an Analytics Consultant and now leads a team at Mailchimp. If I were willing to move to Atlanta I would probably pester him for a job interview :p I think he has a good point here about the semantic difference: at Amazon I met and worked with business analysts who were effectively doing data science, building predictive models of the changes in price elasticity of certain goods. That is not to say that they never had to build reports; a lot of their day-to-day work was ad-hoc analysis, which often means ad-hoc scripting to count something.

very similar. Typically biz analytics don’t build models. But they should. Too often biz analytics is report & slice/dice data

And the wonderful JD Long – whom I interviewed for this blog before (Data Artisan interview) – makes a good point: Biz Analytics should often build models rather than just report and slice/dice data. Personally, I think this is where the scientific aspect comes in, and it is a corollary of one of Andrew Clegg’s comments above.

I am often surprised by how decisions are made in business – one of the best parts of being a consultant is learning how important culture is. I sometimes hear the job title ‘data strategy consultant’, which strikes me as a data scientist working with companies to create a data-driven culture. Culture is a hard thing to change in an organization, even if you are a member of the C-suite. I will probably comment on that when I am old and wise enough to understand it.

So what do you do if you want to work on your predictive modelling skills?

Well, I can recommend one good set of lecture notes (a PDF) and one good book.

Applied Data Science Notes

Applied Predictive Modelling

Both are good hands-on introductions to the applied part of Data Science.

Which is thankfully more exciting than mere reporting.

P.S. Sometimes data scientists have to build the infrastructure themselves. I think that a data scientist should be capable of building a data mart for themselves, since one often doesn’t have the software engineering support to do that.