Speaking at the PyData Track at PyCon Sei in Florence, Italy

I’m happy to join the PyData speaking community by giving a talk at my first PyCon.

Here is the abstract and then some remarks :)

One of the biggest challenges we have as data scientists is getting our models into production. I’ve worked with Java developers to get models into production, and the libraries available in Java aren’t always the same as those in Python – for example, try porting scikit-learn code to Java. Possible solutions: PMML, or you write a spec for the developers to re-implement.

An even better solution: I will explain how to use Science Ops from YhatHQ to build better data products. Specifically, I will talk about how to use Python, Pandas, etc. to build a model, test it locally, and then deploy it so that developers get an easy-to-use RESTful API. I will share some of my experiences from working with it, and give a use case and some architectural remarks. I’ll also give a run-down of the alternatives to Science Ops that I’ve found.

Prerequisites – some experience with Pandas and the scientific Python stack would be beneficial. This talk is aimed at Data Science enthusiasts and professionals.
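
To make the “build a model locally, then deploy it behind a RESTful API” pattern from the abstract concrete, here is a minimal hand-rolled sketch. This is not Science Ops itself, and the file and feature names are hypothetical – it just shows the shape of the idea, using scikit-learn and Flask:

    import pandas as pd
    from flask import Flask, jsonify, request
    from sklearn.linear_model import LogisticRegression

    # Train locally on a (hypothetical) CSV of labelled data.
    df = pd.read_csv("training_data.csv")
    model = LogisticRegression()
    model.fit(df[["feature_1", "feature_2"]], df["label"])

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Developers POST JSON like {"feature_1": 1.2, "feature_2": 3.4}
        payload = request.get_json()
        features = [[payload["feature_1"], payload["feature_2"]]]
        return jsonify({"prediction": int(model.predict(features)[0])})

    if __name__ == "__main__":
        app.run()

Science Ops takes care of the hosting and operational side of this for you, which is exactly what makes it interesting.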

Firstly, you can check out www.pydata.it for the PyData-focused schedule.

Secondly, my slides are here: https://speakerdeck.com/springcoil/data-products-or-getting-models-into-production – you can look at them before the talk if you wish.

Finally, here is a link to the code I’ll mention in the talk – a simple example of how you would build an ODE model using the PyData stack. The code isn’t excellent, but it is functional and easy to read.

https://gist.github.com/springcoil/dacc5dcadc11d4165473
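
For a flavour of what an ODE model on the PyData stack looks like, here is a tiny hypothetical sketch (a logistic-growth equation – this is illustrative only, not the code in the gist):

    import numpy as np
    from scipy.integrate import odeint
    import matplotlib.pyplot as plt

    def logistic(y, t, r, K):
        # dy/dt for logistic growth with rate r and carrying capacity K
        return r * y * (1 - y / K)

    t = np.linspace(0, 10, 200)                      # time grid
    y = odeint(logistic, 1.0, t, args=(1.5, 100.0))  # initial population of 1

    plt.plot(t, y)
    plt.xlabel("time")
    plt.ylabel("population")
    plt.show()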

Interview with a Data Scientist: Ignacio Elola

As part of my ongoing series of interviews with Data Scientists and Data Analysts, I present an interview with Ignacio Elola, who is the data scientist at Import.io. Import.io is a web platform that gives you easier access to web data, and is one of the data-scientist-enabling tools we see on the market. I was also an intern at this startup a few years ago! Ignacio has been a cool Twitter buddy over the last few years, and I’ve found his tweets a useful addition to my reading list. So without further ado, here we go :)

1. What project that you have worked on do you wish you could go back to and do better?
All of them. I’m constantly learning and improving, and if I could go back I could do all past projects much better. That doesn’t mean I wish to re-do all past projects – when something is working, it’s working and it’s done – but for important projects it is good practice, in my opinion, to keep iterating and re-factoring code, as every month I learn something new that could help me do it better.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Two words: do it. The only way to really learn something is by doing, so be proactive and start getting things done and learning in the process. I would also advise against specializing too much in something unless you have things very clear – a generalist can always specialize in something later on, but the other way around is harder. Plus, it is very beneficial in any early-stage career to learn as much as possible from related disciplines and from the business side, not only the algorithms or statistics you are working on. Know your environment and learn from everybody around you.
3. What do you wish you knew earlier about being a data scientist?
I really haven’t found any bad surprises on my journey – things that I wish I had known earlier. I think keeping an open-minded approach about your role, your company and everything else helps a lot with this.
4. How do you respond when you hear the phrase ‘big data’?
Well, I think “big data” really changed the data and technology space in terms of what tools (databases, search indexes and so on) we need to use to deal with these amounts of data. But the real revolution it started is a mentality revolution: the “all data is useful” thinking, the data-driven approach to decision making… it is all related, and we can see how it is already having a real impact in startups, medium-sized companies and big enterprises. That is an approach that can be used with “big” or “small” data – it doesn’t matter, and most of the time people actually work with small or medium data; not that many companies are actually doing “big data”. But that is okay!
5. What is the most exciting thing about your field?
The thing I find most exciting is being able to work with different teams and departments and helping everyone in their decision process by using data. I just love to improve processes and open everybody’s mind to the data-driven world!
Having the freedom to come up with new ideas and projects that create value out of the data you have, in unexpected ways, is also something very challenging but rewarding, and I think it is a must-have in any data science role.
6. How do you go about framing a data problem?
The starting point needs to be the business: what question are you trying to answer? I’m very pragmatic in framing data problems, and very output-oriented. The first thing is to formulate a question that makes sense and that will help you in some way, and to understand the business problem you are trying to solve or improve – otherwise you won’t be able to know how good your answer is later on!
Then it is the turn of the data itself: what data do you have, how can you use it to answer that question, and how close can you get to answering it? Which algorithm to use, or how to clean the data, are things of technical difficulty, but ones where you’ll find many resources to help you along the way: courses, books, tutorials, blogs… That’s why I find those first steps the most important ones.

Interview with a Data Scientist: Thomas Levi

This is part of my ongoing series of interviews with Data Scientists. Thomas Levi is a Data Scientist at Plenty of Fish (POF), an online dating website based in Vancouver. Thomas has a background in theoretical physics and at one point did string theory. Recently a lot of his work has involved topic models and other cool algorithms.

I present here a lightly edited version of his interview.

  1. What project that you have worked on do you wish you could go back to and do better?

You know, that’s a really hard question. The one that comes to mind is a study I did on user churn. The idea was to look at the first few hours or days of a user’s experience on the site and see if I could predict, at various windows, whether they would still be active. The goal here was twofold: first, to make a reasonably good predictive model, and second, to identify the key actions users take that lead to them becoming engaged or deleting their account, in order to improve the user experience. The initial study I did was actually pretty solid. I had to do a lot of work to collect, clean and process the data (the sets were very large, so parallel querying, wrangling and de-normalization came into play) and then build some relatively simple models. Those models worked well and offered some insights. The study sort of stopped there though, as other priorities took over, and the thinking was that I could always go back to it. Then we switched over our data pipe, and inadvertently a large chunk of that data was lost, so the studies can’t be repeated or revisited. I wish I could go back and either save the sets, make sure we didn’t lose them, or have done more initial work to advance it. I still hope to get back to that someday, but it’ll require our new data pipe to be fully in place.

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Learn statistics. It’s the single best piece of advice I can offer. So many people equate Data Science with Machine Learning, and while that’s a large part of it, stats is just as, if not more, important. A lot of people don’t realize that machine learning is basically computational techniques for model fitting in statistics. A solid background in statistics really informs choices about model building, as well as opening up whole other fields like experiment design and testing. When I interview or speak to people, I’m often surprised by how many can tell me about deep learning, but not about basic regression or how to run a split test properly.

  3. What do you wish you knew earlier about being a data scientist?

See above – I wish I knew more stats when I first started out. The other thing I’m still learning myself is communication. Coming from an academic background, I had a lot of practice giving talks and teaching. Nearly all of that, however, was to other specialists, or at least to others with a similar background. In my current role, I interact a lot more with people who aren’t PhD-level scientists, or aren’t even technical. Learning to communicate with them and still get my points across is an ongoing challenge. I wish I had had a bit more practice with those sorts of things earlier on.

  4. How do you respond when you hear the phrase ‘big data’?

Honestly? I shudder. That phrase has become such a buzzword it’s pretty much lost all meaning. People throw it around pretty much everywhere at this point. The other bit that makes me shudder is when people tell me all about the size of their dataset, or how many nodes are in their cluster. Working with a very large amount of data can be exciting, but only insofar as the data itself is. I find there’s a growing culture of people who think the best way to solve a problem is to add more data and more features, which falls into the trap of overfitting and overly complicated models. There’s a reason things like sampling theory and feature selection exist, and it’s important to question whether you’re using a “big data” set because you really need it for the problem at hand, or because you want to say you used one. That said, there are some problems and algorithms that require truly huge amounts of input, or where aggregating and summarizing requires processing a very large amount of raw data, and then it’s definitely appropriate.

I suppose I should actually define the term as I see it. To me, “big data” is any data size where the processing, storing and querying of the data becomes a difficult problem unto itself. I like that definition because it’s operational: it defines when I need to change the way I think about and approach a problem. I also like it because it scales: while today a data set of a particular size might be a “big data” set, in a few years it won’t be, and something else will. My definition will still hold.

  5. What is the most exciting thing about your field?

What excites me is applying all of this machinery to actual real-world problems. To me, it’s always about the insights and applications to our actual human experience. At POF that comes down to seeing how people interact, match up, etc. It’s most exciting to me when those insights bump up against our assumed standard lore. Moving beyond POF, I see the same sort of approach in a lot of other really interesting areas, whether it be baseball stats, political polling, healthcare, etc. There are a lot of really interesting questions about the human condition that we can start to address with Data Science.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations, etc.? How do you know what is good enough?

I think it comes down to the problem itself, and as part of that I mean what the business needs are. There have been times where something that just worked decently was needed on a very short time frame, and other times where designing the best system was the deciding factor (e.g. when I built automatic scam and fraud detection, which took about six months). For every problem or task at hand, I usually try to scope the requirements and the desired performance constraints and then start iterating. Sometimes the simplest model is accurate enough, and the cost-benefit of spending a large chunk of time for a small marginal gain really isn’t worth it. The other issue is whether this is a model for a report versus something that has to run in production, like scam detection. Designing for production adds on a host of other requirements, from performance to specific architectures and much more stringent error handling, which greatly increases the time spent. It’s also important to remember that there’s nothing preventing you from going back to a problem. If the original model isn’t quite up to snuff anymore, or someone wants more accurate predictions, you just go back. To that end, well-documented, version-controlled code and notebooks are essential.

An extension of the Data Science process – OSEMIC

One of the most famous taxonomies of the data science process is OSEMN, pronounced ‘Awesome’.

It stands for Obtain, Scrub, Explore, Model, iNterpret.

I was recently chatting to some data scientists on Twitter, and they pointed out: shouldn’t it be OSEMIC?

Obtain, Scrub, Explore, Model, Interpret and Communicate!!!

I hadn’t thought of this, but I agree it is part of the process: interpretation by a specialist like myself isn’t the full battle – it needs to be translated into something that business stakeholders can understand. And the challenge is not to lose them with ‘this is the R^2 part’.

I think this ‘last mile’ problem of data science is a real challenge: how do you turn something as complicated as a machine learning model or a differential equation model into something that stakeholders can act on? I suspect that this is even harder than learning the mathematics or the programming. I think data scientists can also learn a lot from storytellers such as journalists and designers.

Thanks to everyone who contributed ideas for this post.

The challenge of Data Science

I recently saw this – https://dartthrowingchimp.wordpress.com/2015/03/19/data-science-takes-work-too/ – which is basically an article about the workload involved in Data Science.

This is a personal and opinionated piece, and all my views are my own and do not reflect anyone else’s. Yet I feel strongly, as a working Data Analyst, that one of the real unseen challenges is communicating – or getting people to appreciate – the hard work involved. So I welcome articles like this.

I have personally seen a situation where confusion about what a ‘model’ was led to a very difficult work environment for me. These mis-calibrated expectations – that it would just be ‘magic’, or like shipping a feature – put an unrealistic load on me.

Now maybe one of the things that data scientists must do is ‘explain’ the difficulty and the challenge. Today, for instance, it took me 3 hours to produce a relatively simple bar chart – partly because of the difficulty in finding the data, adjusting the axes, etc.

This was not an automated, scripted process; it was a bespoke data visualization I developed to help share with colleagues and stakeholders the story of the department I am currently in – their challenges and key performance indicators.

I think what is often not acknowledged is just how complicated software and data analysis are – they take a mixture of hard work, domain expertise, data visualization and modelling – and all of these things are changing. I’ve built complicated models and reporting that needed to be changed after 3 months because an API or a database changed!

So I think we should share more of our challenges, our frustrations and our success stories. Our success stories should also not be told as if we are geniuses – we are just humans with rare and valuable skills.

This should be explained constantly to stakeholders, and perhaps one of the things we can do is get our colleagues to sit with us through a data analysis project or mini-project, rather than just barking unrealistic expectations at us :)

I’m still thinking about this, but as Jay says: ‘I think most people who don’t do this work simply have no idea.’

Perhaps the lesson here is this: never underestimate the skill and craft of those you work with, and learn how valuable it is without making lots of assumptions.

A Bayesian Hierarchical model for the Six Nations.

I’m a data scientist and a massive rugby fan. I recently built a Bayesian model, based on some papers I found on Bayesian models for soccer. The basic idea is to simulate the results of the six teams based on historical data, with the model taking home advantage into account. I suggest you read the notebook, and I’ll stick the code on GitHub soon.
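
For a flavour of the structure before you open the notebook, here is a minimal PyMC3 sketch of the kind of model I mean – Poisson-distributed points with per-team attack and defence strengths plus a shared home-advantage term. The data and names below are hypothetical placeholders, not the notebook’s actual code:

    import numpy as np
    import pymc3 as pm

    # Hypothetical match data: team indices 0-5 and points scored per game.
    home_team = np.array([0, 1, 2, 3, 4, 5])
    away_team = np.array([1, 2, 3, 4, 5, 0])
    home_points = np.array([23, 16, 19, 26, 13, 21])
    away_points = np.array([13, 9, 22, 3, 20, 16])
    n_teams = 6

    with pm.Model() as model:
        home = pm.Normal("home", mu=0.0, sd=1.0)           # shared home advantage
        intercept = pm.Normal("intercept", mu=3.0, sd=1.0)
        atts = pm.Normal("atts", mu=0.0, sd=1.0, shape=n_teams)  # attack strength
        defs = pm.Normal("defs", mu=0.0, sd=1.0, shape=n_teams)  # defence strength

        # Log-linear scoring rates for the home and away sides of each match.
        home_theta = pm.math.exp(intercept + home + atts[home_team] - defs[away_team])
        away_theta = pm.math.exp(intercept + atts[away_team] - defs[home_team])

        pm.Poisson("home_obs", mu=home_theta, observed=home_points)
        pm.Poisson("away_obs", mu=away_theta, observed=away_points)

        trace = pm.sample(2000, tune=1000)

From the posterior samples you can then simulate full tournaments and count how often each team finishes top of the table.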

The sad fact, as an Irish fan, is that England win the Six Nations in the majority of cases, based on this model and simulation.

Enjoy! I’ll probably write up a tutorial about this in the future, but I found it a useful exercise.

You can see the IPython Notebook here

And blog post here http://springcoil.github.io/Bayesian_Model.html

Interview with a Data Expert – Kevin Hillstrom

This interview is with Kevin Hillstrom, whom I’ve found illuminating since I started following him on Twitter. He’s an analyst who stepped up the corporate ladder a bit and now helps companies with their data strategy and with understanding their data better. I emailed him these interview questions a few weeks ago and I’ve lightly edited his answers.

What I liked about this interview was that Kevin focused on the soft skills – I feel we as a data science community speak too much about the technical skills.

What is the biggest misunderstanding in “big data” and “data science”?

  • To me, it is the “we’re going to save the world with data” mentality. I like the optimism, that’s good! I do not like the hype.


Describe the three most underrated skills of a good analyst and how does an analyst learn them?

  • The first underrated skill is selling. An analyst must learn how to sell ideas. My boss sent me to Dale Carnegie training, a course for sales people. The skills I learned in that class are invaluable.
  • The second underrated skill is accuracy. I work with too many analysts who do all of the “big data” stuff, but then run incorrect queries and, as a result, lose credibility with those they are analyzing data for.
  • The third underrated skill is business knowledge. So many analysts put their heart and soul into analyzing stuff. They could put some of their heart and soul into understanding how their business behaves. Knowledge of the business really influences how one approaches analyzing issues.

How do you clearly explain the context of a data problem to a skeptical stakeholder?

  • To me, this is where knowledge of the business is really important. So many of my mistakes happened when I cared about the data and the analysis, and did not care enough about the business. I once worked for a retail business that only had twenty-four months of data. That was a big problem, given that the company had been in business for fifty years. Nobody, and I mean nobody, cared. I explained repeatedly how I was unable to perform the work I wanted to perform. Nobody cared. When I shifted my message to what I was able to do for a competing retailer who had ten years of data, then people cared. They cared because their business was not competitive with a business they all knew. Then folks wanted to compete, and we were able to build a new database with many years of data.

What is the best question you’ve ever been asked in your professional career?

  • A high level Vice President once listened to a presentation, and then said to me, “Who cares?” The executive went on to say that I was only sharing trivia. He told me that unless I had facts and information that he could act upon, he didn’t want me to share anything. This is a good lesson. Too often, we share information because we were able to unearth an interesting nugget in the database. But if the information is “nice to know”, it doesn’t help anybody. It is better to share a simple fact that causes people to change than to share interesting facts that nobody can use to improve the business.

What is the best thing – in terms of career acceleration – you’ve ever been told in your professional career?

  • Ask to be promoted to your next job. I had a boss who, in the 9th year of my career, asked me what I wanted to do next. So I told my boss – the job was outside of my area of experience, to be honest, and it was a major promotion. I described why I wanted the job, I described how I would do the job differently, and I described my vision for how I would make the company more profitable. Within twelve months, I was promoted into the job. My goodness, were people upset! But it was a major lesson. When somebody asks you what you want to do next in your career, be ready to offer a credible answer. Maybe more important, be ready to share your answer even if nobody asks you the question! Tell people what your next job looks like, tell people your vision for that job, tell people how the company benefits, and then do work that proves you are ready for an audacious promotion!

About Kevin: Kevin is President of MineThatData, a consultancy that helps CEOs understand the complex relationship between Customers, Advertising, Products, Brands, and Channels. Kevin supports a diverse set of clients, including internet startups, thirty million dollar catalog merchants, international brands, and billion dollar multichannel retailers. Kevin is frequently quoted in the mainstream media, including the New York Times, Boston Globe, and Forbes Magazine.

Prior to founding MineThatData, Kevin held various roles at leading multichannel brands, including Vice President of Database Marketing at Nordstrom, Director of Circulation at Eddie Bauer, and Manager of Analytical Services at Lands’ End.