Some quick book recommendations


I don’t have time to write a long post – so I’ll just mention en passant some books that I’ve read recently.

The Myths of Innovation is a brilliant explanation of the challenges of innovation and how it actually works in the real world.

Make it Stick is a great summary of recent research on how we learn. Since we are all knowledge workers now – and thus all ‘continuous learners’ – I highly recommend you read this book before scoping your next learning challenge. Some of it is common sense and some of it is academic, but it is a great compendium of academic research, and I’ve made copious notes!


Speaking at the PyData Track at PyCon Sei in Florence, Italy


I’m happy to join the PyData speaking community by giving a talk at my first PyCon.

Here is the abstract and then some remarks 🙂

One of the biggest challenges we have as data scientists is getting our models into production. I’ve worked with Java developers to get models into production, and the libraries we rely on in Python don’t always have Java equivalents – try porting scikit-learn code to Java, for example. Possible solutions: PMML, or you write a spec yourself.
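As a rough illustration of the PMML route (this isn’t from the talk, and the package choice is my own assumption), the sklearn2pmml package can export a fitted scikit-learn pipeline to a PMML file that a JVM-side scorer such as JPMML can load:

# Hypothetical sketch: exporting a scikit-learn model to PMML for a Java service.
# Assumes the sklearn2pmml package (which itself needs a local Java runtime).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

df = pd.read_csv("training_data.csv")              # hypothetical training set
X, y = df.drop("label", axis=1), df["label"]

pipeline = PMMLPipeline([("classifier", LogisticRegression())])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "model.pmml")               # Java developers load this file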

An even better solution: I will explain how to use Science Ops from YhatHQ to build better data products. Specifically, I will talk about how to use Python, Pandas etc. to build a model, test it locally, and then deploy it so that developers get an easy-to-use RESTful API. I will share some of my experiences from working with it, give a use case and some architectural remarks, and run down the alternatives to Science Ops that I’ve found.
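For readers who can’t use Science Ops, the underlying pattern is the same whichever tool you pick: build and test the model locally, then put a thin REST layer in front of it. Here is a minimal sketch of that pattern with scikit-learn and Flask – my own illustration, not the talk’s code and not the Science Ops API:

# Minimal sketch of the train-locally / serve-over-REST pattern (illustrative only).
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier

# Train and test the model locally on a (hypothetical) historical dataset.
df = pd.read_csv("historical_data.csv")
X, y = df.drop("target", axis=1), df["target"]
model = RandomForestClassifier(n_estimators=100).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Developers POST a JSON record and get a prediction back.
    record = pd.DataFrame([request.get_json()]).reindex(columns=X.columns)
    return jsonify({"prediction": model.predict(record).tolist()[0]})

if __name__ == "__main__":
    app.run(port=5000)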

Prerequisites: some experience with Pandas and the scientific Python stack would be beneficial. This talk is aimed at data science enthusiasts and professionals.

Firstly, you can check out www.pydata.it for the PyData-focused schedule.

Secondly, my slides are here: https://speakerdeck.com/springcoil/data-products-or-getting-models-into-production – you can look at them before the talk if you wish.

Finally, here is a link to the code I’ll mention in the talk – a simple example of how you would build an ODE model using the PyData stack. The code isn’t excellent, but it is functional and easy to read.

https://gist.github.com/springcoil/dacc5dcadc11d4165473
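The gist at the link is the real example; just to give a flavour of what an ODE with the PyData stack can look like, here is a minimal sketch of my own (a logistic-growth equation, not necessarily the model in the gist):

# Minimal sketch: integrating a simple ODE (logistic growth) with NumPy,
# SciPy and Pandas. Illustrative only – see the gist above for the talk's code.
import numpy as np
import pandas as pd
from scipy.integrate import odeint

def logistic(y, t, r=0.5, K=100.0):
    # dy/dt = r * y * (1 - y / K)
    return r * y * (1.0 - y / K)

t = np.linspace(0, 30, 300)
y = odeint(logistic, 1.0, t, args=(0.5, 100.0))

results = pd.DataFrame({"t": t, "population": y.ravel()})
print(results.tail())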

Interview with a Data Scientist: Ignacio Elola


As part of my ongoing series of interviews with Data Scientists and Data Analysts, I present an interview with Ignacio Elola, the data scientist at Import.io. Import.io is a web platform that gives you easier access to web data, and is one of the data-scientist-enabling tools we see on the market. I was also an intern at this startup a few years ago! Ignacio has been a good Twitter buddy over the last few years, and I’ve found his tweets a useful addition to my reading list. So without further ado, here we go 🙂

1. What project have you worked on that you wish you could go back to and do better?
All of them. I’m constantly learning and improving, and if I could go back I could do all my past projects much better. That doesn’t mean I wish to redo every past project – when something is working, it’s working and it’s done – but for important projects it is good practice, in my opinion, to keep iterating and refactoring the code, as every month I learn something new that could help me do it better.
2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?
Two words: do it. The only way to really learn something is by doing, so be proactive, start getting things done, and learn in the process. I would also advise against specializing too much in something unless you are already very clear about what you want – a generalist can always specialize later on, but the other way around is harder. It is also hugely beneficial at any early career stage to learn as much as possible from related disciplines and from the business side, not only the algorithms or statistics you are working on. Know your environment and learn from everybody around you.
3. What do you wish you knew earlier about being a data scientist?
I really haven’t found any bad surprises on my journey – things I wish I had known earlier. I think keeping an open mind about your role, your company and everything else helps a lot with this.
4. How do you respond when you hear the phrase ‘big data’?
Well, I think “big data” really changed the data and technology space in terms of the tools (databases, search indexes, and so on) we need to deal with those amounts of data. But the real revolution it started is a mentality revolution: the “all data is useful” thinking, the data-driven approach to decision making… it is all related, and we can already see it having a real impact in startups, medium-sized companies and big enterprises. That approach applies to “big” or “small” data alike – it doesn’t matter, and most of the time people actually work with small or medium data; not that many companies are really doing “big data”. But that is okay!
5. What is the most exciting thing about your field?
The thing I find most exciting is being able to work with different teams and departments and help everyone in their decision processes by using data. I just love improving processes and opening everybody’s minds to the data-driven world!
Having the freedom to come up with new ideas and projects that create value out of the data you have, in unexpected ways, is also very challenging but rewarding, and I think it is a must-have in any data science role.
6. How do you go about framing a data problem?
The starting point needs to be the business: what question are you trying to answer? I’m very pragmatic in framing data problems, and very output-oriented. The first thing is to formulate a question that makes sense and that will help you in some way, and to understand the business problem you are trying to solve or improve – otherwise you won’t be able to know how good your answer is later on!
Then it’s the turn of the data itself: what data do you have, how can you use it to answer that question, and how close can you get to answering it? Which algorithm to use, or how to clean the data, are matters of technical difficulty, but ones where you’ll find many resources to help you along the way: courses, books, tutorials, blogs… That’s why I find those first steps the most important ones.

Interview with a Data Scientist: Thomas Levi


This is part of my ongoing series of interviews with Data Scientists. Thomas Levi is a Data Scientist at Plenty of Fish (POF), an online dating website based in Vancouver. Thomas has a background in theoretical physics and at one point worked on string theory. Recently a lot of his work has involved topic models and other cool algorithms.

I present here a lightly edited version of his interview.

1. What project have you worked on that you wish you could go back to and do better?

You know, that’s a really hard question. The one that comes to mind is a study I did on user churn. The idea was to look at the first few hours or days of a user’s experience on the site and see if I could predict, at various windows, whether they would still be active. The goal here was twofold: first, to make a reasonably good predictive model, and second, to identify the key actions users take that lead to them becoming engaged or deleting their account, in order to improve the user experience. The initial study I did was actually pretty solid. I had to do a lot of work to collect, clean and process the data (the sets were very large, so parallel querying, wrangling and de-normalization came into play) and then build some relatively simple models. Those models worked well and offered some insights. The study sort of stopped there though, as other priorities took over, and the thinking was that I could always go back to it. Then we switched over our data pipe, and inadvertently a large chunk of that data was lost, so the studies can’t be repeated or revisited. I wish I could go back and either have saved the sets so we didn’t lose them, or have done more initial work to advance the study. I still hope to get back to it someday, but that will have to wait until our new data pipe is fully in place.

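(A quick aside from me: the kind of early-activity churn model Thomas describes can be sketched very simply – the column names below are hypothetical and this is my own illustration, not POF’s pipeline.)

# Rough sketch of an early-activity churn model: given features from a user's
# first day, predict whether they are still active some weeks later.
# Hypothetical column names; illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("first_day_activity.csv")      # hypothetical extract
features = ["messages_sent", "profile_views", "photos_uploaded"]
X, y = df[features], df["active_after_30_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(dict(zip(features, model.coef_[0])))      # which early actions matter most

The real study clearly involved far more data collection and wrangling than this, but the modelling core can stay that simple.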
2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

Learn statistics. It’s the single best piece of advice I can offer. So many people equate Data Science with Machine Learning, and while that’s a large part of it, stats is just as, if not more, important. A lot of people don’t realize that machine learning is basically a set of computational techniques for model fitting in statistics. A solid background in statistics really informs choices about model building, as well as opening up whole other fields like experiment design and testing. When I interview or speak to people, I’m often surprised by how many can tell me about deep learning, but not about basic regression or how to run a split test properly.
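(To make that last point concrete: a properly run split test can be as simple as a two-sample proportion test – a minimal sketch with statsmodels, with the numbers invented for illustration.)

# Minimal split-test example: did variant B convert better than variant A?
# The counts are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]    # converted users in variants A and B
visitors = [2400, 2500]     # users exposed to variants A and B

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# Decide the sample size and stopping rule before the experiment,
# rather than peeking at the p-value as the data comes in.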

3. What do you wish you knew earlier about being a data scientist?

See above: I wish I had known more stats when I first started out. The other thing I’m still learning myself is communication. Coming from an academic background, I had a lot of practice giving talks and teaching. Nearly all of that, however, was to other specialists, or at least to others with a similar background. In my current role, I interact a lot more with people who aren’t PhD-level scientists, or aren’t even technical. Learning to communicate with them and still get my points across is an ongoing challenge. I wish I had had a bit more practice with that sort of thing earlier on.

4. How do you respond when you hear the phrase ‘big data’?

Honestly? I shudder. That phrase has become such a buzzword it’s pretty much lost all meaning. People throw it around pretty much everywhere at this point. The other bit that makes me shudder is when people tell me all about the size of their dataset, or how many nodes are in their cluster. Working with a very large amount of data can be exciting, but only insofar as the data itself is. I find there’s a growing culture of people who think the best way to solve a problem is to add more data and more features, which falls into the trap of overfitting and overly complicated models. There’s a reason things like sampling theory and feature selection exist, and it’s important to ask whether you’re using a “big data” set because you really need it for the problem at hand, or because you want to say you used one. That said, there are some problems and algorithms that require truly huge amounts of input, or where aggregating and summarizing requires processing a very large amount of raw data, and then it’s definitely appropriate.

I suppose I should actually define the term as I see it. To me, “big data” is any data size where the processing, storing and querying of the data becomes a difficult problem unto itself. I like that definition because it’s operational: it tells me when I need to change the way I think about and approach a problem. I also like it because it scales – while today a data set of a particular size might count as “big data”, in a few years it won’t and something else will, but my definition will still hold.

5. What is the most exciting thing about your field?

What excites me is applying all of this machinery to actual real-world problems. To me, it’s always about the insights and applications to our actual human experience. At POF that comes down to seeing how people interact, match up, etc. It’s most exciting to me when those insights bump up against our assumed standard lore. Moving beyond POF, I see the same sort of approach in a lot of other really interesting areas, whether it be baseball stats, political polling, healthcare, etc. There are a lot of really interesting questions about the human condition that we can start to address with Data Science.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations, and how do you know what is good enough?

I think it comes down to the problem itself, and as part of that I mean what the business needs are. There have been times where something that just worked decently was needed on a very short time frame, and other times where designing the best system was the deciding factor (e.g. when I built automatic scam and fraud detection, which took about six months). For every problem or task at hand, I usually try to scope the requirements and the desired performance constraints and then start iterating. Sometimes the simplest model is accurate enough, and the cost-benefit of spending a large chunk of time for a small marginal gain really isn’t worth it. The other issue is whether this is a model for a report versus something that has to run in production, like scam detection. Designing for production adds a host of other requirements – performance, specific architectures and much more stringent error handling – which greatly increases the time spent. It’s also important to remember that there’s nothing preventing you from going back to a problem. If the original model isn’t quite up to snuff anymore, or someone wants more accurate predictions, you just go back. To that end, well-documented, version-controlled code and notebooks are essential.