Deep Learning on AWS

This will be a very short post, but yesterday I spent some time trying to manually set up a GPU instance on AWS. My reflections:

  1. Use someone else’s AMI – see http://erikbern.com/2015/11/12/installing-tensorflow-on-aws/ for example.
  2. I had a lot of trouble getting NVIDIA’s cuDNN to work with the newest Keras. I expect this to improve in the next release.
  3. Use spot instances – they are much cheaper! (See the sketch after this list.)
  4. There is some cumbersome setup cost with modern deep learning frameworks, so using a GPU is still a long way from being easy.
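
On the spot-instance point, a request can be scripted with boto3. Here’s a rough sketch – the AMI ID, instance type, key name, and bid price are all placeholders, not recommendations:

```python
# A rough sketch of requesting a GPU spot instance with boto3.
# The AMI ID, key pair, and bid price below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.25",  # maximum bid in USD per hour (placeholder)
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-12345678",     # e.g. a community deep learning AMI
        "InstanceType": "g2.2xlarge",  # a GPU instance type
        "KeyName": "my-key-pair",      # placeholder key pair name
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```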


The Setup

The Setup has always been one of my favorite sites on the internet. I love seeing how other people – in vastly different careers – get their work done. Though I don’t craft Chinese soldiers out of cardboard or anything nearly that fascinating, I thought it would be a fun exercise to put together my own version.

Who are you, and what do you do?

I’m Peadar Coyle, and I’m a data scientist based in Luxembourg; until recently I was at Vodafone as a Quantitative Analyst in their Energy team. As you might expect, there are many people out there with that title, and many do quite different work. My career has been varied so far, but I’m predominately a type A (for insights) data scientist, which means I spend half of my time coding and prototyping models to provide insights for business stakeholders. I’m working hard on improving my development skills so that I can deliver robust, working code in production. My intellectual background is in Physics and Mathematics.

I enjoy talking (as all Irish people do 🙂 ) so I regularly share my knowledge at conferences such as PyData.

What hardware do you use?

I use (and adore) my Leuchtturm notebook (8″, with dots) for taking notes during phone calls, meetings, and any other times when typing on a laptop feels out of place or unnecessary. It’s a fantastic thought-collector for all manner of doodles, brainstorms, projects, and data visualisations. In that notebook (and everywhere else, really), I’ll write with whatever is around, but my preference is for ultra-fine gel ballpoints.

Until recently I was using Moleskines, but I found them a tad expensive for their quality.


I carry a Samsung Galaxy J everywhere for all the uses in the world (+ multi-factor auth all the things). The battery is absolutely terrible, so I always keep a portable battery in my bag. That might actually be one of the most worthwhile 25 euros I’ve ever spent.

My home machine is a MacBook Pro (Retina, 15-inch, Mid 2014) with 16GB of RAM. This is a pretty hefty machine and quite difficult to carry around, but the retina screen is awesome.

For cloud computing (that counts as hardware, right?) I use EC2 and S3 on AWS. For certain problems, like Kaggle competitions or other complicated tasks, I’ll use the most powerful machine I can get my hands on 🙂

And what software?

This is where I spend most of my time. I try out lots of tools to make my work (and life) easier. For me, “easier” is always a balance between “more tools that each do one thing well” and “fewer tools that each do all sorts of things.” It’s a constant work in progress.

I’m still using OS X 10.10 (Yosemite). When it comes to my work system, I’m rarely an early adopter because new OS updates always break environments.

I probably spend 50% of my time in OS X’s Terminal. Most of that time is spent in Vim. I write most things there: code (mostly Python and bash), documents (Markdown, text, and TeX), etc. The Solarized (dark) theme gives nice syntax-highlighting contrast, and also keeps my eyes from getting tired (this will be a recurring theme). I keep meaning to try out iTerm but haven’t gotten around to it. I spend a lot of time working on remote Linux servers, so I tend to keep it simple (and similar) on my own machine. I’ll occasionally try to learn Emacs – and then give up and go back to Vim.

I’d guess the next 45% of my time is spent in Chrome. Among all the articles I’ve opened to read (but will inevitably drop into the Pocket black hole), you’ll pretty much always find some combination of tabs open that include: all the Google Apps (mail, cal, drive, and a handful of docs), StackOverflow, the Python docs, GitHub, Slack, Trello, Twitter, and often a Wikipedia page or two about whatever concept or technique I’m trying to grok at the moment. I’ve recently started using Safari Books, which is an expensive investment but strikes me personally as a worthwhile one.

I recommend that any data geek wanting to improve their productivity learn sed and awk, and also use csvkit, which I couldn’t live without.

I also use a bunch of extensions because efficiency makes me incredibly happy: JSONView & XML Tree (prettify API responses), Markdown Reader (live rendering of local .md files – usually how I write and review these posts), Pocket (save-for-later), and Tab-Snap (store a giant tab list as a restorable text file).

The last 5% of my time is spent switching between a host of other apps: Wunderlist (daily note-taking and long-term reference storage), Slack (team/org communication), Gimp (for my amateur image-creation needs), Slides (for important presentations, GDocs for less important ones), and Toggl (time tracking; incredibly enlightening if you’ve never tried it). I also use Jupyter a lot, but recently I’ve been moving to PyCharm, since I’m trying to write less ad-hoc stuff and more Python modules. Since I’m trying to learn Scala at the moment I’ve been using IntelliJ, which is an awesome IDE. I honestly don’t know how anyone codes in a JVM language without a good IDE.

There are a handful of other apps that are hugely valuable and always running in the background, too: Dropbox (for both personal syncing – Camera Uploads! – and quick file sharing), Skype, and f.lux (adjusts your display’s color temperature and helps reduce eye strain when working at night).

What would be your dream setup?

Although I’m close to it already, a few small changes would be a nice start: a not-yet-possible 13″ MacBook Air with the specs of the burly 15″ Retina MBP, a pair of those magical Bose headphones, a couple of 27″ displays, and a beautiful, automatic sit-to-stand desk.

Three things I wish I knew earlier about Machine Learning

I’ve been working with Machine Learning models in both academic and industrial settings for a few years now. I’ve recently been watching the excellent Scalable ML video series from Mikio Braun, partly to learn some more about Scala and Spark.

His video series talks about the practicalities of ‘big data’, and it made me think about what I wish I had known earlier about Machine Learning:

  1. Getting models into production is a lot more than just microservices
  2. Feature selection and feature extraction are really hard to learn from a book
  3. The evaluation phase is really important

I’ll take each in turn.

Getting models into production is a lot more than just microservices

I gave a talk on data products and getting Ordinary Differential Equations into production. One thing that I didn’t realise until some time afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, DevOps, etc. all by yourself. When I first used ScienceOps from Yhat I underestimated how awesome it is. I struggle to find a direct competitor on the market right now, and I really think they’ve nailed a fascinating problem. Increasingly I realise I’m just not smart enough to deal with ops stuff – so I’m glad if I can outsource it 🙂

Feature selection and feature extraction are really hard to learn

Something that I tried, and failed, to learn from a book is feature selection and extraction. These skills are only really learned through Kaggle competitions and real-world projects: you pick up the various tricks and methods by implementing and using them yourself. This eats up a lot of the data science workflow. In the new year I’ll probably write a blog post just on feature extraction and feature selection.
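
As a flavour of what I mean, here’s a minimal univariate feature-selection sketch using scikit-learn. The dataset and choice of scorer are purely illustrative – the hard part, knowing which technique fits your data, is exactly what a snippet can’t teach:

```python
# A minimal univariate feature-selection sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the target and keep the two strongest.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.scores_)                 # per-feature ANOVA F-scores
```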

The evaluation phase is really important

Unless you apply your models to test data, you’re not doing predictive analytics. Evaluation techniques such as cross-validation and well-chosen metrics are invaluable, as is simply splitting your data into training and test sets. Life often doesn’t hand you a dataset with these things defined, so there is a lot of creativity and empathy involved in defining these two sets on a real-world dataset. There is a great set of posts on Dato about the challenges of model evaluation.
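
As a bare-bones illustration, here is a sketch using scikit-learn’s model_selection API, with a toy dataset standing in for real data:

```python
# A bare-bones sketch of a train/test split plus cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross-validate on the training data only; score the held-out test set last.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV mean:", cv_scores.mean(), "test accuracy:", model.score(X_test, y_test))
```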

I think the explanations by Mikio Braun are worth a read. I love his diagrams too, so I include one here in case you’re not familiar with training sets and testing sets.

[Image: 3t-evaluation.png – Mikio Braun’s diagram of training and test sets in model evaluation]

Source: Mikio Braun 2015

We often don’t discuss the evaluation of models in papers, at conferences, or even when we talk about which techniques we used to solve problems. ‘We used SVM on that’ doesn’t really tell me anything: it doesn’t tell me your data sources, your feature selection, your evaluation methods, how you got it into production, or how you used cross-validation or model debugging. I think we need a lot more commentary about these ‘dirty’ aspects of machine learning. And I wish I knew that a lot earlier 🙂

My friend Ian has some great remarks in ‘Data Science Delivered’, which is a great read for any professional (junior or senior) who builds machine learning models for a living. It is also a great read for recruiters hiring data scientists, or for managers interacting with data science teams who are looking for questions to ask, such as ‘how did you handle that dirty data?’

I wish all my friends, readers and colleagues (past, present and future) a wonderful holiday season.

Image Similarity Database…

Image similarity questions are very common in e-commerce and fashion, particularly when it comes to similar colours. I based the following on the excellent work by my friend Thomas Hunger. My implementation has only a few alterations compared to his, but I felt it was worth putting online even though I’m not claiming any originality.

There are many improvements that would be made in a real industrial setting, but I found this a good educational exercise, especially since I don’t have a lot of experience with image analysis and similarity problems of this kind.
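
Since the embedded code doesn’t survive here, below is a minimal sketch of one common colour-similarity baseline (comparing normalised RGB histograms). This is my own assumption of a simple approach, not necessarily the method from Thomas’s original post, and the file names are hypothetical:

```python
# A minimal colour-histogram similarity sketch (a simple baseline; not
# necessarily the approach from Thomas Hunger's original post).
import numpy as np
from PIL import Image

def colour_histogram(path, bins=8):
    """Flattened, normalised 3-D RGB histogram of an image."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()

def colour_similarity(path_a, path_b):
    """Cosine similarity between two images' colour histograms."""
    a, b = colour_histogram(path_a), colour_histogram(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. colour_similarity("red_dress.jpg", "maroon_dress.jpg")  # hypothetical files
```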

I hope you enjoy this too.

Hacking a Paris corpus from Inside Airbnb

There is an excellent resource called Inside Airbnb, which makes a number of Airbnb datasets available.

I hacked together a script to extract a corpus from the Paris listing descriptions, and then applied this code.

I’ve put the code and examples up on GitHub.

One problem with this example is that the scikit-learn version I was using ships no French stop words. It’s quite difficult to do text analytics across multiple languages 🙂
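
One workaround (a sketch, assuming you have NLTK installed with its stopwords corpus downloaded) is to borrow a French stop-word list from NLTK and pass it to the vectorizer. The descriptions list below is a hypothetical stand-in for the Paris corpus:

```python
# A sketch of borrowing French stop words from NLTK for scikit-learn,
# since scikit-learn only ships an English stop-word list.
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords", quiet=True)  # one-off download of the corpus

french_stop_words = stopwords.words("french")
vectorizer = CountVectorizer(stop_words=french_stop_words)

# Hypothetical stand-in for the Paris listing descriptions.
descriptions = [
    "Charmant appartement au coeur de Paris",
    "Studio lumineux et calme dans le Marais",
]
term_matrix = vectorizer.fit_transform(descriptions)
print(vectorizer.get_feature_names_out())
```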

I hope this forms a useful snippet.

It is getting easier and easier to do topic modelling and NLP like this in Python, which is excellent 🙂