Trent McConaghy has been doing AI and ML research since the mid-90s. He co-founded ascribe GmbH, which enables copyright protection via internet-scale ML and the blockchain. Before that, he co-founded Solido, where he applied ML to circuit design; the majority of big semiconductor firms now use Solido. Before that, he co-founded ADA, also doing ML + circuits; it was acquired in 2004. Before that he did ML research at the Canadian Department of Defense. He has written two books and 50 papers and patents on ML. He co-organizes the Berlin ML meetup. He keynoted Data Science Day Berlin 2014, PyData Berlin 2015, and more. He holds a PhD in ML from KU Leuven, Belgium.
Keith Bawden worked with me at Amazon, though not directly. Despite the fact that we only interacted over internal chat and a few emails, he was a great influence on my thinking about software development and how to use statistics to solve problems in industry. He has over 10 years' experience in the tech industry, including working with analytics folks and systems engineers at places like Amazon and Groupon. I consider him a fine example of what a business-focused technologist should be, and he is great at 'tech leadership' – which is something we all find hard to explain, but we know it when we see it. I provide the interview with few edits.
My friend Chris recently quoted me from Twitter on his Blog.
I don’t have much to add to this – other than that I feel the article is fair and correct, and it discusses my own challenges.
Ian Ozsvald has a good keynote on this from PyCon Sweden, where he talks about the need to add value every day.
I recommend my Interviews with a Data Scientist page – for a discussion of this.
As part of my regular Interview with a Data Scientist feature, I recently interviewed Andrew Clegg. Andrew is a really interesting and pragmatic data science professional, currently doing some cool stuff at Etsy. You can visit his Twitter here and watch his most recent talk on his work at Etsy from Berlin Buzzwords.
- What project have you worked on that you wish you could go back to and do better?
The one that most springs to mind was an analytics and visualization platform called Palomino that my team at Pearson built: a custom JS/HTML5 app on top of Elasticsearch, Hadoop and HBase, plus a bunch of other pipeline components, some open source and some in-house. It kind of worked, and we learnt a lot, but it was buggy, flaky at the scale we tried to push it to, and reliant on constant supervision. And it’s no longer in use, mostly for those reasons. It was pretty ambitious to begin with, but I got dazzled by shiny new toys and the lure of realtime intelligence, and brought in too many new bits of tech that there was no organisational support for. We discovered that distributed data stores and message queues are never as robust as they claim (c.f. Jepsen); that most people don’t really need realtime interactive analytics; and that supporting complex clustered applications (even internal ones) is really hard, especially in an organisation that doesn’t really have a devops culture. These days, I’d try very hard to find a solution using existing tools — Kibana for example looks much more mature and powerful than it did when we started out, and has a whole community and coherent ecosystem around it. And I’d definitely shoot for a much simpler architecture with fewer moving parts and unfamiliar components. Dan McKinley’s article Choose Boring Technology is very relevant here.
- What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?
I was asked this the other day by a recent PhD grad who was interested in a data science career, so I’ll pass on what I told him. I think there are broadly three kinds of work that take place under the general heading of “data scientist”, although there are also plenty of exceptions to this. The first is about turning data into business insight, via statistical modelling, forecasting, predictive analytics, customer segmentation and clustering, survival analysis, churn prediction, visualization, online experiment design, and selection or design of meaningful metrics and KPIs. (Editor’s note: In the UK this used to be called an ‘Insight Analyst’ role, typical at retail firms or banks.) The second is about developing data-driven products and features for the web, e.g. recommendation engines, trend detectors, anomaly detectors, search and ranking engines, ad placement algorithms, spam and abuse classifiers, content fingerprinting and similarity scoring, etc. The third is really a more modern take on what used to be called operational research, i.e. optimizing business processes algorithmically to reduce time or cost, or increase coverage or reported satisfaction. (Editor’s note: My own work at Amazon was very close to this, data analytics and operational research for supply chain operations. I imagine that a lot of e-commerce and logistics companies who employ data scientists do stuff like this too.) In many companies these will be separate roles, and not all companies do all three. But you’ll also see roles that involve two or occasionally all three of these, in varying proportions. I guess a good start is to think about which appeals to you the most, and that will help guide you. Don’t get confused by the nomenclature: “data scientist” could mean any of those things, or something else entirely that’s been rebranded to look cool. And you could be doing any of those things and not be called a data scientist. Read the job specs closely and ask lots of questions.
- What do you wish you knew earlier about being a data scientist?
Well, I wish I’d taken double maths for A-level, all those years ago! As it was, I took the single option, and chose the mechanics module over statistics, something that has held me back ever since despite various post-graduate courses. There are certain things that are just harder to crowbar into an adult brain, if you don’t internalize the concepts early enough. I think languages and music are in that category too. (For our global readers: A-levels are the qualifications from the last two years of high school. You usually do three or four subjects, or at least you did in my day. You could do standard maths with mechanics or stats, or standard + further with both, which counted as two qualifications.) I had a similar experience with biology — I dropped it when I was 16 but ended up working in bioinformatics for several years. Statistics and biology are both subjects that are much more interesting than school makes them seem, and I wish I’d known that at the time.
- How do you respond when you hear the phrase ‘big data’?
Well, I used to react with anger and contempt, and have given some pretty opinionated talks on that subject before. It’s one of those things you can’t get away from in the enterprise IT world, but ironically, since I joined Etsy I’ve been numbed to the phrase by over-exposure – not least because the GitHub repo for our Scalding and Cascading code is called “BigData”. It’s a marketing term with very little information content — rather like “cloud”. But unlike “cloud” I actually think it’s actively misleading — it focuses attention on the size aspect, when most organisations have interesting and potentially valuable datasets that can fit on a laptop, or at least a medium-sized server. For that matter, a server with a terabyte of RAM isn’t much over $20K these days. “Big data” makes IT departments go all weak-kneed with delight or terror at the prospect of getting a Hadoop (or Spark) cluster, even though that’s often not the right fit at all. And as a noun phrase, it sucks, as it really doesn’t refer to anything. You can’t say “we solved this problem with big data” as big data isn’t really a thing with any consistent definition.
- What is the most exciting thing about your field?
That’s an interesting one. Deep learning is huge right now, but part of me still suspects it’s a passing fad, partly because I’m old enough to remember when plain-old neural networks were at the same stage of the hype cycle. Then they fell by the wayside for years. That said, the concrete improvements shown by convolutional nets on image recognition tasks are pretty impressive. Time will tell whether that feat can be replicated in other domains. Recent work on recurrent nets for modelling sequences (text, music, etc.) is interesting, and there’s been some fascinating work from Google (and their acquihires DeepMind) on learning to play video games or parse and execute code. These last two examples both combine deep learning with non-standard training methods (reinforcement learning and curriculum learning respectively), and my money’s on this being the direction that will really shake things up. But I’m a layman as far as this stuff goes. One problem with neural architectures is that they’re often black boxes, or at least pretty dark grey — hard to interpret or gain much insight from. There are still a lot of huge domains where this is a hard sell, education and healthcare being good examples. Maybe someone will invent a learning method with the transparency of decision trees but the power of deep nets, and win over those people in jobs where “just trust the machine” doesn’t work.
- How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
It took me a long time to realise this, but short release cycles with small iterative improvements are the way to go. Any result that shows an improvement over your current baseline is a result — so even if you think there are much bigger wins to be had, get it into production, and test it on real data, while you work on its replacement. (Or if you’re in academia, get a quick workshop paper out while you work on its replacement!) This is also a great way to avoid overfitting, especially if you are in industry, or a service-driven academic field like bioinformatics. Instead of constantly bashing away at the error rate on a well-worn standard test set, get some new data from actual users (or cultures or sensors or whatever) and see if your model holds up in real life. And make sure you’re optimizing for the right thing — i.e. that your evaluation metrics really reflect the true cost of a misprediction. I worked in natural language processing for quite a while, and I’m sure that field was held back for a while by collective, cultural overfitting to the same-old datasets, like Penn Treebank section 23. There’s an old John Langford article about this and other non-obvious ways to overfit, which is always worth a re-read.
Bio: Andrew joined Etsy in 2014 and lives in London, making him their first data scientist outside the USA. Prior to Etsy he spent almost 15 years designing machine learning workflows, and building search and analytics services, in academia, startups and enterprises, and in an ever-growing list of research areas including biomedical informatics, computational linguistics, social analytics, and educational gaming. He has a master's degree in bioinformatics and a PhD in natural language processing, can count to over 1000 on his fingers, but doesn't know how to drive a car.
I attended PyData Berlin last weekend. It was a blast – well done to the organizers.
Here are some interesting remarks and links from the PyData conference in Berlin:
- Miguel Cabrera presented Luigi as a technological solution to the problem of building data pipelines. I found this an interesting example of using the technology to handle large amounts of data and to coordinate various scraping jobs; a minimal sketch of what a Luigi pipeline looks like follows.
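To make the pipeline idea concrete, here is a minimal sketch of how a Luigi task chain can look. The task names, file paths and the trivial 'scrape and count' logic are my own illustrative assumptions, not Miguel's actual code.

```python
import luigi


class DownloadPages(luigi.Task):
    """Hypothetical first stage: fetch raw pages for a given date."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget('raw_{}.txt'.format(self.date))

    def run(self):
        with self.output().open('w') as f:
            f.write('raw scraped content\n')  # placeholder for real scraping code


class CountLines(luigi.Task):
    """Hypothetical second stage: a trivial aggregation over the raw data."""
    date = luigi.DateParameter()

    def requires(self):
        return DownloadPages(date=self.date)

    def output(self):
        return luigi.LocalTarget('counts_{}.txt'.format(self.date))

    def run(self):
        with self.input().open('r') as f:
            n = sum(1 for _ in f)
        with self.output().open('w') as f:
            f.write(str(n))


if __name__ == '__main__':
    # e.g. python pipeline.py CountLines --date 2015-05-30 --local-scheduler
    luigi.run()
```

Luigi works out which tasks still need to run from the requires/output declarations, which is exactly what makes it attractive for scraping jobs that fail and need re-running.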
- Matthew Rocklin of Continuum Analytics gave a keynote. Matthew is an exceptionally smart computational engineer, and he explained the architecture of the out-of-core data structures he is developing with Dask. Even if you're unlikely to use the technology yourself, it is a very interesting one.
One key idea he presented was to think about data problems in terms of scale: the gigabyte level, the terabyte level and the petabyte level.
He pointed out that Hadoop and Spark are probably only needed at the petabyte level – otherwise you just need a good workstation. I agree with this, and we spoke about it afterwards; he said 'We should still be using PostgreSQL for a lot of things, with good indexing. I think the rise of SSDs is very important too – so often you don't have a big data problem, you just need a bigger computer or workstation.'
I checked AWS instance prices online – an instance with 240 GB of RAM is about $2.80 per hour.
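To give a flavour of the out-of-core approach, here is a minimal Dask sketch. The file pattern and column names are invented for the example; only the API calls are real.

```python
import dask.dataframe as dd

# Read a collection of CSVs that together may not fit in RAM;
# dask builds a lazy task graph rather than loading everything at once.
df = dd.read_csv('events-*.csv')  # hypothetical files with 'user_id' and 'spend' columns

# Familiar pandas-style operations, still evaluated lazily...
spend_per_user = df.groupby('user_id').spend.sum()

# ...and only computed, chunk by chunk, when you ask for the result.
result = spend_per_user.compute()
print(result.head())
```

The point Matthew made is that for gigabyte- and even terabyte-scale data, this kind of single-machine approach is often all you need.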
- Ascribe – protecting IP by using the blockchain. Trent was a very interesting and engaging speaker.
- What is data science – panel discussion – I chaired the panel at this event. This was fun but quite nerve-wracking 🙂
- Overview from Felix Wick
- FinTech discussions and risk analysis – these happened over coffee and beer, and especially after the talk about CostCla by Alejandro Correa Bahnsen.
- Agriculture and Mittelstand – the opportunity for data science to be applied in industries outside of e-commerce and social network analysis.
- Need for educated selling to management – 'some of management are still not sold'. This was mentioned in the panel.
- The credibility challenges of 'just analyzing data' – also raised in the panel.
- The need for good project management – some spoke of their failures with algorithm teams that lacked good business direction, and of the need to manage expectations by sharing results. This reminds me of Ian Ozsvald's talk in Stockholm – http://ianozsvald.com/2015/05/13/data-science-deployed-opening-keynote-for-pyconse-2015/ – where he shares his years of experience and reminds young data scientists to share results.
- Bokeh – an interesting technology; I can't wait to check it out properly (a minimal example is sketched below). I found the tutorial a bit long, but it is really hard to give a tutorial to a massive room of attendees, so well done Christine 🙂
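For anyone else who hasn't tried Bokeh yet, a minimal example looks something like this (the data is made up):

```python
from bokeh.plotting import figure, output_file, show

# Some made-up data to plot.
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file('lines.html')  # write the plot to a standalone HTML file

p = figure(title='Simple line example', x_axis_label='x', y_axis_label='y')
p.line(x, y, line_width=2)

show(p)  # open the interactive plot in a browser
```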
- Python for growth hacking – Ignacio Elola
- Alejandro Correa Bahnsen – Cost-sensitive machine learning – I found this a very interesting talk; it covers one of the challenges of converting results from machine learning into actionable financial numbers (an illustrative sketch is below). Alejandro is a good friend of mine and has done a great job running the meetup in Luxembourg. I will miss him when he returns to Colombia.
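To give a flavour of what 'actionable financial numbers' means here, the sketch below scores predictions with a simple example-dependent cost model instead of plain accuracy. The fraud scenario and cost figures are invented for illustration; this is not Alejandro's CostCla code.

```python
import numpy as np

# Hypothetical fraud-detection costs (in euros) per transaction:
# a missed fraud costs the transaction amount, a false alarm costs a fixed review fee.
y_true = np.array([0, 1, 0, 1, 0])            # 1 = fraud
y_pred = np.array([0, 1, 1, 0, 0])            # the model's decisions
amounts = np.array([20., 500., 35., 250., 80.])
review_fee = 5.0

fn_cost = amounts * ((y_true == 1) & (y_pred == 0))      # missed fraud: lose the amount
fp_cost = review_fee * ((y_true == 0) & (y_pred == 1))   # false alarm: pay the review fee
total_cost = fn_cost.sum() + fp_cost.sum()

# Compare against the do-nothing baseline of predicting 'not fraud' for everyone.
baseline_cost = amounts[y_true == 1].sum()
savings = 1 - total_cost / baseline_cost

print('Total cost: %.2f, savings vs baseline: %.1f%%' % (total_cost, 100 * savings))
```

A model with a worse raw error rate can easily have better savings, which is the whole argument for cost-sensitive evaluation.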
- Robert Obst of Pivotal gave an intriguing demo of the 'connected car', and I have no doubt that the 'Internet of Things' will become a bigger and bigger thing for data analysts and data scientists. It was interesting that he mentioned the lack of interoperability between different standards in this area.
- I gave a talk on Python used as a framework for Rugby Analysis. This got a lot of interesting questions afterwards about Probabilistic Programming and Rugby Analytics. Thanks to Matthew Rocklin for an interesting discussion of Computational problems and how cool Theano is 🙂
- For those attending the London PyData event: in a few weeks I'll be giving a tutorial on PyMC, some PyMC3, and applications to financial data and rugby analysis. I'll also discuss the differences between them and why you should use PyMC3. A minimal PyMC3 example is sketched below.
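As a small taster for that tutorial, here is a minimal PyMC3 model. It is just an illustrative coin-flip example with simulated data, not the rugby model itself.

```python
import numpy as np
import pymc3 as pm

# Simulated data: 100 flips of a coin with unknown bias.
np.random.seed(42)
data = np.random.binomial(1, 0.6, size=100)

with pm.Model() as model:
    # Prior on the bias of the coin.
    p = pm.Beta('p', alpha=1, beta=1)
    # Likelihood of the observed flips.
    obs = pm.Bernoulli('obs', p=p, observed=data)
    # Draw posterior samples with the default sampler.
    trace = pm.sample(1000)

print('Posterior mean of p: %.3f' % trace['p'].mean())
```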
People often wonder why I go to conferences, but the collection of ideas and techniques discussed above are things I would never have come across on my own – not to mention the fascinating conversations with other members of the data engineering, data analysis and software engineering communities.