Interview with a Data Scientist: Trent McConaghy

At PyData in Berlin I chaired a panel – one of the guests was Trent McConaghy and so I reached out to him, to hear his views about analytics. I liked his views on shipping it, and the challenges he’s run into in his own world.
1. What project have you worked on do you wish you could go back to, and do better?
Before I answer this I must say: I strongly prefer looking forward. There’s so much to build!
I’ve made many mistakes! One is having rose-colored glasses for criteria that ultimately mattered little. For example, for my first startup, I hired a professor who’d written 100+ papers, and textbooks. Sounds great, right? Well, he’d optimized his way of thinking for academia, but was not terribly effective on the novel ML problems in my startup. It was no fun for anyone. We had to let him go.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Do something that you are passionate about, and that matters to the future. It starts with asking interesting scientific questions, and ends (ideally) with results that make a meaningful impact on the world’s knowledge.
3. What do you wish you knew earlier about being a data scientist?
As an AI researcher and an engineer: one thing that I didn’t know, but served me well because I did it anyway, was voracious reading of the literature. IEEE Transactions for breakfast:) That foundation has served me well my whole career.
4. How do you respond when you hear the phrase ‘big data’?
Marketing alert!!
That said: I like how unreasonably effective large amounts of data can be. And that it’s shifted some of the focus away from algorithmic development on toy problems.
5. What is the most exciting thing about your field?
AI as a field has been around since the 50s. Some of the original aims of AI are still the most exciting! Getting computers to do tasks in superhuman fashions is amazing. These days it’s routine in narrow settings. When the world hits AI that can perform at the cognitive levels of humans or beyond, it changes everything. Wow! It’s my hope to help shepherd those changes in a way that is not catastrophic for humanity.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
I follow steps, along the lines of the following.
-write down goals, what question(s) I’m trying to answer. Give yourself a time limit.
-get benchmark data, and measure(s) of quality. Draw mockups of graphs I might plot.
-test against dumbest possible initial off-the-shelf algorithm and problem framing (including where I get the data)
-is it good enough compared to the goals? Great, stop! (Yes, linear regression will solve some problems:)
-try the next highest bang-for-the-buck algorithm & problem framing. Ideally, it’s off the shelf too. Benchmark / plot / etc. Repeat. Stop as soon as successful, or when time limit is hit.
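
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trent's own code: the toy data, the `target_rmse` threshold, the 60 second budget and the helper names (`mean_baseline`, `linear_fit`, `search`) are all hypothetical. It tries the dumbest model first and stops as soon as one candidate is good enough or the time limit is hit.

```python
import time
from statistics import mean


def mean_baseline(xs, ys):
    """Dumbest possible model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def linear_fit(xs, ys):
    """Ordinary least squares for a single feature (closed form)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def rmse(model, xs, ys):
    """Root mean squared error of a model on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys)) ** 0.5


def search(candidates, train, test, target_rmse, time_limit_s):
    """Try candidates cheapest-first; stop when good enough or out of time."""
    start = time.monotonic()
    best = None
    for name, fit in candidates:
        if time.monotonic() - start > time_limit_s:
            break  # time limit hit: ship the best result so far
        score = rmse(fit(*train), *test)
        if best is None or score < best[1]:
            best = (name, score)
        if score <= target_rmse:
            break  # good enough compared to the goal: stop
    return best


# Hypothetical toy benchmark: y = 2x + 1 with no noise.
train = ([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
test = ([5, 6], [11, 13])
candidates = [("mean baseline", mean_baseline), ("linear regression", linear_fit)]

best_name, best_rmse = search(candidates, train, test, target_rmse=0.5, time_limit_s=60)
print(best_name, round(best_rmse, 3))
```

On this toy data the mean baseline misses the target and the linear fit meets it, so the loop stops at the second candidate without trying anything fancier.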

Trent McConaghy has been doing AI and ML research since the mid 90s. He co-founded ascribe GmbH, which enables copyright protection via internet-scale ML and the blockchain. Before that, he co-founded Solido where he applied ML to circuit design; the majority of big semis now use Solido. Before that, he co-founded ADA also doing ML + circuits; it was acquired in 2004. Before that he did ML research at the Canadian Department of Defense. He has written two books and 50 papers+patents on ML. He co-organizes the Berlin ML meetup. He keynoted Data Science Day Berlin 2014, PyData Berlin 2015, and more. He holds a PhD in ML from KU Leuven, Belgium.

ascribe is hiring software engineers for work on machine learning, computer vision, web crawling, and decentralized computing. Some might call parts of that work data science. But we emphasize that we ship code:)

Interview with a Data Scientist: Alejandro Correa Bahnsen

I recently caught up with Alejandro who co-organizes the Luxembourg Data Science Meetup. We’re friends and we regularly talk about Data Science. Alejandro is returning to Colombia soon when he obtains his PhD.
We recently spoke at the same event in Berlin. I recommend his talk since I think he targeted bridging the ‘academic to industry’ divide, which a lot of us struggle with.
1. What project have you worked on do you wish you could go back to, and do better?
At the beginning of my PhD I spent about 12 months preprocessing credit card transactional data without any guidance. Even though I learned a lot, most of that time was me just trying different technologies to extract features (Octave, R, SQL, Python) without having real insights from the data until the very end. Now, if that kind of problem arises again, I know that there are interesting communities (PyData, RUsers, Stack Overflow, among others) that can quite easily help you with very good starting points.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
That one is easy. GET SOFT SKILLS!!!
I quite often find myself having highly technical discussions that are unrelated to the actual business realities. This leads to a focus on issues that may not be the most important for the customer/company. Also, extremely good hard skills (coding, statistics, software engineering) can only take you so far. In most cases, whether you’re working as a consultant or in a company, you’re going to be in a position in which you’re selling a data product to someone who doesn’t have any understanding of data science. That’s where the soft skills kick in. You must be able to clearly understand the customer’s needs, background and expectations. Most of the time, your customer will be happy with the results of a logistic regression, so all the time you spent tuning an SVM could have been used on other things.
3. What do you wish you knew earlier about being a data scientist?
To rely more on open-source software/platforms
4. How do you respond when you hear the phrase ‘big data’?
I hate that name. It has become a buzzword with no meaning whatsoever.
As was noted recently by @mrocklin, 90% of the databases are in the gigabyte territory, 9% in the terabyte and only 1% in the petabyte. So unless you’re in that last 1% you don’t really have to worry about using “big data” tools. Moreover, I think most of the struggle with larger datasets can be solved by making better use of traditional tools like SQL. I recently read this quote by @dbasch: “Many companies think they have a ‘big data’ problem when they really have a big ‘data problem’.” I don’t have more to say… 🙂
5. What is the most exciting thing about your field?
I would say the most interesting thing is to start seeing real commitment from industry leaders to actually get on board with data science.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
In general I try to have a good first prototype asap, typically using a standard model such as a logistic regression or random forest in the case of a classification problem. This helps to have a baseline. Afterwards it really depends on the particular problem. More often than you think, the result of a logistic regression is more than adequate for a given problem. I try to avoid spending too much time on feature selection, as it’s easy to lose a lot of time there.
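
As a rough illustration of the "baseline first" idea in this answer, here is a minimal sketch in plain Python. It is not Alejandro's code: the toy data and the helper names (`majority_baseline`, `fit_logistic`, `accuracy`) are hypothetical, and the hand-rolled gradient descent is only there to keep the example self-contained. In practice you would more likely reach for an off-the-shelf implementation such as scikit-learn's `LogisticRegression`.

```python
import math


def majority_baseline(ys):
    """Predict the most common training label for every input."""
    label = 1 if sum(ys) * 2 >= len(ys) else 0
    return lambda x: label


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(y = 1)
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return lambda x: 1 if w * x + b >= 0 else 0


def accuracy(model, xs, ys):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: negative feature values are class 0, positive are class 1.
train_x, train_y = [-3, -2, -1, 1, 2, 3], [0, 0, 0, 1, 1, 1]
test_x, test_y = [-2.5, -1.5, 1.5, 2.5], [0, 0, 1, 1]

baseline_acc = accuracy(majority_baseline(train_y), test_x, test_y)
logistic_acc = accuracy(fit_logistic(train_x, train_y), test_x, test_y)
print(baseline_acc, logistic_acc)
```

The point of the comparison is the gap between the two numbers: once a standard model clearly beats the trivial baseline and meets the customer's need, further tuning may not be worth the time.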
7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?
I think both. It really depends on the context. I have seen a lot of people using the re-branding to sell more, but other than that they keep business as usual.
8. You worked as an Analytics professional in Colombia, could you comment on the difference between Data Scientist and Analytics Professional.
In my experience analytics consists of doing the analysis/modeling/predictions, and data science complements that by providing more tools for data extraction and, finally, implementation of the different models. I think for doing analytics you can rely on statistical and data mining skills, whereas in data science you must complement those with skills from software engineering.
Bio: Alejandro Correa Bahnsen is currently working towards a Ph.D in Machine Learning at Luxembourg University. His research area relates to cost-sensitive classification and its application in a variety of real-world problems such as fraud detection, credit risk, direct marketing and churn modeling. He also works part time as a fraud data scientist at CETREL, a SIX Company, applying his research to detecting fraud. Before starting his PhD, he worked for five years as a data scientist at GE Money and Scotiabank, applying data mining models in a variety of areas from advertisement to financial risk management. He has written and published many academic and industrial papers in the best peer-reviewed publications. Moreover, Alejandro also has experience as an instructor of econometrics, financial risk management and machine learning.

Interview with a Data Scientist: Keith Bawden


Keith Bawden worked with me at Amazon, but not directly. Despite the fact we only had interactions over internal chat and a few emails he was a great influence on my thinking about Software Development and how to use statistics to solve problems in Industry. He has over 10 years experience in the Tech industry including working with Analytics folks and System Engineers at places like Amazon and Groupon. I consider him a fine example of what a business-focused technologist should be, and consider him a great guy at doing ‘tech leadership’ – which is something we all find hard to explain but we know it when we see it. I provide the interview with few edits.

Keith: I’m no data scientist nor am I an expert at anything in particular. But here are my answers.
1. What project have you worked on do you wish you could go back to, and do better?
Keith: There are none. Every project I have done I have found out something new. Going back and changing something may change what I have discovered. Not necessarily for the better. Warts and all I will keep my history as it is 🙂
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Keith: Be pragmatic where possible.
3. What do you wish you knew earlier about being a data scientist?
Keith: Still learning so I have no idea how to answer this.
4. How do you respond when you hear the phrase ‘big data’?
Keith: I try to understand what the speaker means when they say big data. The term is not so clear and therefore often needs clarification when used in conversation.
5. What is the most exciting thing about your field?
Keith: People.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Keith: Depends on the problem and the context. However, the definition of success should ideally be defined earlier rather than later. Then stop working and deliver what you have the instant you have hit your mark.
7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?
Keith: Again it depends on the context and the usage. Most of the time, IMHO, it is an area (or subset) of engineering that has been re-branded. However, I have seen one team that are staffed by some pretty smart stats people doing experiments and using the scientific method in the work place. However, even with this team I’m not sure they do this kind of work all of the time.

On the Cultural divide between Data Scientists and Managers


My friend Chris recently quoted me from Twitter on his Blog.

I don’t have much to add to this – other than that I feel it is a fair article, and it discusses my own challenges.

Ian Ozsvald has a good keynote on this – from PyCon Sweden where he talks about the need to add value everyday.

I believe he has a good talk coming up at PyData London about the challenges of shipping it, where I’ll also be speaking! I’ll be giving a talk on PyMC and PyMC3.

I recommend my Interviews with a Data Scientist page – for a discussion of this.

Interview with a Data Scientist: Andrew Clegg


As part of my regular Interview with a Data Scientist feature, I recently interviewed Andrew Clegg. Andrew is a really interesting and pragmatic Data Science professional and currently he’s doing some cool stuff at Etsy. You can visit his Twitter here and his most recent talk on his work at Etsy from Berlin Buzzwords.

  1. What project have you worked on do you wish you could go back to, and do better?

The one that most springs to mind was an analytics and visualization platform called Palomino that my team at Pearson built: a custom JS/HTML5 app on top of Elasticsearch, Hadoop and HBase, plus a bunch of other pipeline components, some open source and some in-house. It kind of worked, and we learnt a lot, but it was buggy, flaky at the scale we tried to push it to, and reliant on constant supervision. And it’s no longer in use, mostly for those reasons. It was pretty ambitious to begin with, but I got dazzled by shiny new toys and the lure of realtime intelligence, and brought in too many new bits of tech that there was no organisational support for. We discovered that distributed data stores and message queues are never as robust as they claim (c.f. Jepsen); that most people don’t really need realtime interactive analytics; and that supporting complex clustered applications (even internal ones) is really hard, especially in an organisation that doesn’t really have a devops culture. These days, I’d try very hard to find a solution using existing tools — Kibana for example looks much more mature and powerful than it did when we started out, and has a whole community and coherent ecosystem around it. And I’d definitely shoot for a much simpler architecture with fewer moving parts and unfamiliar components. Dan McKinley’s article Choose Boring Technology is very relevant here.

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I was asked this the other day by a recent PhD grad who was interested in a data science career, so I’ll pass on what I told him. I think there are broadly three kinds of work that take place under the general heading of “data scientist”, although, there are also plenty of exceptions to this. The first is about turning data into business insight, via statistical modelling, forecasting, predictive analytics, customer segmentation and clustering, survival analysis, churn prediction, visualization, online experiment design, and selection or design of meaningful metrics and KPIs. (Editor note: In the UK this used to be called an ‘Insight Analyst’ role, typical at retail firms or banks) The second is about developing data-driven products and features for the web, e.g. recommendation engines, trend detectors, anomaly detectors, search and ranking engines, ad placement algorithms, spam and abuse classifiers, content fingerprinting and similarity scoring, etc. The third is really a more modern take on what used to be called operational research, i.e. optimizing business processes algorithmically to reduce time or cost, or increase coverage or reported satisfaction. (Editor note: My own work at Amazon was very close to this, data analytics and operational research for supply chain operations. I imagine that a lot of e-commerce and logistics companies who employ Data Scientists do stuff like this too) In many companies these will be separate roles, and not all companies do all three. But you’ll also see roles that involve two or occasionally all three of these, in varying proportions. I guess a good start is to think about which appeals to you the most, and that will help guide you. Don’t get confused by the nomenclature: “data scientist” could mean any of those things, or something else entirely that’s been rebranded to look cool. And you could be doing any of those things and not be called a data scientist. Read the job specs closely and ask lots of questions.

  3. What do you wish you knew earlier about being a data scientist?

Well, I wish I’d taken double maths for A level, all those years ago! As it was, I took the single option, and chose the mechanics module over statistics, something that held me back ever since despite various post-graduate courses. There are certain things that are just harder to crowbar into an adult brain, if you don’t internalize the concepts early enough. I think languages and music are in that category too. (For our global readers: A-levels are the qualifications from the last two years of high school. You usually do three or four subjects, or at least you did in my day. You could do standard maths with mechanics or stats, or standard + further with both, which counted as two qualifications.) I had a similar experience with biology — I dropped it when I was 16 but ended up working in bioinformatics for several years. Statistics and biology are both subjects that are much more interesting than school makes them seem, and I wish I’d known that at the time.

  4. How do you respond when you hear the phrase ‘big data’?

Well, I used to react with anger and contempt, and have given some pretty opinionated talks on that subject before. It’s one of those things you can’t get away from in the enterprise IT world, but ironically, since I joined Etsy I’ve been numbed to the phrase by over-exposure… Just because the Github repo for our Scalding and Cascading code is called “BigData”. It’s a marketing term with very little information content — rather like “cloud”. But unlike “cloud” I actually think it’s actively misleading — it focuses attention on the size aspect, when most organisations have interesting and potentially valuable datasets that can fit on a laptop, or at least a medium-sized server. For that matter, a server with a terabyte of RAM isn’t much over $20K these days. “Big data” makes IT departments go all weak-kneed with delight or terror at the prospect of getting a Hadoop (or Spark) cluster, even though that’s often not the right fit at all. And as a noun phrase, it sucks, as it really doesn’t refer to anything. You can’t say “we solved this problem with big data” as big data isn’t really a thing with any consistent definition.

  5. What is the most exciting thing about your field?

That’s an interesting one. Deep learning is huge right now, but part of me still suspects it’s a passing fad, partly because I’m old enough to remember when plain-old neural networks were at the same stage of the hype cycle. Then they fell by the wayside for years. That said, the concrete improvements shown by convolutional nets on image recognition tasks are pretty impressive. Time will tell whether that feat can be replicated in other domains. Recent work on recurrent nets for modelling sequences (text, music, etc.) is interesting, and there’s been some fascinating work from Google (and their acquihires DeepMind) on learning to play video games or parse and execute code. These last two examples both combine deep learning with non-standard training methods (reinforcement learning and curriculum learning respectively), and my money’s on this being the direction that will really shake things up. But I’m a layman as far as this stuff goes. One problem with neural architectures is that they’re often black boxes, or at least pretty dark grey — hard to interpret or gain much insight from. There are still a lot of huge domains where this is a hard sell, education and healthcare being good examples. Maybe someone will invent a learning method with the transparency of decision trees but the power of deep nets, and win over those people in jobs where “just trust the machine” doesn’t work.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

It took me a long time to realise this, but short release cycles with small iterative improvements are the way to go. Any result that shows an improvement over your current baseline is a result — so even if you think there are much bigger wins to be had, get it into production, and test it on real data, while you work on its replacement. (Or if you’re in academia, get a quick workshop paper out while you work on its replacement!) This is also a great way to avoid overfitting, especially if you are in industry, or a service-driven academic field like bioinformatics. Instead of constantly bashing away at the error rate on a well-worn standard test set, get some new data from actual users (or cultures or sensors or whatever) and see if your model holds up in real life. And make sure you’re optimizing for the right thing — i.e. that your evaluation metrics really reflect the true cost of a misprediction. I worked in natural language processing for quite a while, and I’m sure that field was held back for a while by collective, cultural overfitting to the same-old datasets, like Penn Treebank section 23. There’s an old John Langford article about this and other non-obvious ways to overfit, which is always worth a re-read.
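
To make the point about evaluation metrics concrete, here is a small sketch in Python. Everything in it is an assumption for illustration: the labels, the two hypothetical models and the costs (a missed positive costing 100, a false alarm costing 5) are made up, and the helper names are not from any library. It shows how the model with the lower error rate can still be the more expensive one once mispredictions are weighted by what they really cost.

```python
def confusion_counts(y_true, y_pred):
    """Count false positives and false negatives for binary labels (1 = positive)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


def error_rate(y_true, y_pred):
    """Plain misclassification rate: every mistake counts the same."""
    fp, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / len(y_true)


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Cost-weighted evaluation: each kind of mistake has its own price."""
    fp, fn = confusion_counts(y_true, y_pred)
    return fp * cost_fp + fn * cost_fn


# Hypothetical labels: 2 positives (say, fraud cases) out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never raises an alarm
model_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # catches both, with 3 false alarms

# Assumed costs: a missed positive is 20x worse than a false alarm.
COST_FP, COST_FN = 5, 100

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, error_rate(y_true, pred), total_cost(y_true, pred, COST_FP, COST_FN))
```

Under these assumed costs, model A has the better error rate (0.2 against 0.3) but costs 200 against 15 for model B, so optimizing the error rate alone would pick the wrong model.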

Bio: Andrew joined Etsy in 2014, and lives in London, making him their first data scientist outside the USA. Prior to Etsy he spent almost 15 years designing machine learning workflows, and building search and analytics services, in academia, startups and enterprises, and in an ever-growing list of research areas including biomedical informatics, computational linguistics, social analytics, and educational gaming. He has a master’s degree in bioinformatics and a PhD in natural language processing, can count to over 1000 on his fingers, but doesn’t know how to drive a car.

Links and retrospective from PyData Berlin


I attended PyData Berlin last weekend. It was a blast – well done to the organizers.

Some interesting remarks and links from the PyData conference in Berlin.

  • Luigi was presented as a technological solution to the problem of data pipelines by Miguel Cabrera. I found this to be an interesting example of the usage of this technology for dealing with large amounts of data and various jobs for scraping data.
  • A keynote by Matthew Rocklin of Continuum Analytics was given. Matthew is an exceptionally smart computational engineer and he explained the architecture of the out-of-core data structures he was developing with Dask. Even if you’re not likely to use the technology it is a very interesting one.
    One key idea he had was the gigabyte level, terabyte level and petabyte level.
    He pointed out that Hadoop and Spark were probably only needed at the petabyte level – and that otherwise you just need a good workstation. I agree with this, and afterwards we spoke about it, and he said ‘We should still be using PostgreSQL for a lot of things with good indexing.’ I think the rise of SSDs is very important too – so often you don’t have a big data problem, you just need a bigger computer or workstation.
    I checked the price online of an AWS instance – 240 GB of RAM is $2.80 per hour
  • Ascribe – protecting IP by using the blockchain. Trent was a very interesting and engaging speaker.
  • What is data science – panel discussion – I chaired the panel at this event. This was fun but quite nerve-wracking 🙂
  • Overview from Felix Wick
  • FinTech discussions and risk analysis – these happened during coffee and beer and especially after the talk about CostCla by Alejandro Correa Bahnsen
  • Agriculture and Mittelstand – the opportunity for data science to be applied in industries outside of e-commerce and social network analysis.
  • Need for educated selling to management – ‘some of management are still not sold’. This was mentioned in the panel
  • Challenges credibility wise of ‘just analyzing data’ – panel
  • The need for good project management – some spoke of their failures with algorithm teams without good business direction. The need to manage expectations by sharing results. This reminds me of Ian Ozsvald’s talk in Stockholm – where he shares his years of experience and reminds young data scientists to share results.
  • Bokeh – interesting technology, I can’t wait to check it out. I found this tutorial a bit long, but it is really hard to give a tutorial to a massive room of attendees. So well done Christine 🙂
  • Python for growth hacking – Ignacio Elola
  • Alejandro Correa Bahnsen – Cost-sensitive machine learning – I found this a very interesting talk. Alejandro is a good friend of mine, but I think it covers one of the challenges of converting results from machine learning into actionable financial numbers. He has done a great job running the meetup in Luxembourg. I will miss him when he returns to Colombia.
  • Robert Obst of Pivotal gave an intriguing demo of the ‘connected car’ and I have no doubt that the ‘Internet of things’ will become a bigger and bigger thing for data analysts and for data scientists. It was interesting that he mentioned that there is a lack of interoperability from different standards in this area.
  • I gave a talk on Python used as a framework for Rugby Analysis. This got a lot of interesting questions afterwards about Probabilistic Programming and Rugby Analytics. Thanks to Matthew Rocklin for an interesting discussion of Computational problems and how cool Theano is 🙂
  • Those interested and attending the London PyData event, I’ll be giving a tutorial on PyMC, some PyMC3 and applications to Financial data and Rugby analysis in a few weeks. I’ll also discuss the differences between them and why you should use PyMC3.

People often wonder why I go to conferences. But the ideas and techniques discussed above are things I’d never have come across by myself. Not to mention the fascinating conversations with other members of the Data Engineering, Data Analysis and Software Engineering communities.