Working in a major trend – Machine Learning


I recently saw this in the latest Amazon shareholder letter:

“These big trends are not that hard to spot…We’re in the middle of an obvious one right now: machine learning & artificial intelligence” — Jeff Bezos

One of the hard parts about working professionally on these technologies is that I take them for granted. So consider this a post to reflect on the improvements in image processing, computer vision, translation, natural language processing, text understanding, forecasting, and risk analysis.

I’ve worked on some of these technologies and challenges, and I continue to work on extracting information from CVs and matching candidates to the best jobs for them at Elevate Direct. When you’re in the weeds you sometimes forget what you’re working on, and that you’re part of a major trend.

As Matt Turck says:

Big Data provides the pipes, and AI provides the smarts.

Building Full-Stack Vertical Data Products


I’ve been in the Data Science space for a number of years now. I first got interested in AI/Machine Learning in 2009, and I have a background typical of many people in my field – I come from Physics and Mathematics.

One trend I’ve run into, both at corporates and at startups, is that there are many challenges to deploying Data Science in a bureaucratic organisation – or, put another way, to delivering Enterprise Intelligence. Running into this problem led me to become interested in building data products.

One of the first people I saw building AI startups was Bradford Cross – and he’s been writing lately about his predictions for the Machine Learning startup space in 2017.

I agree with his precis that we’ll begin to see successful vertically-oriented AI startups solving full-stack industry problems that require subject matter expertise, unique data, and a product that uses AI to deliver its core value proposition.

At Elevate Direct we’re working on exactly this: the problem of sourcing and hiring contractors – one of the fundamental problems companies face, namely hiring the best contractor talent out there.

So what are some of the reasons it can be hard to deploy Data Science internally at a corporate organisation? I think a number of these patterns are related to anti-patterns we see in software more generally.

  1. Not being capable of building consumer-facing software – Large (non-tech) organisations sometimes struggle to build and deliver software internally. I’ve seen a number of organisations fail at this – their build process can take six months.
  2. Organisational anti-patterns – I’ve seen organisational structures that rapidly inhibit the ability to deploy product. Some of these anti-patterns are driven by concerns about the risk of deploying software, and they often end in diffuse ownership – where an R&D team can blame the operations team and vice versa.
  3. Building Data Products is risky – Building data products is hard and risky – I think you really need to approach data products in a lean-startup kind of way. Deploy often; if it works it works, if not, cut it. Sometimes the middle management of large corporates is risk-averse and so finds these kinds of projects scary. It also needs a lot of expertise – subject-matter expertise, software expertise, machine learning expertise.
  4. Not allowing talented technical practitioners to use Open Source or pick their own tools – I once worked at a FTSE 100 company where it took me about six weeks to get permission to install open-source tools such as R and Python. It severely restricted my productivity: in that time at a startup my team probably deployed about 1,000 changes to a customer-facing app in production. This reminds me of number 3 here. Don’t restrict the ability of your talented and well-trained people to deliver value – it makes no sense from a business point of view. Data Science produces value only when it produces products or insights for the business or its customers.
  5. Not having a Data Strategy – Data Science is most valuable when it aligns with the business strategy. Too often I’ve seen companies hiring data scientists before they have actual problems for them to work on. I’ve written about this before.
  6. Long-term outsourcing deals – This is an insidious one, and one that came from a period when “IT didn’t matter”, before big tech companies proved the value of, for example, e-commerce in the consumer space. It’s impossible to predict what the key tech of the next 10 years will be, so don’t lock yourself into a vendor for that long. Luckily this trend is reversing – we’re seeing the rise of agile, MVPs, cloud computing, design thinking, and getting closer to the customer. A great article on this re-shoring is here.

I think fundamentally a lot of these anti-patterns come from not knowing how to handle risk correctly. I like the idea in that RedMonk article that big outsourcing is a bit like CDOs in finance. Bundling the risk into one big lump doesn’t make the risk go away.

I learn this day after day working on building data products and tools at Elevate. Being honest about the risks and working hard to de-risk projects and drive down that risk in an agile way is the best we can do.

Finally, I think we’re just getting started building Data Products and deploying data science. It’ll be interesting to see what other anti-patterns emerge as we grow up as an industry. This is also one of the reasons I’ve joined a startup, and why I’m very excited to work on an end-to-end Data Product that solves a real business problem.

AI in the Enterprise (the problem)


I was recently chatting to a friend who works as a Data Science consultant in the London area, and a topic dear to my heart came up: how to successfully do ‘AI’ (or Data Science) in the enterprise. Now, I work for an Enterprise SaaS company in the recruitment space, so I’ve got a certain amount of professional interest in doing this successfully.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly it’s worth reflecting on the changes we’ve seen in consumer apps – Spotify, Google, Amazon, etc. – all of these apps offer personalised experiences that are enhanced by machine learning techniques and depend on the labelled data that consumers provide.

I’ll quote what Daniel Tunkelang (formerly of LinkedIn) said about the challenges of doing this in the enterprise:

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there’s little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there’s an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don’t reward it

I’ve personally seen this when working with enterprises and as a consultant. The data is often very noisy, and while there are techniques to overcome that, such as ‘distant supervision’, it does make things harder than, say, building ad-tech models in the consumer space or customer churn models, where the problem is more explicitly solvable by supervised techniques.
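To make the ‘distant supervision’ idea a bit more concrete, here is a minimal sketch (my own illustration, not a description of any production system): a keyword heuristic stands in for a curated knowledge base and produces noisy labels, and an ordinary supervised classifier is then trained on those weak labels as if they were ground truth. The documents, the PAYMENT_TERMS lexicon and the weak_label function are all made up for illustration.

```python
# Minimal sketch of distant supervision: a noisy labelling heuristic (here a
# keyword match against a tiny lexicon) generates weak labels, and a standard
# supervised classifier is then trained on them as if they were ground truth.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "Invoice overdue, please arrange payment",
    "Team offsite next week, agenda attached",
    "Payment failed for order 1234",
    "Minutes from yesterday's planning meeting",
]

# Hypothetical lexicon standing in for a curated knowledge base.
PAYMENT_TERMS = {"invoice", "payment", "overdue"}

def weak_label(text):
    # 1 = finance-related, 0 = not; wrong some of the time, which is the point.
    return int(any(term in text.lower() for term in PAYMENT_TERMS))

weak_labels = [weak_label(doc) for doc in documents]

# Train on the noisy labels exactly as if they were hand-annotated.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(documents, weak_labels)
print(model.predict(["Reminder: settle the outstanding invoice"]))
```

The appeal in an enterprise setting is that the heuristic can often be written from existing domain knowledge, even when nobody has ever labelled a record by hand; the price is label noise, which is exactly the difficulty described above.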

In my experience, and the experience of others, enterprises are much more likely to try to buy in off-the-shelf solutions, but (to be sweepingly general) they still don’t have the expertise to understand, validate or train the models. There are often individuals in small teams here and there who’ve self-taught or done some formal education, but they’re not supported. (My friend Martin Goodson highlights this here.) There needs to be a cultural shift. At a startup you might have a CTO who’s willing to trust a bunch of relatively young data scientists to try to figure out an ML-based solution that does something useful for the company without breaking anything. It’s also worth highlighting that there’s a difference in risk aversion between enterprises (with their established practices etc.) and the more exploratory, R&D mindset of a startup.

The somewhat more experienced of us these days tend to have a reasonable idea of what can be done, what’s feasible, and, furthermore, how to convince the CEO that it’s doing something useful for the company’s valuation.

Startups are far more willing to give things a go – there’s an existential threat. And don’t forget that Venture Capitalists and the assorted machinery often expect Artificial Intelligence, so this is encouraged.

Increasingly, I speculate that established companies now outsource their R&D to startups – hence recent acquisitions like the one by GE Digital.

So, roughly speaking, I see two solutions to this problem – two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking and approaches. I know of one large retailer who implemented this by running 13-week agile projects: they’d do a consultation, then choose one team to build a solution for.

2) Start putting staff through training schemes similar to what General Assembly offers (there are others), but do it whole teams at a time, so that the culture of code review and programmatic analysis comes back with them and gets implemented at work. Similarly, give the team managers additional training in agile project management, etc.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I’d like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald, Mick Delaney – and apologies to anyone I’ve forgotten.

 

A short email from Marvin Minsky – RIP


As a data scientist I regularly use results based upon the work of Marvin Minsky.

This is an email exchange I had with him about six years ago, when I was working in Education and deciding whether to go back to graduate school.

On Mon, Jun 21, 2010 at 10:53 AM, Peadar Coyle <peadarcoyle@googlemail.com> wrote:

Hi Marvin,
I shan’t bore you with how much of an inspiration and role model your work has been for me.
I’m a Mathematics and Physics Graduate student, with an interest in all sorts of problems.
I am particularly writing in regards your OLPC memos, I found them terribly interesting and important especially in regards the Linguistic desert in Mathematics.
Wow, thanks!  Especially, because I have not received many comments about those memos!
I’ve taught Maths in High Schools, and do find that the richness of the subject is destroyed. ‘The National Curriculum’ is held up as some sort of Biblical text and subsequently many students leave without a sense of what a researcher does, nor that Mathematics is a beautiful art form in itself.
Another aspect: although I had the privilege to attend outstanding schools (Fieldston to 8th grade, and then Bronx Science and Andover) — I don’t recall having had the idea (until college) that it was still possible to invent new mathematics.  (I did know there there still was progress in Physics, Chemistry and Biology — but didn’t have the clear idea that Mathematics was still Alive!)
 I used to be taunted as a teenager for wanting to use words like ‘non-linear’ or ‘negative feedback’. This can be discouraging even for ambitious students like myself. I feel that things haven’t got much better. Seymour Papert was correct that we teach quadratic formulas due to technological constraints. Frank Quinn (a topologist) has written a book (on his website) about mathematics education and computers. With demonstrations and Mathematica and visualizations, there is no reason that students can’t learn somethings about Dynamics, Moments of Inertia. Yes some of the integrals are terribly difficult – I even struggle with some of the algebra – but with facilities like Wolfram Alpha there one can learn to check ones work, and not be hindered by such algebraic manipulations.
I haven’t actually used it much, but it surely will be exciting to see what happens when it gets combined with systems (that don’t yet exist) which exploit large collections of common-sense knowledge.
  Gian Carlo Rota pointed out that it is not enough to be computer literate, one should be computer literate squared.
Did you know Mr Rota? I believe he was at MIT as well.
Yes, Rota was a long-time friend.
 Thanks again fro your comments!

I provide this without commentary, to just share how great it is that some of the most inspiring people in my world of Artificial Intelligence and Mathematics have responded to emails.
This link is to his Obituary.

Interview with a Data Scientist: Natalie Hockham

I was very happy to interview Natalie about her data science work – she gave a really cool machine-learning-focused talk at PyData London this year, which was full of insights into the challenges of doing machine learning with imbalanced data sets.
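Since her talk was about imbalanced data sets, here is a rough illustration of the kind of problem (my sketch, not a description of her approach or of GoCardless’s systems): with a roughly 2% positive rate, an unweighted classifier is tempted to predict the majority class everywhere, and one common first step is to reweight the rare class. All data and numbers below are synthetic.

```python
# Rough illustration of one common tactic for imbalanced classes: reweight the
# minority class so the classifier doesn't simply predict the majority label.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with a ~2% positive rate, loosely fraud-shaped.
X, y = make_classification(n_samples=20_000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))
```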
Natalie leads the data team at GoCardless, a London startup specialising in online direct debit. She cut her teeth as a PhD student working on biomedical control systems before moving into finance, and eventually fintech. She is particularly interested in signal processing and machine learning and is presently swotting up on data engineering concepts, some knowledge of which is a must in the field.

What project that you’ve worked on do you wish you could go back to and do better?

Before I joined a startup, I was working as an analyst on the trading floor of one of the oil majors. I spent a lot of time building out models to predict futures timespreads based on our understanding of oil stocks around the world, amongst other things. The output was a simple binary indication of whether the timespreads were reasonably priced, so that we could speculate accordingly. I learned a lot about time series regression during this time but worked exclusively with Excel and eViews. Given how much I’ve learned about open source languages, code optimisation, and process automation since working at GoCardless, I’d love to go back in time and persuade the old me to embrace these sooner.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Don’t underestimate the software engineers out there! These guys and girls have been coding away in their spare time for years and it’s with their help that your models are going to make it into production. Get familiar with OOP as quickly as you can and make it your mission to learn from the backend and platform engineers so that you can work more independently.

What do you wish you knew earlier about being a data scientist?

It’s not all machine learning. I meet with some really smart candidates every week who are trying to make their entrance into the world of data science, and machine learning is never far from the front of their minds. The truth is machine learning is only a small part of what we do. When we do undertake projects that involve machine learning, we do so because they are beneficial to the company, not just because we have a personal interest in them. There is so much other work that needs to be done, including statistical inference, data visualization, and API integrations. And all this fundamentally requires spending vast amounts of time cleaning data.


How do you respond when you hear the phrase ‘big data’?

I haven’t had much experience with ‘big data’ yet but it seems to have superseded ‘machine learning’ on the hype scale. It definitely sounds like an exciting field – we’re just some way off going down this route at GoCardless.

What is the most exciting thing about your field?
Working in data is a great way to learn about all aspects of a business, and the lack of engineering resource that characterizes most startups means that you are constantly developing your own skill set. Given how quickly the field is progressing, I can’t see myself reaching saturation in terms of what I can learn for a long time yet. That makes me really happy.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Our 3 co-founders all started out as management consultants and the importance of accurately defining a problem from the outset has been drilled into us. Prioritisation is key – we mainly undertake projects that will generate measurable benefits right now. Before we start a project, we check that the problem actually exists (you’d be surprised how many times we’ve avoided starting down the wrong path because someone has given us incorrect information). We then speak to the relevant stakeholders and try to get as much context as possible, agreeing a (usually quantitative) target to work towards. It’s usually easy enough to communicate to people what their expectations should be. Then the scoping starts within the data team and the build begins. It’s important to recognise that things may change over the course of a project so keeping everyone informed is essential. Our system isn’t perfect yet but we’re improving all the time.

How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
Luckily, our management team is very receptive to data in general. Our data team naturally seeks out opportunities to meet with other data professionals to validate the work we’re doing. We try hard to make our work as transparent as possible to the rest of the company by giving talks and making our data widely available, which helps to instill trust. Minor clashes are inevitable every now and then, which can put projects on hold, but we often come back to them later when there is a more compelling reason to continue.

What is the most exciting thing you’ve been working on lately? And tell us a bit about GoCardless.
We’ve recently overhauled our fraud detection system, which meant working very closely with the backend engineers for a prolonged period of time – that was a lot of fun.
GoCardless is an online direct debit provider, founded in 2011. Since then, we’ve grown to 60+ employees, with a data team of 3. Our data is by no means ‘big’ but it can be complex and derives from a variety of sources. We’re currently looking to expand our team with the addition of a data engineer, who will help to bridge the gap between data and platform.

What is the biggest challenge of leading a data science team?

The biggest challenge has been making sure that everyone is working on something they find interesting most of the time. To avoid losing great people, you need to make sure they are developing all the time. Sometimes this means bringing forward projects to provide interest and raise morale. Moreover, there are so many developments in the field that it’s hard to keep track, but attending meetups and interacting with other professionals means that we are always seeking out opportunities to put into practice the new things we have learned.

Interview with a Data Scientist: Thomas Wiecki


I interviewed Thomas Wiecki recently – Thomas is Data Science Lead at Quantopian Inc., a crowd-sourced hedge fund and algo-trading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year – which I found so fascinating that I decided to learn some PyMC3 🙂

1. What project that you’ve worked on do you wish you could go back to and do better?
While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon (https://code.google.com/p/pynopticon/wiki/Introduction), even though it never gained any traction. I spent a lot of time developing a streamed data piping mechanism that was pretty general and flexible, in anticipation of large data sets. In retrospect, though, it was overkill, and I should have spent less time coming up with the best solution and instead spent time improving usability! Resources are limited and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Spend time learning the basics. This will make more advanced concepts much easier to understand, as they’re merely extensions of core principles and integrate much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive in deep. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So if possible, I would advise people to study these things while still in school as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and have a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s Doing Bayesian Analysis, for fundamentals “The Elements of Statistical Learning” is a classic).
3. What do you wish you knew earlier about being a data scientist?
How important non-technical skills are. Communication is key, but so are understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that you are communicating to an expert audience. This certainly will not be the case in industry, where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.
As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.
4. How do you respond when you hear the phrase ‘big data’?
As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.
Then there’s the more technical interpretation where it means that data increases in size and some data sets no longer fit into RAM. I’m still undecided about whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like Hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of the time, as a data scientist, I think you can get by by sub-sampling the data (Andreas Müller has a great talk on how to do ML on large data sets: https://www.youtube.com/watch?v=l43VIw5xhTg).
Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. However, with “big data”, the advanced inference algorithms fail to scale so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt: https://www.youtube.com/watch?v=pHsuIaPbNbY
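To make the sub-sampling point above concrete, here is a toy sketch of my own (the data is synthetic and the numbers purely illustrative): for many models, training on a small random fraction of a large data set gets you most of the accuracy of training on all of it, at a fraction of the cost.

```python
# Toy sketch: compare test accuracy when training on 1%, 10% and 100% of a
# larger (synthetic) training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for fraction in (0.01, 0.1, 1.0):
    n = int(len(X_train) * fraction)
    idx = np.random.RandomState(0).choice(len(X_train), size=n, replace=False)
    clf = SGDClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    print(f"{fraction:>4.0%} of the data -> test accuracy {clf.score(X_test, y_test):.3f}")
```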
5. What is the most exciting thing about your field?
The fast pace at which the field is moving. It seems like every week another cool tool is announced. Personally I’m very excited about the Blaze ecosystem, including dask, which takes a very elegant approach to distributed analytics by relying on existing functionality in well-established packages like pandas instead of trying to reinvent the wheel. Data visualization is also coming along nicely, where the current frontier seems to be interactive web-based plots and dashboards, as worked on by bokeh, plotly and pyxley.
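As a small sketch of what that pandas-flavoured approach looks like (my example, with placeholder file and column names), dask lets you write the same groupby you would write in pandas, builds a lazy task graph across many partitions, and only materialises the result when you ask for it:

```python
# Sketch of dask's pandas-like API: lazy, partitioned dataframes with the
# familiar groupby/aggregate idiom. "events-*.csv", "user_id" and "amount"
# are placeholder names.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")             # lazily reads all matching partitions
summary = df.groupby("user_id")["amount"].mean()
print(summary.compute())                     # triggers the actual computation
```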
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
I try to keep the loop between data analysis and communication with consumers very tight. This also extends to any software to perform certain analyses, which I try to place into the hands of others even if it’s not perfect yet. That way there is little chance of veering too far off track, and there is a clearer sense of how usable something is. I suppose it’s borrowing from the agile approach and applying it to data science.

Interview with a Data Scientist (Hadley Wickham)


I recently interviewed Hadley Wickham, the creator of ggplot2 and a famous R stats person. He works for RStudio, and his job is to work on Open Source software aimed at Data Geeks. Hadley is famous for his contributions to Data Science tooling, which has inspired a lot of other languages! I include the interview with some light edits.

1. What project that you’ve worked on do you wish you could go back to and do better? ggplot2! ggplot2 is such a heavily used package, and I knew so little about R programming when I wrote it. It works (mostly) and I love it, but the internals are pretty hairy. Even I don’t understand how a lot of it works these days. Fortunately, I’ve had the opportunity to spend some more time on it lately, as I’ve been working on the 2nd edition of the ggplot2 book.

One thing that I’m particularly excited about is adding an official extension mechanism, so that others can extend ggplot2 by creating their own geoms, stats etc. Some people have figured out how to do that already, but it’s not documented and the current system causes hassles with CRAN. Winston and I have been working on this fairly heavily for the last couple of weeks, and it’s also been good for us. Thinking about what other people will need to do to write a new geom has forced us to think carefully about how geoms should work, and has resulted in a lot of internal tidying up. That makes life more pleasant for us!  

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences? My general advice is that it’s better to be really good at one thing than pretty good at a few things. For example, I think you’re better off spending your time mastering either R or Python, rather than developing a middling understanding of both. In the short term, that will cost you some time (because there are some things that are much easier to do in R than in Python and vice versa), but in the long term you end up with a more powerful toolset that few others can match.

That said, you also need to have some surface knowledge of many other areas. You need to accept that there are many subjects that you could master if you put the time into them, but you don’t have the time. For example, if you’re interested in data, I think it’s really important that you have a working knowledge of SQL. If you have any interest in working in industry, the chances are that your data will live in a database that speaks SQL, and if you don’t know the basics you won’t be useful.

Similarly, if you work a lot with text, you should learn regular expressions; and if you work with html or xml, you should learn xpath.

3. What do you wish you knew earlier about being a data scientist? To be honest, I’m not sure that I am a data scientist. I spend most of my time making tools for data scientists to use, rather than actually doing data science myself.

Obviously I do some data analysis, but it’s mostly exploratory and for fun. I do worry that I am out of the loop when it comes to users’ needs in Data Science.

4. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? Again, I’m not sure that I’m a particularly skilled data analyst, but I do do a little. Mostly it’s on personal data, or data about R or RStudio that’s of interest to me. I’m also on the lookout for interesting datasets to use for teaching and in documentation and books (which is one of the reasons I make data packages).

A couple of interesting datasets that I’ve been playing around with lately are on liquor sales in Iowa and parking violations in Philadelphia. These datasets are particularly interesting to me because they’re almost 1 GB. I want to make sure that my tools scale to handle this volume of data on a commodity laptop.

Mostly I’m done with them once I get bored, which makes me a poor role model for practicing data scientists!

5. How do you respond when you hear the phrase ‘big data’? Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

I think there are two particularly important transition points:

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of RAM, so even small data is big! Moving from in-memory to on-disk is an important transition because access speeds are so different. You can do quite naive computations on in-memory data and it’ll be fast enough. You need to plan (and index) much more with on-disk data.

* From one computer to many computers. The next important threshold occurs when your data no longer fits on one disk on one computer. Moving to a distributed environment makes computation much more challenging because you don’t have all the data needed for a computation in one place. Designing distributed algorithms is much harder, and you’re fundamentally limited by the way the data is split up between computers.

I personally believe it’s impossible for one system to span from in-memory to on-disk to distributed. R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredibly slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary. Inventing numbers on the spot, I’d say 90% of big data problems fall into this category. To solve this problem you need a distributed database (like hive, impala, teradata etc), and a tool like dplyr to let you rapidly iterate to the right small dataset (which still might be gigabytes in size).

2. Big data problems that are actually lots and lots of small data problems, e.g. you need to fit one model per individual for thousands of individuals. I’d say ~9% of big data problems fall into this category. This sort of problem is known as a trivially parallelisable problem and you need some way to distribute computation over multiple machines. The foreach package is a nice solution to this problem because it abstracts away the backend, allowing you to focus on the computation, not the details of distributing it.

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you’re fitting a complex model. An example of this type of problem is recommender systems, which really do benefit from lots of data because they need to recognise interactions that occur only rarely. These problems tend to be solved by dedicated systems specifically designed to solve a particular problem.
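To make the second class above concrete – lots and lots of independent small fits – here is a minimal Python analogue of the pattern Hadley describes with R’s foreach package (my sketch, not his: joblib stands in for foreach, and the data and column names are invented):

```python
# Trivially parallelisable "big data": fit one small model per individual and
# farm the independent fits out across cores. Everything here is synthetic.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "individual": rng.integers(0, 100, size=10_000),
    "x": rng.normal(size=10_000),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=len(df))

def fit_one(group):
    model = LinearRegression().fit(group[["x"]], group["y"])
    return float(model.coef_[0])

# Each individual's fit is independent, so the loop distributes trivially.
slopes = Parallel(n_jobs=-1)(
    delayed(fit_one)(group) for _, group in df.groupby("individual")
)
print(f"fitted {len(slopes)} models, mean slope {np.mean(slopes):.2f}")
```

Because each fit is independent, no coordination is needed between workers, which is exactly what makes this class of problem so much easier than the genuinely irretrievably big ones.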