I’ve worked on Data Science projects and delivered Machine Learning models, both as production code and as more research-oriented work, at a few companies now. Some of these companies were around the Seed or Series A stage and some are established companies listed on stock exchanges. The aim of this article is simply to share what I’ve learned; I don’t think I know everything. I think my audience consists of both managers and technical specialists who’ve just started working in the corporate world, perhaps after some years in Academia or in a Startup. My aim is to articulate some of the problems, propose some solutions and highlight the importance of culture in enabling data science.
Over the years, as a practitioner, I’ve been reflecting on why some of this ‘big data’ stuff is hard to do. The take I’ll present in this article is similar to some other commentary on the internet, so it won’t be unusual.
My views are inspired by http://mattturck.com/2016/02/01/big-data-landscape/ (an article by Matt Turck), in which he says:
Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes. You need to capture data, store data, clean data, query data, analyse data, visualise data. Some of this will be done by products, and some of it will be done by humans. Everything needs to be integrated seamlessly. Ultimately, for all of this to work, the entire company, starting from senior management, needs to commit to building a data-driven culture, where Big Data is not “a” thing, but “the” thing.
Often, when talking about our nascent profession with friends working in other companies, we end up speaking about ‘change management’. Change is very hard, particularly for established and non-digital native companies: companies who don’t produce e-commerce websites, social networks or search engines. These companies often have legacy infrastructure and don’t necessarily have technical product managers or technical cultures. Also, for them traditional Business Intelligence systems work quite well: reporting is done correctly, and it’s hard to make a case for machine learning in risk-averse environments like that.
Why is delivering Data Science in the Enterprise hard?
There are, roughly speaking, 5 challenges (I’ve encountered these myself, and others I speak to in other organisations say the same, so this is anecdotal).
I’ve taken this list from a friend, Enda, who works in retail as a Head of Data Science. What are some of the problems that make delivering Data Science hard in an Enterprise (i.e. not a pre-IPO startup, but a company which has been established for, say, more than 10 years)?
- Org structure and the customer: you need to demonstrate value fast. This can be harder with non-Agile ways of working, and harder when the organisation is not ready to execute based on data science. It is even harder in organisations that treat data science as a support function.
- Enabling the team: closely related to this, the team might not be able to deploy models, or might not have access to the right data sources. Democratising data access in certain political organisations can be a tough sell; sometimes the reasons are valid but often they aren’t. IT, infosec and architecture in some organisations have a tradition of control, are generally slow and are often incentivised to maintain the status quo. We shouldn’t underestimate how disruptive data science is. One solution is to build a tactical environment in a kind of lab. This would be a restricted environment which, in most organisations, would not be controlled by IT, and which would allow Data Scientists and Engineers to use the newest tools without the same level of governance as other IT programs.
- Making insights actionable: you need data science turned into algorithms in products. This can be extremely difficult if you have a product development function who are not familiar with Data Science methods and code, and who think through a product lens, which is not the same as a Data Science lens. This is a hard needle to thread; I’ll explore this more in future posts.
- Data Community: Data Science needs access to data and customers. But this can be difficult if there are gatekeepers, if say a BI or Analytics function views Data Science as a threat, if a BI function rebrands as Data Science, or if there’s confusion with the customer. I regularly have to explain to people that ‘data science is not BI’. I consider it necessary that we explain our craft to others as much as possible: sometimes all an internal team needs is a report, and sometimes it’s an algorithm.
- Getting and keeping the right people: you need key hires and the market is competitive. However, you may have existing pay structures and job formats which constrain compensation, paying top performers and the use of hiring agencies. One solution is to have a clear progression structure and performance management, with a training budget. Avoiding hiring geniuses is a good idea too, especially if you need practicality. Not every organisation needs someone like Yann LeCun 🙂
So why do I still feel that we’re on the cusp of a great opportunity?
Well there are a number of reasons: firstly I think that the ‘store data’ subject has definitely gotten through to a number of companies, however often this data is stored on legacy systems and there is considerable hard work in doing the ‘plumbing’. I think (and Matt Turck mentions this too in his article) that the ‘plumbing’ or ‘infrastructure’ parts of ‘Big Data’ which we may now refer to as ‘data engineering’ are non-trivial, involve a lot of hard work and often involve navigating the security concerns of an organisation.
Secondly — we’ll start to see more companies embracing the Public and Hybrid Cloud. To those of us who have grown up using cloud systems as consumers and in the start-up world this will often seem to be ‘backward’, but it’s a slow battle to evangelise Public Cloud particularly at companies that have a lot of valid concerns about customer data and regulatory compliance. At least for the next 5–10 years we’ll see some sort of mismatch and we’re starting to see the later adopters embrace these technologies. The example of say Capital One last year indicated that it was possible for conservative financial companies to embrace Cloud as a part of their technology strategy — and it’ll be interesting to see who keynotes at the various Cloud conferences over the next few years.
Thirdly, we’re starting to see successful deployments of technologies and maturing processes in businesses. We’re seeing variants of agile that de-risk Data Science projects, and we’re starting to see the rise of Spark (supported by Cloudera and IBM, which gives it a lot of credibility). There is also a maturing ecosystem of tools: AWS, GPUs and libraries like TensorFlow, scikit-learn and MLlib are all production ready, have case studies and are often supported by established companies like some of those mentioned in the article.
Where do we go from here?
If we accept that data science is a disruptive thing that will bring a lot of innovation and change to these non-digital native companies, then we’ll need to speak about what exactly that involves.
People, processes and things — in that order.
People: data scientists and evangelists inside the organisation, good project management, executive sponsorship, etc. A recent article by McKinsey highlights the importance of executive sponsorship. While Data Science is technical, it should always be remembered that the goal is business change, so the executives need to communicate those goals throughout the organisation. It is worth pointing out that data science is a team sport, and it involves a lot more than just hiring data scientists. A good discussion of the challenges of where to put Data Scientists in the organisation is in the following article on First Round Review.
As Jeremy Stanley, the VP of Data Science at Instacart (an Internet-based grocery delivery service), says:
In many ways, data science takes a village — a data scientist in a vacuum can achieve nothing.
Processes: one useful technique is to have a good ‘engagement model’. This is basically a framework for engaging with business stakeholders, which helps make sure that you’re answering the right questions. A good introduction to this is the CoNVO framework from Thinking with Data.
We’ll now talk through some examples of this framework being applied.
- Context — Who are we working with, Big picture goals.
- Need — What particular knowledge are we missing?
- Vision — What would it look like to solve the problem?
“We will build a predictive model using behavioural and social media data to identify users likely to quit; early enough to intervene.”
- Outcome — Who will be responsible for next steps?
- How will we know if we are correct?
“Our dev team will implement the model in a daily batch process, automatically sending email offers. We will hold out some users, and the CEO will receive a weekly email of precision and recall to judge success.”
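To make the Outcome example a little more concrete, here is a minimal sketch of how the weekly precision and recall figures could be computed on the held-out users. Everything in it is an assumption for illustration: the function name `precision_recall`, the user IDs and the data are made up, and a real pipeline would read the model's predictions and the observed churn from wherever they are stored.

```python
def precision_recall(predicted_to_quit, actually_quit):
    """Return (precision, recall) for two collections of user IDs.

    precision: of the users the model flagged, what fraction really quit?
    recall:    of the users who really quit, what fraction did the model flag?
    """
    predicted = set(predicted_to_quit)
    actual = set(actually_quit)
    true_positives = len(predicted & actual)
    # Guard against empty sets so a quiet week does not crash the report.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical held-out users (they received no email offer, so we can
# observe whether they quit without the intervention muddying the result).
flagged = {"u1", "u2", "u3", "u4"}   # users the model predicted would quit
quit_users = {"u2", "u3", "u5"}      # users who actually quit

precision, recall = precision_recall(flagged, quit_users)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The point of the hold-out group is that these two numbers give the stakeholder named in the Outcome a simple, regular signal of whether the model is worth acting on.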
Things: this is where we can speak about Hadoop, Spark, MacBooks and other technology. I feel there are enough articles about that, so you’ll have to go elsewhere. The technology to support a ‘Data Science Lab’, designed from the ground up to support innovation, could also be an example of a ‘thing’. A good case for this is made in the following blog post by someone at the Data Science consultancy Mango Solutions.
The Data Revolution is just beginning
We’ll need more success stories and better tooling, but also established business leaders who can empathise and who understand how to deliver a fundamentally disruptive set of technologies, processes and things in an enterprise. It’s going to be difficult for a range of companies, especially non-digital native companies, as they have legacy systems and don’t have the luxury of starting from scratch. There’ll be a lot of hard work, but it’s a truly exciting opportunity and some of the hardest battles have already been fought.
As Peter Wang, the CTO of the startup Continuum Analytics (who provide consultancy and products aimed at the Enterprise), says:
The data revolution in 2015/2016 is the same as the internet in 2008 it’s just beginning.
We’re in for exciting times and this is why I’m very happy to be involved in this industry.