AI in the Enterprise (the problem)


I was recently chatting to a friend who works as a data science consultant in the London area, and a topic dear to my heart came up: how to do ‘AI’ (or data science) successfully in the enterprise. I work for an enterprise SaaS company in the recruitment space, so I’ve got a certain amount of professional interest in getting this right.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly, it’s worth reflecting on the changes we’ve seen in consumer apps – Spotify, Google, Amazon, etc. All of these have personalised experiences, enhanced by machine learning techniques that feed on the labelled data consumers provide.

I’ll quote what Daniel Tunkelang (formerly of LinkedIn) said about the challenges of doing this in the enterprise:

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there’s little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there’s an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don’t reward it.

I’ve personally seen this when working with enterprises and as a consultant. The data is often very noisy, and while there are techniques to overcome that, such as ‘distant supervision’, it does make things harder than, say, building ad-tech or customer-churn models in the consumer space, where the problem is more explicitly solvable by supervised techniques.
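To make the ‘distant supervision’ idea concrete, here’s a minimal sketch: heuristic labelling rules generate noisy labels where no hand-labelled data exists. The rules, labels and documents are all invented for illustration.

```python
# Distant supervision sketch: derive noisy labels from heuristics
# instead of hand annotation. Everything here is illustrative.

def label_by_keyword(text):
    """Assign a noisy label from simple keyword heuristics."""
    text = text.lower()
    if "refund" in text or "cancel" in text:
        return "churn_risk"
    if "thanks" in text or "great" in text:
        return "satisfied"
    return "unknown"

documents = [
    "I want a refund immediately",
    "Thanks, great service!",
    "Please update my address",
]

noisy_labels = [label_by_keyword(d) for d in documents]
print(noisy_labels)  # ['churn_risk', 'satisfied', 'unknown']
```

In practice you’d train a supervised model on these noisy labels and rely on volume to wash out some of the noise.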

In my experience, and the experience of others, enterprises are much more likely to buy in off-the-shelf solutions, but (to be sweepingly general) they still don’t have the expertise to understand, validate, or train the models. There are often individuals in small teams here and there who’ve self-taught or done some formal education, but they’re not supported (my friend Martin Goodson highlights this here). There needs to be a cultural shift. At a startup you might have a CTO who’s willing to trust a bunch of relatively young data science chaps to figure out an ML-based solution that does something useful for the company without breaking anything. It’s also worth highlighting the difference in risk aversion between enterprises (with established practices etc.) and the more exploratory R&D mindset of a startup.

The somewhat more experienced of us these days tend to have a reasonable idea of what can be done, what’s feasible, and, furthermore, how to convince the CEO that it’s doing something useful for the company’s valuation.

Startups are far more willing to give things a go – there’s an existential threat driving them. And let’s not forget that venture capitalists and the assorted machinery often expect artificial intelligence, which encourages this.

Increasingly, I suspect that established companies now outsource their R&D to startups – hence recent acquisitions like the one by GE Digital.

So I see, roughly speaking, two solutions to this problem – two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking and approaches. I know of one large retailer who implemented this with 13-week agile projects: they’d run a consultation, then choose one team to build a solution for.

2) Start putting staff through training schemes similar to what General Assembly offers (there are others), but do it whole teams at a time, so that the culture of code review and programmatic analysis comes back with them and gets implemented at work. Similarly, give the team managers additional training in agile project management and the like.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I’d like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald and Mick Delaney – and apologies to anyone I’ve forgotten.

 


Why code review? Or: why should I care as a data scientist?


The insightful data scientist Trey Causey has written about software development skills for data scientists. I’m going to write about my views on code review – as a data scientist with a few years’ experience, including experience delivering data products at organizations of varying sizes. I’m not perfect, and I’m still maturing as an engineer.

A good, thorough introduction to code review comes from the excellent team at Lyst – I suggest that as follow-up reading!

The fundamental nugget is that ‘code reviews allow you to more effectively collaborate with your peers’, and a lot of new engineers and data scientists don’t know how to do that. This is one reason why I wrote ‘soft skills for data scientists’. This article talks about a technical skill, but I consider it a kind of ‘technical communication’.

Here are some views on ‘why code review’ – I share them here as a reference, largely to remind myself. I steal a lot of these from this video series.

  • Peer-to-peer quality engineering and training

As a data science community that is still forming – with people coming from various backgrounds – there’s a lot of invaluable knowledge in the rest of the team. Don’t waste your chance to pick it up 🙂

  • Catches bugs easily

We all write bugs when we write code; a second pair of eyes catches many of them cheaply.

  • Keeps team members on the same page

  • Domain knowledge 
    How do we share knowledge about our domain to others without sharing code?
  • Project style and architecture
    I’m a big believer in using structured projects like Cookiecutter Data Science, and I’m sure alternatives exist in other languages. Beforehand I had a messy workflow – hacked-together IPython notebooks and no idea what was what. Refactoring code into modules is a good practice for a reason 🙂
  • Programming skills
    I learn a lot myself by reading other people’s code – a lot of the value of being part of an open-source project like PyMC3 is that I learn from reading other people’s code 🙂
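As a small illustration of the refactoring point above, here’s the kind of change a review often prompts – a hacked-together notebook cell turned into a small, testable function. The names and data are invented.

```python
# Before review: this logic lived inline in a notebook cell.
# After review: a named, documented, testable function.

def clean_prices(raw_prices):
    """Drop missing values and negative prices; return floats."""
    cleaned = []
    for p in raw_prices:
        if p is None:
            continue  # skip missing entries
        value = float(p)
        if value >= 0:
            cleaned.append(value)  # keep only valid, non-negative prices
    return cleaned

print(clean_prices([10, None, -3, "4.5"]))  # [10.0, 4.5]
```

Once the logic has a name and a docstring, a reviewer can reason about it – and a unit test can pin it down.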

Other good practices

  • PEP8 and Pylint (according to team standards)
  • Code review often, but by request of the author only

I think it’s a good idea (Roland Swingler mentioned this to me, I believe) not to obsess too much about style – having a linter do that is better. Otherwise code reviews can become overly critical and pedantic, which can stop people sharing code and leads to criticism that can shake junior engineers in particular, who need psychological safety. As I mature as an engineer and a data scientist I’m aware of this more and more 🙂
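As a concrete example of letting the linter own style arguments, a team can check in a shared configuration – here a hypothetical flake8 fragment (the limits are illustrative, not a recommendation):

```ini
# Hypothetical shared linter config: style disputes go here, not into reviews.
[flake8]
max-line-length = 99
extend-ignore = E203
```

With this in the repository, “your lines are too long” stops being a review comment and becomes a CI failure instead.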

Keep code small

  • < 20 minutes, < 100 lines is best
  • Large code reviews make suggestions harder and can lead to bikeshedding

These are my own lessons so far and are based on experience writing code as a Data Scientist – I’d love to hear your views.

Data Science Delivered – Consulting skills


I recently gave a lightning talk at the PyData Meetup London, where I talked about ‘consulting skills for data scientists’.

The slides are here:

https://speakerdeck.com/springcoil/consulting-skills-for-data-scientists 

My thoughts

These thoughts are not just about ‘consulting skills’ but something more nuanced – general soft skills and business skills – which are essential for those of us working in a commercial environment. I’m still improving these skills, but they’re important to me and I take them seriously. I present some bullet points that are worth further thought – I’ll try to tackle them in more detail in future blog posts.

  • Business skills become more necessary as you gain experience as a data scientist – you’re taking part in a commercial environment.
  • All projects involve risk, and this needs to be communicated clearly to clients – whether they’re internal or external.
  • Negotiation is a useful skill to pick up too.
  • Maturing as an engineer involves being able to make estimates, stick to them, and take part in a joint activity with other people.
  • Leadership of technical projects is something I’m exploring lately – a great post is by John Allspaw (current CTO of Etsy). http://www.kitchensoap.com/2012/10/25/on-being-a-senior-engineer/ 
  • My friend John Sandall talked about this at the meetup too. He talked more about ‘soft skills’ and has some links to some books etc.
  • Learning to write and communicate is incredibly valuable. I recommend the Pyramid Principle as a book for this.
  • For product delivery and de-risking projects, I recommend the book ‘The Lean Startup‘ – it can be really good regardless of the organization you’re in.
  • Modesty forbids me to recommend my own book but it has some conversations with data scientists about communication, delivery, and adding value throughout the data science process.
  • Editing and presenting results is really important in data science. In one project, I simplified a lot of complex modelling to just an if-statement by focusing on the business deliverables and the most important results of the analysis. Getting an if-statement into production is trivial; a random forest model is a lot more complicated. John Foreman has written about this too.
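To illustrate the if-statement point (with an entirely invented rule and threshold – this is not the model from that project), the production artifact can be as simple as:

```python
# A complex model distilled to a threshold rule for production.
# Feature names and the cutoff are hypothetical.

def flag_for_review(order_value, past_chargebacks):
    """Flag high-value orders from accounts with prior chargebacks."""
    return order_value > 500 and past_chargebacks > 0

print(flag_for_review(750, 1))  # True
print(flag_for_review(750, 0))  # False
```

Deploying this needs no model-serving infrastructure at all – which is exactly the trade-off being described.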

In short, we’re a new discipline – but we have a lot to learn from other consulting and engineering disciplines. Data science may be new – but people aren’t 🙂

 

Interview with a Data Scientist: Phillip Higgins


Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, amongst other roles, some of them in Germany.

What project that you have worked on do you wish you could go back and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects.  On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area and, at the same time, to broaden their knowledge, maintaining this focus on both specialised and general subjects throughout their careers.  Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems.  I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?

Undoubtedly, I wish I had known the importance of communication skills in the whole analytics life-cycle.  It’s particularly important to be able to communicate findings to a wide audience, and so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of their work.  I don’t think it’s a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time that Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations, etc.?  How do you know what is good enough?

I think it’s important never to lose sight of the business objectives that are the rationale for most data-scientific projects.  Although it is essential that businesses allow for data science to disprove hypotheses, at the end of the day most evidence will be proving hypotheses (or disproving the null hypothesis).  The path to formulating those hypotheses lies mostly in exploratory data analysis, combined with domain knowledge.  It is important to communicate this uncertainty around framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job.  I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach.  A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence.  Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

What does a Data Scientist need to know about Data Governance?


Some terms that have surprised me on data projects are ‘governance’, ‘data quality’ and ‘master data management’. They’ve surprised me because I’m not an expert in this discipline and it’s quite different to my machine learning work.

The aim of this blog post is to jot down some ideas on ‘data governance’ and what it means for practitioners like myself.

I chatted to my friend Friso, who gave a talk on dirty data at Berlin Buzzwords.

In his talk he mentions ‘data governance’, so I reached out to him to clarify.

I came to the following conclusions, which I think are worth sharing, and which are similar to some of the principles Enda Ridge talks about when he speaks of ‘Guerrilla Analytics‘.

  • Insight 1: Lots of MDM, data governance, etc. solutions are just ‘buy our product’. None of these tools replaces good process and good people. Technology is only ever an enabler.
  • Insight 2: Good process and good people are two things that are hard to get right.
  • Insight 3: Often companies care about ‘fit for purpose’ data, which is much like any other process output – insights from statistical quality control or anomaly detection can be useful here.
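As a sketch of what a statistical-quality-control-style check on ‘fit for purpose’ data might look like (thresholds and data are invented, and only the standard library is used):

```python
# Control-chart-style check: flag a batch whose mean drifts more than
# n standard deviations from a reference sample. Illustrative only.

import statistics

def batch_in_control(reference, batch, n_sigmas=3.0):
    """Return True if the batch mean lies within n_sigmas of the
    reference mean."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(batch) - mu) <= n_sigmas * sigma

reference = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0]
print(batch_in_control(reference, [10.0, 10.1, 9.9]))   # True
print(batch_in_control(reference, [14.0, 15.2, 13.8]))  # False
```

Checks like this are cheap to run on every incoming batch, which is the point: fitness for purpose as a routine process measurement, not a one-off audit.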

As practical considerations: make sure you have a map (or workflow) capturing your data provenance, and some sort of documentation (metadata or whatever is necessary) describing how you get from the ‘raw data’ given to you by a stakeholder to the outputted data.

I think adding a huge operational overhead of lots of complicated products, vendors, meetings etc is a distraction, and can lead to a lot of pain.

Adopting some of the ‘infrastructure as code’ ideas is really useful, since code and reproducibility are really important in understanding ‘fit for purpose’ data.

Another good summary comes from Adam Drake on ‘Data Governance’.

If anyone has other views or critiques I’d love to hear about them.

A map of the PyData Stack


One question you have when you use Python is: what do I do with my data? How do I process and analyze it? The aim of this flow chart is to provide a simple-to-use ‘map’ of the PyData stack.

At PyData Amsterdam I’ll present this and explain it in more detail, but I hope this helps in the meantime.

landscape_infographic_colour.png

Thanks to Thomas Wiecki, Matt Rocklin, Stephan Hoyer and Rob Story for their feedback and discussion over the last year about this kind of problem. There’ll be a few iterations based on their feedback.

CC-0 (Creative Commons-0) 2016 Peadar Coyle

 

(I’ll share the source file eventually).

Interview with a Data Scientist: Ivana Balazevic


Ivana Balazevic is a data scientist at Wise.io, a Berkeley-based startup, where she works in a small team of data scientists on solving customer service problems for different clients. She did her bachelor’s degree in computer science at the Faculty of Electrical Engineering and Computing in Zagreb, and she recently finished her master’s degree in computer science with a focus on machine learning at the Technical University Berlin.

 

1. What do you think about ‘big data’?

I try not to think about it that much, although nowadays that’s quite hard to avoid. 🙂 It’s definitely an overused term, a buzzword.

I think that adding more and more data can certainly be helpful up to a point, but the outcome of the majority of the problems people are trying to solve depends primarily on the feature engineering process, i.e. on extracting the necessary information from the data and deciding which features to create. However, I’m certain there are problems out there which require large amounts of data – but they are definitely not so common that the whole world should obsess about them.

 

2. What is the hardest thing for you to learn about data science?

I would say the hardest things are those which can’t be learned at school, but which you gain through experience. Coming out of school and working mostly on toy datasets, you are rarely prepared for the messiness of real-world data. It takes time to learn how to deal with it, how to clean it up, select the important pieces of information, and transform this information into good features. Although that can be quite challenging, it is a core part of the creative process of data science and one of the things that make it so interesting.

 

3. What advice do you have for graduate students in the sciences who wish to become Data Scientists?

I don’t know if I’m qualified enough to give such advice, being a recent graduate myself, but I’ll try to write down things that I learned from my own experience.

Invest time in your math and statistics courses, because you’re going to need them. Take on a side project, which might give you a chance to learn some new programming concepts and introduce you to interesting datasets. Do your homework and don’t be afraid to ask questions whenever you don’t understand something in a lecture – the best time to learn the basics is now, and it’s much harder to fill those holes in your knowledge later than to learn everything the right way from the beginning.

 

4. What project would you go back to and change? How would you change it?

Most of them! I often catch myself looking back at a project I did a couple of years ago and wishing I knew then what I know now. The most recent example is my master’s thesis: I wish I had tried out some things I didn’t have time for, but I hope I’ll manage to find some time to work on it further in the next couple of months.

 

5. How do you go about scoping a data science project?

Usually when I’m faced with a new dataset, I get very excited about it and can’t wait to dig into it, which gets in the way of all the planning that should have been done beforehand. I hope I’ll manage to become more patient about it with time and learn to do it the “right” way.

One of the things that I find a bit limiting about the industry is that you often have to decide whether something is worth the effort of trying it out, since there are always certain deadlines you need to hold on to. Therefore, it is very important to have a clear final goal right from the beginning. However, one needs to be flexible and take into account that things at the end user’s side might change along the way and be prepared to adapt to the user’s needs accordingly.

 

6. What do you wish you knew earlier about being a data scientist?

That you don’t spend all of your time doing the fun stuff! A lot of the work data scientists do goes into getting the data, converting it into the right format, cleaning it up, battling different encoding issues, writing tests for the code you wrote, etc. When you sum everything up, you spend only part of your time doing the actual “data science magic”.

 

7. What is the most exciting thing you’ve been working on lately?

We are a small team of data scientists at Wise who are working on many interesting projects. I am mostly involved with the natural language processing tasks, since that is the field I’m planning to do my PhD in starting this fall. My most recent project is on expanding the customer service support to multilingual datasets, which can be quite challenging considering the highly skewed language distribution (80% English, 20% all other languages) in the majority of datasets we are dealing with.

 

8. How do you manage learning the ‘soft’ skills and the ‘hard’ skills? Any tips?

Learning the hard skills requires a lot of time, patience, and persistence, and I highly doubt there is a golden formula for it. You just have to read a lot of books and papers, talk to people that are smarter and/or have more experience than you and be patient, because it will all pay off.

Soft skills, on the other hand, somehow come naturally to me. I’m quite an open person and I’ve never had problems talking to people. However, if you do struggle with this, I suggest you take a deep breath, try to relax, focus, and tell yourself that the people you are dealing with are just humans like you, with their good and bad days, their strengths and imperfections. I believe that picturing things this way takes a lot of pressure off your chest and gives you the opportunity to think much more clearly.