What does a Data Scientist need to know about Data Governance?


Some terms that have surprised me on data projects are 'data governance', 'data quality' and 'master data management'. They've surprised me because I'm not an expert in these disciplines, and they're quite different from my Machine Learning work.

The aim of this blog post is to just jot down some ideas on ‘data governance’ and what that means for practitioners like myself.

I chatted to my friend Friso, who gave a talk on Dirty Data at Berlin Buzzwords.

In his talk he mentioned 'data governance', so I reached out to him to clarify what he meant.

I came to the following conclusions, which I think are worth sharing, and which are similar to some of the principles that Enda Ridge talks about when he speaks of 'Guerrilla Analytics'.

  • Insight 1: Many MDM, Data Governance and similar solutions amount to 'buy our product'. None of these tools replace good process and good people; technology is only ever an enabler.
  • Insight 2: Good process and good people are two things that are hard to get right.
  • Insight 3: Often what companies really care about is 'fit for purpose' data, and like any other process this can be monitored – insights from statistical quality control or anomaly detection are useful here (see the sketch below).
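For example, a very lightweight 'fit for purpose' check might just flag values that drift far from what history suggests. A minimal sketch, assuming a simple sigma rule – the column name and threshold here are hypothetical, not from any particular project:

```python
import pandas as pd

def flag_outliers(df: pd.DataFrame, column: str, n_sigma: float = 3.0) -> pd.DataFrame:
    """Return rows whose value lies more than n_sigma standard deviations from the mean."""
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] - mean).abs() > n_sigma * std]

# Toy example: one obviously suspicious meter reading gets flagged.
readings = pd.DataFrame({"kwh": [10.1, 9.8, 10.3, 55.0, 10.0]})
print(flag_outliers(readings, "kwh"))
```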

On the practical side: make sure you have a map (or workflow) capturing your data provenance, and some sort of documentation (metadata or whatever is necessary) tracing the path from the 'raw data' a stakeholder gives you to the data you output. A small sketch of one way to record this follows below.
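One cheap way to do this is to write a small provenance record alongside every derived dataset. A minimal sketch, with illustrative field names rather than any standard schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(raw_path: str, output_path: str, transform: str) -> dict:
    """Build a small metadata dict linking an output dataset back to its raw input."""
    with open(raw_path, "rb") as f:
        raw_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "raw_file": raw_path,
        "raw_sha256": raw_hash,           # lets you detect if the raw data changed
        "transform": transform,           # e.g. the script name or git commit that produced the output
        "output_file": output_path,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage:
# record = provenance_record("raw/energy.csv", "clean/energy.parquet", "clean_energy.py@a1b2c3")
# with open("clean/energy.meta.json", "w") as f:
#     json.dump(record, f, indent=2)
```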

I think adding a huge operational overhead of lots of complicated products, vendors, meetings etc is a distraction, and can lead to a lot of pain.

Adopting some of the 'infrastructure as code' ideas is really useful, since code and reproducibility are really important in understanding 'fit for purpose' data.

Another good summary comes from Adam Drake on 'Data Governance'.

If anyone has other views or critiques I’d love to hear about them.


Where does ‘Big Data’ fit into Procurement?


I spent about a year working as an Energy Analyst in Procurement at a large Telecommunications company. I’m by no means an expert but these are my own thoughts on where I feel ‘big data’ fits into procurement.

Firstly, for the sake of this argument, let us consider procurement as the purchase of goods for the rest of a large company – fundamentally, it is a cost-control function for a business. Here are some ideas of where 'big data' can fit in a procurement organization. The list is by no means exhaustive.

  1. Tools for supporting pricing information. I worked on tools like this in the past; good pricing information helps you benchmark your performance, which is really important if your prices are subject to markets such as energy or commodity markets.
  2. Machine learning for recognizing contracts – a lot of procurement is about dealing with contracts, so one could apply natural language processing to finding similar contracts or similar documents (see the sketch after this list). This could be invaluable for lowering costs in organizations.
  3. Total Cost Modelling – when you analyse a complex item in a supply chain, like a phone mast, you'll find a number of constituent parts such as steel, batteries and so on. For services this gets even more complicated because of the nature and lack of visibility of the costs. One can leverage applied statistics and Monte Carlo simulations to better understand the behaviour of these variable costs and to model your total cost of ownership (a Monte Carlo sketch follows a little further below).

 

Since traditional methods for reducing costs are fast evaporating, CPOs (Chief Procurement Officers) should increase the time and effort invested in total cost modelling. In doing so, they will not only inform internal decisions, but also deliver to procurement an opportunity to drive strategy, thereby developing the top line impact modern businesses desire from them.
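As a flavour of what such a model might look like, here is a minimal Monte Carlo total-cost-of-ownership sketch in numpy. The cost components and distributions are entirely made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # number of simulated scenarios

# Hypothetical cost components for something like a phone mast.
steel = rng.normal(loc=12_000, scale=1_500, size=n)             # structure cost
batteries = rng.normal(loc=3_000, scale=400, size=n)            # backup power
energy = rng.lognormal(mean=np.log(8_000), sigma=0.3, size=n)   # volatile energy spend
maintenance = rng.uniform(low=1_000, high=4_000, size=n)        # service contracts

total_cost = steel + batteries + energy + maintenance
print(f"Expected TCO: {total_cost.mean():,.0f}")
print(f"5th-95th percentile: {np.percentile(total_cost, [5, 95]).round(0)}")
```

The point is not the specific numbers but that you get a distribution of outcomes rather than a single point estimate, which is much more honest about variable costs.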

When it comes to practicalities, building an analytics capability has to start with a definition of the problem and a clear understanding of the boundary conditions. Limiting procurement’s scope by simply working with the data that is easily available will also limit the outcomes. CPOs need to contemplate the relationships between data sources and data points and look for indications of likely trends without direct access to ‘proof’ data.

Of particular interest to procurement professionals will be the deluge of information from the 'internet of things'. However, this data needs good governance (it needs to be fit for purpose) and good analysis to take advantage of it. We'll talk more about such things in the future.

PyData Amsterdam


I recently attended and keynoted at PyData Amsterdam 2016.

(Clockwise from top right – 'The sunset when the event was closing', 'Peadar Coyle giving a keynote at PyData Amsterdam', 'Video interviews with Holden Karau, a Spark expert from IBM', 'The organizing committee', 'Maciej Kula of Lyst talking about Recommendation Engines'.)

Firstly, this was a wonderful conference: the location (a boat), the food, and the quality of the speakers and discussion were all excellent. The energy of the organizers – most of them from GoDataDriven (a boutique data science/engineering consultancy in Amsterdam) – was great, and there was a good mixture of advanced, intermediate and basic tutorials and talks.

Some highlights – Andreas Mueller, one of the core contributors to Scikit-Learn, gave a great advanced tutorial covering neural networks, out-of-core functionality, grid search and Bayesian hyperparameter optimization. Like any advanced tutorial it's hard to know your audience, but I know I'll be looking at his notebooks again and again.
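For readers who haven't tried it, here is a minimal grid search sketch in scikit-learn, in the spirit of what the tutorial covered – this is my own toy example, not Andreas's notebook:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exhaustively try each parameter combination with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.score(X_test, y_test))
```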


(Sean Owen of Cloudera giving the opening keynote on Data Engineering and Genomics)

(The BBQ was awesome on Saturday, we had a competition to consume Beer and Burgers – which Giovanni won 🙂 )


(Andreas Mueller of NYU and a core-contributor of Scikit Learn gave a great Advanced Tutorial, the room was so packed it moved the boat!)


(Sergii Khomenko of Stylight gave a talk on Data Science going into Production)

Friso van Vollenhoven, the CTO of GoDataDriven, gave a nice comparison of meetup communities. This was largely an introductory talk, but there were some nice ideas in there, like how to use Neo4j, some variants of matplotlib, and using Word2vec via the excellent Gensim library.
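For anyone who hasn't played with it, here is a minimal Word2vec sketch using Gensim (the modern 4.x API); the toy corpus is mine and far too small for meaningful vectors, it just shows the shape of the workflow:

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; real use would involve proper tokenization.
sentences = [
    ["pydata", "amsterdam", "talks", "about", "data"],
    ["meetup", "communities", "share", "data", "talks"],
    ["word2vec", "learns", "vectors", "from", "text"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("data", topn=3))
```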

James Powell, one of the NumFOCUS core members, gave an entertaining series of hacks involving Python 3 and Python 2.7 – it's worth watching just because his hackery and subversion are remarkable. This was slightly different from some of the other data-focused talks.

The first keynote was by Sean Owen from Cloudera and this was largely focused on genomics and the data challenges that are out there – and the challenges of growing the data engineering toolkits to keep up with such data.

We had explanations of Julia, NLP, Spark Streaming, PySpark, search relevance, Bayesian methods, out-of-core computation, Pandas, the use of Python in modelling oil and gas, Pokemon recommendation engines, deploying machine learning models, financial mathematics (network theory applied to finance), search quality analysis and more – sadly, I feel I didn't digest everything properly during the conference. Thankfully the videos, notebooks and slides will go up soon.

I liked Lucas Bernardi's (of Booking.com) discussion of little tips and tricks for accelerating certain machine learning libraries.

My keynote – I felt very nervous before this, but the feedback was positive and over 100 people attended my Sunday morning 9.00 am discussion of the 'Map of the PyData stack'. I talked about some of the projects I'm most excited about and gave case studies and/or code, mentioning Blaze, Dask, Xarray, Bcolz and Ibis. The notebooks are available online and the conversation afterwards was very interesting. One of the most exciting things about using Python for your professional work is that the ecosystem keeps improving. I also reminded the audience of a theme that came up over beers with various open source contributors: open source needs support, bug fixes and documentation, and it rarely happens for free.
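As a flavour of why I'm excited about these projects, here is a minimal Dask sketch (the file path and column names are placeholders): the same groupby you would write in pandas, built lazily over many files and only executed when you call compute().

```python
import dask.dataframe as dd

df = dd.read_csv("data/transactions-*.csv")        # lazily references many CSVs
spend = df.groupby("supplier")["amount"].sum()      # builds a task graph, no work done yet
print(spend.nlargest(10).compute())                 # compute() triggers the actual execution
```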

A highlight for me was Maciej Kula of Lyst (a UK-based fashion startup), who gave a thorough introduction to his work on hybrid recommendation engines. A lot of the audience was very excited about this, since recommendation engines are a common goal for data science teams. He spoke about the mathematics, learning-to-rank, the speed improvements, why to use a hybrid approach, and advantages such as topic extraction and the comparison to word2vec. He's a very engaging speaker, and some of the insights he shared were fascinating – such as how they use Postgres for deploying their models. I'll soon be working on recommendation engines in my next job, so I'll be carefully reading and re-reading his notes and code.
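His open-source library LightFM implements these hybrid ideas; here is a minimal sketch on the MovieLens sample that ships with it (his talk went far deeper than this, so treat it only as a starting point):

```python
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

# Bundled MovieLens data, keeping only ratings of 4 or higher as positive interactions.
data = fetch_movielens(min_rating=4.0)

model = LightFM(loss="warp")                  # WARP is a learning-to-rank loss
model.fit(data["train"], epochs=10, num_threads=2)

print("precision@5:", precision_at_k(model, data["test"], k=5).mean())
```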

The videos will go up soon.

The discussion was excellent, as were the food and beers (there were lots of beers), and it was great to chat with some of the luminaries and core SciPy contributors. I was especially happy to speak to some of the non-technical specialists who attended part of the conference – it is a reminder that data science teams need Marketing, Sales, Recruitment and other functions to help them achieve success. And it was great to see 300 Python and data enthusiasts discussing their real-world challenges and how they conquer them. The sponsors, who included Optiver, GoDataDriven, ING, Continuum, Booking.com and Dato, are also a good indication of how healthy the Amsterdam data scene is. As far as I am aware they are all hiring data scientists and engineers, so I hope some of the attendees found interesting job opportunities there.


(Lucas Bernardi of Booking.com sharing some pragmatic advice for data scientists)


(Lunch was excellent)

It is great that we have a community that shares such case studies and best practices, and one that allows young people like myself to give keynotes in front of such demanding audiences. It is a very exciting time to be doing data science – I don't think any other career is more exciting. We hear a lot of hype about 'big data' and 'machine learning'; conferences like this, where people share their success stories, are great, and I'm glad there is so much innovation going on in European data science!

I look forward to my next PyData – check out www.pydata.org to see where the next one in your own geographical area is.

A map of the PyData Stack


One question you have when you use Python is: what do I do with my data? How do I process and analyze it? The aim of this flow chart is to provide an easy-to-use 'map' of the PyData stack.

At PyData Amsterdam I'll present this and explain it in more detail, but I hope it helps in the meantime.

(Infographic: 'A map of the PyData stack')

Thanks to Thomas Wiecki, Matt Rocklin, Stephan Hoyer and Rob Story for their feedback and discussion over the last year about this kind of problem. There’ll be a few iterations based on their feedback.

CC-0 (Creative Commons-0) 2016 Peadar Coyle

 

(I’ll share the source file eventually).