Talks and Workshops

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my master’s in Mathematics I regularly gave talks on technical topics, and before that I worked as a tutor and technician at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I’m giving a tutorial called ‘Lies, damned lies and statistics’ at PyData London 2016. I’ll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim will be to help those who know either Bayesian statistics or Machine Learning bridge the gap to the other.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup, which was a slightly adjusted version of my ‘Map of the Stack’ keynote.

At PyData Amsterdam in March 2016 I gave the second keynote, on a ‘Map of the Stack’.

PyCon Ireland: ‘From the Lab to the Factory’ (Dublin, Ireland, October 2015) – a talk on the business side of delivering data products; the metaphor I used was that it is like ‘going from the lab to the factory’. Based on the feedback it was well received, and I gave my audience a collection of tools they could use to tackle these challenges.

EuroSciPy 2015 (Cambridge, England, Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData Berlin; the link is here.

The blurb for my PyData Berlin talk is below.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as to Quantitative Finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to the problem of ‘rugby sports analytics’, particularly how to model the winning team in the recent Six Nations in rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Berlin talk at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics’, where I presented a case study and an introduction to Bayesian statistics to a technical audience. My case study was the problem of how to predict the winner of the Six Nations. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular blog post, which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great way to present this technical material.
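
To give a flavour of the sort of model I mean, here is a minimal sketch of a hierarchical Poisson model of match scores – team-level attack and defence strengths plus a home advantage term. Note this is written in PyMC3 syntax rather than the PyMC version I originally used, and the team indices and scores below are purely illustrative, not real Six Nations data.

```python
import numpy as np
import pymc3 as pm

# Illustrative data: home/away team indices and scores for a handful of matches
home_team = np.array([0, 1, 2, 3, 4, 5])
away_team = np.array([5, 4, 3, 2, 1, 0])
home_score = np.array([16, 25, 13, 40, 19, 22])
away_score = np.array([15, 20, 13, 10, 9, 30])
n_teams = 6

with pm.Model() as model:
    # Team-level attack and defence strengths, plus a shared home advantage
    attack = pm.Normal('attack', mu=0, sd=1, shape=n_teams)
    defence = pm.Normal('defence', mu=0, sd=1, shape=n_teams)
    home_adv = pm.Normal('home_adv', mu=0, sd=1)
    intercept = pm.Normal('intercept', mu=3, sd=1)

    # Expected scoring rates on the log scale
    home_theta = pm.math.exp(intercept + home_adv + attack[home_team] - defence[away_team])
    away_theta = pm.math.exp(intercept + attack[away_team] - defence[home_team])

    # Observed points modelled as Poisson counts
    pm.Poisson('home_points', mu=home_theta, observed=home_score)
    pm.Poisson('away_points', mu=away_theta, observed=away_score)

    trace = pm.sample(1000, tune=1000)
```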

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and tech accelerator. This was an introductory talk to a business audience about ‘Data Science and your business’. I talked about my experience at small and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. The aim was to explain what a ‘data product’ is and to discuss some of the challenges of getting data science models into production code, including the tool choices I made in my own case study. The talk was well received, pitched at a high level, and got a great response from the audience, and I gave a version of it at PyCon Italy in Florence in April 2015. Edit: those interested can see my video here; it was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux I gave a private five-minute talk on Data Science in the games industry (July 2014). Here are the slides.

My mathematical research and talks as a Master’s student are all here. I specialized in Statistics and Concentration of Measure, and it was from this research that I became interested in Machine Learning and Bayesian models.

Thesis

My Master’s thesis, ‘Concentration Inequalities and some applications to Statistical Learning Theory’, is an introduction to the world of Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.

Interview with a Data Scientist: Phillip Higgins

Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, among other roles, including some time in Germany.

What project that you have worked on do you wish you could go back and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area and, at the same time, to broaden their knowledge, maintaining this focus on both specialised and general subjects throughout their careers. Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?

Undoubtedly I wish I had known the importance of communication skills across the whole analytics life-cycle. It’s particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of their work. I don’t think it’s a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations, etc.? How do you know what is good enough?

I think it’s important never to lose sight of the business objectives that are the rationale for most data-scientific projects. Although it is essential that businesses allow data science to disprove hypotheses, at the end of the day most of the evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses lies mostly in exploratory data analysis (combined with domain knowledge). It is important to communicate this uncertainty about framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job. I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

Why I joined Channel 4

On the first of this month I joined Channel 4 as a Senior Data Scientist. I’ve not had much time to do any Data Science, but I’ll speak a bit about my projects over the next few months.

(Picture: my new workplace)

As a data scientist I’ve spent time at Amazon and Vodafone, and I chatted to my friends in the community about where I’d go next. Ian mentioned that he was doing some coaching with the data science team at Channel 4.

Firstly, I didn’t know they had a team. Channel 4 is a company that is famous for innovation in the creative arts, and I wasn’t aware they were doing things with data.

I went through the process and found the team interesting to speak to, and after they gave me a few tricky interview questions, I was given an offer.

I was initially a bit scared: I was based in Luxembourg at the time, I’d spent several years of my life there, and my life was there. And as we all know, any change or move is a hard decision to make.

After speaking to my friends and family, and my future colleagues, I eventually agreed to move cities and join.

So why work on data challenges in media? Well, firstly, as part of Channel 4’s strategy we have data on about 14 million 16-34 year olds in the UK. As a data geek that’s tremendously exciting. Over the past few years the teams at Channel 4 have invested heavily in their data infrastructure, leveraging Spark, Hadoop and all sorts of other tools. This is one of the better set-ups I’ve seen in a mature company. The tech stack will evolve, and I’ll be working on driving that too.

I’m fascinated by human behaviour, and helping an organization that brought me content I love, like The IT Crowd and Father Ted, become a more data-driven organization was too big an opportunity to miss.

My team has already worked on some powerful data-driven products, including a new show recommendation engine, customer classifiers for ad serving, and customer segmentations.

I’m looking forward to working on these projects, helping the team grow and seeing what other cool things there are in the media world. On my first day I was already fielding questions from my colleague Will about HiveQL, listening to Thomas talk about topic modelling, and taking part in a stand-up meeting where I heard about the different ongoing projects.

We’re sponsoring the PyData conference in May, which makes me very proud as a data scientist that my employer is involved in such an amazing event. I’ll be speaking about Machine Learning and Statistical models, what their differences are and how to debug both frequentist and Bayesian models.

I’m extremely excited about my next steps, and I look forward to tackling those challenges, particularly with regard to personalisation and recommendations. I’ll undoubtedly be speaking about some of the cool stuff we get up to at Channel 4.

If tackling big data challenges in media interests you – we’re hiring, so reach out to me :). Here is an example job ad with the details.

What does a Data Scientist need to know about Data Governance?

Some terms that have surprised me on data projects are ‘data governance’, ‘data quality’ and ‘master data management’. They’ve surprised me because I’m not an expert in these disciplines and they’re quite different from my Machine Learning work.

The aim of this blog post is to just jot down some ideas on ‘data governance’ and what that means for practitioners like myself.

I chatted to a friend, Friso, who gave a talk on Dirty Data at Berlin Buzzwords. In his talk he mentions ‘data governance’, so I reached out to him to clarify.

I came to the following conclusions, which I think are worth sharing, and which are similar to some of the principles that Enda Ridge talks about when he speaks of ‘Guerrilla Analytics’.

  • Insight 1: Lots of MDM, Data Governance, etc. solutions amount to ‘buy our product’. None of these tools replaces good processes and good people. Technology is only ever an enabler.
  • Insight 2: Good processes and good people are two hard things to get right.
  • Insight 3: Often companies care about ‘fit for purpose’ data, which is much like any other process – insights from statistical quality control or anomaly detection can be useful here.

Practically speaking: make sure you have a map (a workflow capturing your data provenance) and some sort of documentation (metadata or whatever is necessary) describing how you get from the ‘raw data’ given to you by a stakeholder to the data you output.
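
To make the ‘documentation’ point concrete, here is a small illustrative Python sketch – the record_provenance helper and the file paths are hypothetical – that appends a provenance record (input hash, output file, description of the transform) every time you derive a dataset from a stakeholder’s raw dump.

```python
import datetime
import hashlib
import json
from pathlib import Path

def record_provenance(raw_path, output_path, transform_description,
                      metadata_path="provenance.json"):
    """Append a simple provenance record linking a raw input to a produced output."""
    raw_bytes = Path(raw_path).read_bytes()
    record = {
        "raw_file": str(raw_path),
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "output_file": str(output_path),
        "transform": transform_description,
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    meta = Path(metadata_path)
    records = json.loads(meta.read_text()) if meta.exists() else []
    records.append(record)
    meta.write_text(json.dumps(records, indent=2))

# Example (hypothetical paths): note how the cleaned extract was derived
# from the stakeholder's raw dump
record_provenance("raw/stakeholder_dump.csv", "clean/transactions.csv",
                  "dropped duplicate rows, parsed dates, normalised currency codes")
```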

I think adding a huge operational overhead of lots of complicated products, vendors, meetings etc is a distraction, and can lead to a lot of pain.

Adopting some of the ‘infrastructure as code’ ideas is really useful, since code and reproducibility are really important in understanding ‘fit for purpose’ data.

Another good summary comes from Adam Drake on ‘Data Governance’.

If anyone has other views or critiques I’d love to hear about them.

Where does ‘Big Data’ fit into Procurement?

I spent about a year working as an Energy Analyst in Procurement at a large Telecommunications company. I’m by no means an expert but these are my own thoughts on where I feel ‘big data’ fits into procurement.

Firstly, for the sake of this argument, let us consider procurement as the purchase of goods for the rest of a large company – fundamentally, it is a cost-control function for the business. These are some ideas of where ‘big data’ can fit into a procurement organization. The list is by no means exhaustive.

  1. Tools for supporting pricing information. I worked on tools like this in the past, but getting good pricing information helps you benchmark your performance. This is really important if your prices are subject to markets like energy markets or commodity markets.
  2. Machine learning for recognizing contracts – lots of procurement is about dealing with contracts – one could apply natural language processing to finding similar contracts or similar documents. This could be invaluable for lowering costs in organizations.
  3. Total Cost Modelling – when you analyse a complex item in a supply chain, like a phone mast, you’ll find a number of constituent parts such as steel, batteries, etc. For services this gets even more complicated because of the nature of, and lack of visibility into, the costs. One can leverage applied statistics and Monte Carlo simulation to better understand the nature of these variable costs and better model your total cost of ownership (see the sketch after this list).
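
As a sketch of the third idea, here is a minimal Monte Carlo simulation in NumPy. The cost components and the distributions assigned to them are entirely made up; the point is only that simulating variable costs gives you a distribution of total cost of ownership rather than a single point estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims = 100_000

# Illustrative cost components for a phone mast (all figures are made up)
steel = rng.normal(loc=12_000, scale=1_500, size=n_sims)              # structural steel
batteries = rng.normal(loc=4_000, scale=600, size=n_sims)             # backup batteries
energy = rng.lognormal(mean=np.log(8_000), sigma=0.25, size=n_sims)   # energy costs are skewed
maintenance = rng.gamma(shape=2.0, scale=1_500, size=n_sims)          # irregular service visits

total_cost = steel + batteries + energy + maintenance

print(f"Expected total cost of ownership: {total_cost.mean():,.0f}")
print(f"5th-95th percentile range: {np.percentile(total_cost, 5):,.0f} "
      f"to {np.percentile(total_cost, 95):,.0f}")
```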

Since traditional methods for reducing costs are fast evaporating, CPOs (Chief Procurement Officers) should increase the time and effort invested in total cost modelling. In doing so, they will not only inform internal decisions, but also deliver to procurement an opportunity to drive strategy, thereby developing the top line impact modern businesses desire from them.

When it comes to practicalities, building an analytics capability has to start with a definition of the problem and a clear understanding of the boundary conditions. Limiting procurement’s scope by simply working with the data that is easily available will also limit the outcomes. CPOs need to contemplate the relationships between data sources and data points and look for indications of likely trends without direct access to ‘proof’ data.

Particularly of interest to procurement professionals will be the deluge of information from the ‘internet of things’. However, this data needs good governance (it needs to be fit for purpose) and good analysis to take advantage of it. We’ll talk more about such things in the future.

PyData Amsterdam

I recently attended and keynoted at PyData Amsterdam 2016.

(Clockwise from top right – ‘The Sunset when the event was closing’, ‘Peadar Coyle giving a keynote at PyDataAmsterdam’, ‘Video interviews with Holden Karau a Spark expert from IBM’, ‘The organizing committee’, ‘Maciej Kula of Lyst talking about Recommendation Engines’.)

Firstly, this was a wonderful conference: the location (a boat), the food, and the quality of the speakers and discussion were all excellent. The energy of the organizers – most of them from GoDataDriven (a boutique data science/engineering consultancy in Amsterdam) – was great, and there was a good mixture of advanced, intermediate and basic tutorials and talks.

Some highlights: Andreas Mueller, one of the core contributors to scikit-learn, gave a great advanced tutorial covering neural networks, out-of-core functionality, grid search and Bayesian hyperparameter optimization. Like any advanced tutorial it’s hard to know your audience, but I know I’ll be looking at his notebooks again and again.
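
To give a flavour of one of those topics – this is my own minimal sketch, not material from his tutorial – here is a standard scikit-learn grid search over SVM hyperparameters with cross-validation:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# A small built-in dataset stands in for real data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exhaustively search a small grid of SVM hyperparameters with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))
```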

(Sean Owen of Cloudera giving the opening keynote on Data Engineering and Genomics)

(The BBQ on Saturday was awesome; we had a competition to consume beer and burgers, which Giovanni won :) )

(Andreas Mueller of NYU and a core-contributor of Scikit Learn gave a great Advanced Tutorial, the room was so packed it moved the boat!)

(Sergii Khomenko of Stylight gave a talk on Data Science going into Production)

Friso van Vollenhoven, the CTO of GoDataDriven, gave a nice comparison of meetup communities. This was largely an introductory talk, but there were some nice ideas in there, like how to use Neo4j, some variants of matplotlib, and using word2vec via the excellent Gensim library.
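
For anyone who hasn’t tried it, here is a minimal word2vec example with Gensim; the toy corpus is made up, and the vector_size keyword assumes a recent (4.x) Gensim release:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (in practice, e.g. meetup descriptions)
sentences = [
    ["python", "data", "science", "meetup"],
    ["spark", "hadoop", "data", "engineering"],
    ["python", "machine", "learning", "talk"],
]

# Train a small word2vec model on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Query for terms that appear in similar contexts
print(model.wv.most_similar("python", topn=3))
```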

James Powell, one of the NumFOCUS core members, gave an entertaining series of hacks spanning Python 3 and Python 2.7 – it’s worth watching just for his remarkable hackery and subversion. This was slightly different from some of the other, more data-focused talks.

The first keynote was by Sean Owen from Cloudera and this was largely focused on genomics and the data challenges that are out there – and the challenges of growing the data engineering toolkits to keep up with such data.

We had explanations of Julia, NLP, Spark Streaming, PySpark, search relevance, Bayesian methods, out-of-core computation, Pandas, the use of Python in modelling oil and gas, Pokémon recommendation engines, deploying machine learning models, financial mathematics (network theory applied to finance), search-quality analysis and more – sadly I don’t feel I digested everything during the conference. Thankfully the videos and notebooks/slides will go up soon.

I liked Lucas Bernardi’s (of Booking.com) discussion of little tips and tricks for accelerating certain machine learning libraries.

My keynote: I felt very nervous beforehand, but the feedback was positive and over 100 people attended my Sunday morning 9.00 am discussion of the ‘Map of the PyData stack’. I talked about some of the projects I’m most excited about and gave case studies and/or code, mentioning Blaze, Dask, xarray, Bcolz and Ibis. The notebooks are available online and the conversation afterwards was very interesting. One of the most exciting things about using Python for your professional work is that the ecosystem keeps getting better. I also reminded the audience of a theme that came up over beers with various open source contributors: open source needs support, bug fixes and documentation, and it rarely happens for free.
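
As a taste of why I’m excited about tools like Dask, here is a tiny sketch of its out-of-core dataframe API – the file pattern and column names are hypothetical:

```python
import dask.dataframe as dd

# Lazily read a directory of CSVs that may not fit in memory
df = dd.read_csv("events-*.csv")

# Operations build a task graph; nothing is computed yet
daily_totals = df.groupby("date")["revenue"].sum()

# compute() executes the graph, streaming chunks through pandas under the hood
result = daily_totals.compute()
print(result.head())
```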

A highlight for me was Maciej Kula of Lyst, a UK-based fashion startup, who gave a thorough introduction to his work on hybrid recommendation engines. A lot of the audience was very excited about this, since recommendation engines are a common aim for data science teams. He spoke of the mathematics, the learning-to-rank approach, the speed improvements, why to use it, and advantages such as topic extraction and the comparison to word2vec. He’s a very engaging speaker, and some of the insights he shared were fascinating, such as how they use Postgres for deploying their models. I’ll soon be working on recommendation engines in my next job, so I’ll be carefully reading and re-reading his notes and code.
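
For anyone who wants to experiment, the sketch below uses the LightFM library (his open-source hybrid recommender, as far as I know), with the MovieLens sample data standing in for anything fashion-related – treat it as an illustration rather than what Lyst run in production:

```python
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

# MovieLens stands in here for real interaction data
data = fetch_movielens(min_rating=4.0)

# WARP loss optimises the ranking directly, a learning-to-rank approach
model = LightFM(loss="warp", no_components=30)
model.fit(data["train"], epochs=20, num_threads=2)

print("Precision@10:", precision_at_k(model, data["test"], k=10).mean())
```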

The videos will go up soon.

The discussion was excellent, as were the food and beers (there were lots of beers), and it was great to chat with some of the luminaries and core SciPy contributors. I was especially happy to speak to some of the non-technical specialists who attended parts of the conference – a reminder that data science teams need Marketing, Sales, Recruitment and other functions to help them achieve success. It was also great to see 300 Python and data enthusiasts discuss their real-world challenges and how they conquer them. The sponsors – Optiver, GoDataDriven, ING, Continuum, Booking.com and Dato among them – are a good indication of the strength of the Amsterdam data scene. As far as I am aware they are all hiring data scientists and engineers, so I hope that attendees who found the environment interesting also found job opportunities through it.

(Lucas Bernardi of Booking.com sharing some pragmatic advice for data scientists)

(Lunch was excellent)

It is great that we have a community that shares such case studies and best practices, and a community that allows young people like myself to give keynotes in front of such demanding audiences. It is a very exciting time to be doing data science – I don’t think any other career is more exciting. We hear a lot of hype about ‘big data’ and ‘machine learning’; conferences like this, where people share their success stories, are great, and I’m glad there is so much innovation going on in European Data Science!

I look forward to my next PyData – check out www.pydata.org to see where the next one in your own geographical area is.

A map of the PyData Stack

One question you face when you use Python is: what do I do with my data? How do I process and analyze it? The aim of this flow chart is simply to provide an easy-to-use ‘map’ of the PyData stack.

At PyData Amsterdam I’ll present this and explain it in more detail but I hope this helps.

[Infographic: landscape_infographic_colour.png – a map of the PyData stack]

Thanks to Thomas Wiecki, Matt Rocklin, Stephan Hoyer and Rob Story for their feedback and discussion over the last year about this kind of problem. There’ll be a few iterations based on their feedback.

CC-0 (Creative Commons-0) 2016 Peadar Coyle

(I’ll share the source file eventually).

A short email from Marvin Minsky – RIP

As a data scientist I regularly use results based upon the work of Marvin Minsky.

This is an email exchange I had with him about six years ago, when I was working in education and deciding whether to go back to graduate school.

On Mon, Jun 21, 2010 at 10:53 AM, Peadar Coyle <peadarcoyle@googlemail.com> wrote:

Hi Marvin,
I shan’t bore you with how much of an inspiration and role model your work has been for me.
I’m a Mathematics and Physics Graduate student, with an interest in all sorts of problems.
I am particularly writing in regards your OLPC memos, I found them terribly interesting and important especially in regards the Linguistic desert in Mathematics.
Wow, thanks!  Especially, because I have not received many comments about those memos!
I’ve taught Maths in High Schools, and do find that the richness of the subject is destroyed. ‘The National Curriculum’ is held up as some sort of Biblical text and subsequently many students leave without a sense of what a researcher does, nor that Mathematics is a beautiful art form in itself.
Another aspect: although I had the privilege to attend outstanding schools (Fieldston to 8th grade, and then Bronx Science and Andover) — I don’t recall having had the idea (until college) that it was still possible to invent new mathematics.  (I did know there there still was progress in Physics, Chemistry and Biology — but didn’t have the clear idea that Mathematics was still Alive!)
 I used to be taunted as a teenager for wanting to use words like ‘non-linear’ or ‘negative feedback’. This can be discouraging even for ambitious students like myself. I feel that things haven’t got much better. Seymour Papert was correct that we teach quadratic formulas due to technological constraints. Frank Quinn (a topologist) has written a book (on his website) about mathematics education and computers. With demonstrations and Mathematica and visualizations, there is no reason that students can’t learn somethings about Dynamics, Moments of Inertia. Yes some of the integrals are terribly difficult – I even struggle with some of the algebra – but with facilities like Wolfram Alpha there one can learn to check ones work, and not be hindered by such algebraic manipulations.
I haven’t actually used it much, but it surely will be exciting to see what happens when it gets combined with systems (that don’t yet exist) which exploit large collections of common-sense knowledge.
  Gian Carlo Rota pointed out that it is not enough to be computer literate, one should be computer literate squared.
Did you know Mr Rota? I believe he was at MIT as well.
Yes, Rota was a long-time friend.
 Thanks again fro your comments!

I provide this without commentary, just to share how great it is that some of the most inspiring people in my world of Artificial Intelligence and Mathematics have responded to emails.
This link is to his Obituary.