Talks and Workshops


I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. In my Mathematics master I regularly gave talks on technical topics, and previously I worked as a Tutor and Technician in a School in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I’m giving a tutorial called ‘Lies, damned lies and statistics’ at PyData London 2016. I’ll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim is to help those who know either Bayesian statistics or machine learning bridge the gap to the other.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup, a slightly adjusted version of ‘Map of the Stack’.

At PyData Amsterdam in March 2016 I gave the second keynote, ‘Map of the Stack’.

PyCon Ireland (Dublin, October 2015): I gave a talk, ‘From the Lab to the Factory’, on the business side of delivering data products – the trope I used was that it is like ‘going from the lab to the factory’. Based on the feedback it was well received, and I gave my audience a collection of tools they could use to solve these challenges.

EuroSciPy 2015 (Cambridge, England Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData Berlin – the link is here.

The blurb for my PyData Berlin talk is mentioned here.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications such as to Quantitative Finance.
I’ll discuss what probabilistic programming is, why should you care and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to studying the problem of ‘rugby sports analytics’ particularly how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Talk in Berlin at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics‘ – where I presented a case study and introduction to Bayesian Statistics to a technical audience. My case study was the problem of ‘how to predict the winner of the Six Nations’. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular Blog Post which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great method for presenting this technical material.
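
The model itself is easy to sketch. Below is a minimal, illustrative version of the kind of hierarchical attack/defence model described above, written against PyMC3 rather than the original PyMC; the fixtures and scores are invented, and this is not the exact model from the talk or blog post.

```python
# A minimal sketch (assumptions: PyMC3, invented fixtures and scores) of a
# hierarchical attack/defence model for rugby scores, in the spirit of the
# Six Nations case study described above.
import numpy as np
import pymc3 as pm

n_teams = 6
home_team = np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3])   # hypothetical fixtures
away_team = np.array([1, 2, 3, 4, 5, 0, 2, 3, 4, 5])
home_score = np.array([23, 16, 19, 30, 13, 22, 27, 18, 21, 25])
away_score = np.array([13, 20, 10, 3, 26, 9, 14, 12, 17, 6])

with pm.Model() as model:
    # Partially pooled attack and defence strengths for each team
    sd_att = pm.HalfNormal("sd_att", sigma=1.0)
    sd_def = pm.HalfNormal("sd_def", sigma=1.0)
    attack = pm.Normal("attack", mu=0.0, sigma=sd_att, shape=n_teams)
    defence = pm.Normal("defence", mu=0.0, sigma=sd_def, shape=n_teams)
    home_adv = pm.Normal("home_adv", mu=0.0, sigma=1.0)
    intercept = pm.Normal("intercept", mu=3.0, sigma=1.0)

    # Expected scoring rates on the log scale
    home_rate = pm.math.exp(intercept + home_adv + attack[home_team] - defence[away_team])
    away_rate = pm.math.exp(intercept + attack[away_team] - defence[home_team])

    pm.Poisson("home_points", mu=home_rate, observed=home_score)
    pm.Poisson("away_points", mu=away_rate, observed=away_score)

    trace = pm.sample(1000, tune=1000)  # posterior over team strengths
```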

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and tech accelerator. This was an introductory talk to a business audience about ‘Data Science and your business’. I talked about my experience at small and large firms, and the opportunities for data science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. The talk went down well, and I gave a version of it at PyCon Italy, held in Florence, in April 2015. The aim was to explain what a ‘data product’ is, to discuss some of the challenges of getting data science models into production code, and to talk through the tool choices I made in my own case study. It was well received and got a great response from the audience. Edit: those interested can see my video here; it was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux I gave a private five-minute talk on data science in the games industry (July 2014). Here are the slides.

My Mathematical research and talks as a Masters student are all here. I specialized in Statistics and Concentration of Measure. It was from this research that I became interested in Machine Learning and Bayesian Models.

Thesis

My Masters thesis, ‘Concentration Inequalities and some applications to Statistical Learning Theory’, is an introduction to the world of Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.
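
For readers new to the area, the flavour of these results is easy to state. Two representative statements (quoted from memory, with constants that vary between textbooks) are Hoeffding’s inequality and a VC-style generalization bound:

```latex
% Hoeffding's inequality: for independent X_1, ..., X_n taking values in [0, 1],
\Pr\left( \left| \tfrac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right| \ge \varepsilon \right)
  \le 2\exp\left( -2 n \varepsilon^{2} \right)

% A VC-type bound: for a hypothesis class H of VC dimension d, with probability
% at least 1 - \delta over an i.i.d. sample of size n, for every h in H,
R(h) \le \widehat{R}_n(h) + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}
```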

Interview with a Data Scientist – Jessica Graves


Jessica Graves is a Data Scientist who currently works on fashion problems in New York City. She’s worked with Hilary Mason at Fast Forward Labs and keeps in regular contact with the London startup scene. After many months of asking her for an interview she finally gave in – and she shares her unique perspective on the datafication of Fashion. She comes from a background in visual and performing arts, as well as fashion design. In her spare time you’ll find her reading a stack of papers or studying dance.

Cover image: unsplash.com, CC0


  1. What project that you’ve worked on do you wish you could go back to and do better?

I worked with Dr. Laurens Mets on an iteration of the technology behind Electrochaea, a device where microbes convert waste electricity to clean natural gas. My job was to translate models from electrochemistry journals into code, to help simulate, measure and optimize the parameters of the device. We needed to facilitate electron transport and keep the microbes happy. Read papers, write code, and design alternative energy technology with math + data?! I would hand my past self How to Design Programs as a guide and learn to re-implement from scratch in an open source language. 

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Listen! If you are a data scientist, listen carefully to the business problems of your industry, and see the problems for what they are, rather than putting the technical beauty of and personal interest in the solution first and foremost. You may find it’s more important to you to work with a certain type of problem than it is to work at a certain type of company, or vice versa. Watch very carefully when your team expresses frustration in general – articulate problems that no one knows they should be asking you to solve. At the same time, it can be tempting to work on a solution that has no problem. If you’re most interested in a specific machine learning technique, can you justify its use over another, or will high technical debt be a serious liability? Will a project be leveragable (legally, financially, technically, operationally)? Can you quantify the risk of not doing a project? 

  3. What do you wish you knew earlier about being a data scientist?

I wish I realized that data science is classical realist painting.

Classical realists train to accurately represent a 3D observation as a 2D image. In the strictest cases, you might not be allowed to use color for 1-3 years, working only with a stick of graphite, graduating to charcoal and pencils, eventually monotone paintings. Only after mastering the basics of form, line, value, shade, tone, are you allowed a more impactful weapon, color. With oil painting in particular, it matters immensely in what order at what layer you add which colors, which chemicals compose each color, of which quality pigment, at what thickness, with what ratio of which medium, with which shape of brush, at what angle, after what period of drying. Your primary objective is to continuously correct your mistakes of translating what you observe and suspending your preconception of what an object should look like.

There are many parallels with data science. At no point as a classical realist painter should you say, ‘well it’s a face, so I’m going to draw the same lines as last time’, just as, as a data scientist, you should look carefully at the data before applying algorithm x, even if that’s what every blog post Google surfaces to the top of your results says to do in that situation. You have to be really true to what you observe and not what you know – sometimes a hand looks more like a potato than a hand, and obsessing over anatomical details because you know it’s a hand is a mistake. Does it produce desirable results in the domain of problems that you’re in? Are you assuming Gaussian distributions on skewed data? Did you go directly to deep learning when logistic regression would have sufficed? I wish I knew how often data science course offerings are paint by numbers. You won’t get very far once the lines are removed, the data is too big to extract on your laptop, and an out-of-memory error pops up running what you thought was a pretty standard algorithm on the subset you used instead. Let alone that you have to create or harvest the data set in the first place – or sweet talk someone into letting you have access to it.
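
To make the baseline-first point concrete, here is a minimal sketch (the file name and column names are hypothetical) of the kind of sanity check being described: look at the data, fit a simple logistic regression, and only reach for something heavier if the baseline clearly falls short.

```python
# A hedged sketch: inspect the data, then fit a simple baseline before anything fancy.
# "events.csv" and the "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("events.csv")
print(df.describe())                                   # skew, ranges, missing values
print(df["label"].value_counts(normalize=True))        # class balance

X = df.drop(columns=["label"])                         # assumes numeric features
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
# Only if this clearly falls short is a heavier model worth its operational cost.
```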

In addition, Nulla dies sine linea – it’s true for drawing, ballet, writing. It’s true for data science. No day without a line. It’s very difficult to achieve sophistication without crossing off days and days of working through code or theoretical examples (I think this is why Recurse Center is so special for programmers). Sets of bland but well-executed tiny pieces of software. Unspectacular, careful work in high volumes raises the quality of all subsequent complex works. Bigger, slower projects benefit from myriads of partially explored pathways you already know not to take.

Also side notes to my past self: Linux. RAM. Thunderbolt ports. 

  4. How do you respond when you hear the phrase ‘big data’?

Big data? Like in the cloud? Or are we in the fog now? Honestly the first thing I see in my mind is PETABYTES. I think of petabytes of selfies raining from the sky and flowing into a data lake. Stagnant. Data-efficient AI is all the rage — less data, more primitives, smarter agents. In the meantime, optimizing hardware and code to work with large datasets is pretty fun. Fetishizing the size of the data works well …as long as you don’t care about robustness to diverse inputs. Can your algorithm do well with really niche patterns? What can you do with the bare minimum amount of data? 

  5. What is the most exciting thing about your field?

Fashion is visual. It’s inescapable. Every culture has garb or adornment, however minimal. A few trillion dollars of apparel, textiles, and accessories across the globe. The problems of the industry are very diverse and largely unsolved. A biologist might come to fashion to grow better silk. An AI researcher might turn to deep learning to sift through the massive semi-structured set of apparel images available online. So many problems that may have a tech solution remain unsolved. Garment manufacturing is one of the most neglected areas of open source software development. LVMH and Richemont don’t fight over who provided the most sophisticated open-source tools to researchers the way that Amazon and Google do. You can start a deep learning company on a couple of grand and use state-of-the-art software tools for cheap or free. You cannot start an apparel manufacturing vertical using state-of-the-art tools without serious investment, because the climate is still extremely unfavorable to support a true ecosystem of small-scale independent designers. The smartest software tools for the most innovative hardware are excessively expensive, closed-source, and barely marketed – or simply not talked about in publicly accessible ways. Sewing has resisted automation for decades, although it is finally at a point where the joining of fabrics into a seam can be automated by robots, with computer vision used on a thread-by-thread basis to determine the location of the next stitch.

High end, low end, or somewhere in between, the apparel side of fashion’s output is a physical object that has to be brought to life from scratch, or delivered seamlessly, to a human, who will put the object on their body. Many people participate in apparel by default, but the fashion crowd is largely self-selected and passionate, so it’s exciting (and difficult) to build for such an engaged group that don’t fit standard applications of standard machine learning algorithms.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

Artists learn this eventually: volume of works produced trumps perfectionism. Even to match something in classical realism, you start with ridiculous abstractions. Cubes and cylinders to approximate heads and arms. Break it down into the smallest possible unit. Listen to Polya, “If you can’t solve a problem, then there is an easier problem you can solve: find it.”

As for when to finish? Nothing is ever good enough. The thing that is implemented is better than the abstract, possibly better thing, for now, and will probably outlive its original intentions. But make sure that solution correlates thoroughly with the problem, as described in the words of the stakeholder. Otherwise, for a consumer-facing product or feature, your users will usually give you clues as to what’s working.

  7. You spent some time as a consultant in data analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Be open. Fashion has a lot of space for innovation if you understand and quantify your impact on problems that are actually occurring and costing money or time, and show that you can solve them fast enough. “We built this new thing” has absolutely nothing to do with “We built this useful thing” and certainly not “We built this backwards-compatible thing”. You might be tempted to recommend a “new thing” and then complain that fashion isn’t sophisticated enough or “data” enough for it. As an industry that in some cases has largely ignored data in favour of gut feelings, with serious payoff, I think the attitude should be one of pure respect rather than condescension, and of transitioning rather than scrapping. That, or build your own fashion thing instead of updating existing ones.

  8. You have worked in fashion. Can you talk about the biggest opportunities for data in the fashion industry? Are there cultural challenges with datafication in such a ‘creative industry’?

Fashion needs ‘datafication’ that clearly benefits fashion. If you apply off-the-shelf collaborative filtering to fashion items with a fixed seasonal shelf life, to users who never really interact with them, you’re going to get poor results. Algorithms that work badly in other domains might work really well in fashion with a few tweaks. NIPS had an ecommerce workshop last year, and KDD has a fashion-specific workshop this year, which is exciting to see, although I’ll point out that researchers have been trying to solve textile manufacturing problems with neural networks since the 90s.
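
As a toy illustration of the kind of tweak meant here (not a method from any particular paper or product), one could down-weight items as they approach the end of a fixed seasonal shelf life before ranking recommendations; the column names and decay scale below are invented.

```python
# A toy sketch: re-rank recommender output by a seasonal shelf-life weight.
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "base_score": [0.9, 0.7, 0.8, 0.6],        # e.g. scores from any recommender
    "days_until_season_end": [60, 5, 20, 90],
})

DECAY_SCALE_DAYS = 30  # assumed: how quickly relevance fades as the season end nears

# The weight approaches 1 for items with plenty of shelf life left and 0 as the
# season end arrives, so soon-to-expire stock drops down the ranking.
weight = 1 - np.exp(-items["days_until_season_end"] / DECAY_SCALE_DAYS)
items["adjusted_score"] = items["base_score"] * weight

print(items.sort_values("adjusted_score", ascending=False))
```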

A fashion creative might very well LOVE artificial intelligence, machine learning, and data science if you tailor your language into what makes their lives easier. Louis Vuitton uses an algorithm to arrange handbag pattern pieces advantageously on a piece of leather (not all surfaces of the leather are appropriate for all pattern pieces of the handbag) and marks the lines with lasers before artisans hand-cut the pieces. The artisans didn’t seem particularly upset about this. 

The two main problems I still see right now are the doorman problem and fit. Use data and software to make it simple for designers of all scales to adjust garments to fit their real markets instead of their imagined muses. And, use as little input as possible to help online shoppers know which existing items will fit. Once they buy, make sure they get their packages on time, securely, discreetly. 

Data Science Delivered – Consulting skills


I recently gave a lightning talk at the PyData London meetup, where I talked about ‘Consulting skills for Data Scientists’.

Here are the slides:

https://speakerdeck.com/springcoil/consulting-skills-for-data-scientists 

My thoughts

Some thoughts – these are not just related to ‘consulting skills’ but to something more nuanced: general soft skills and business skills, which are essential for those of us working in a commercial environment. I’m still improving these skills, but they matter to me and I take them seriously. Below are some bullet points worth further thought – I’ll try to tackle them in more detail in future blog posts.

  • Business skills become necessary as you gain experience as a data scientist – you take part in a commercial environment.
  • All projects involve risk, and this needs to be communicated clearly to clients – whether they’re internal or external.
  • Negotiation is a useful skill to pick up too.
  • Maturing as an engineer involves being able to make estimates, stick to them, and take part in a joint activity with other people.
  • Leadership of technical projects is something I’m exploring lately – a great post is by John Allspaw (current CTO of Etsy). http://www.kitchensoap.com/2012/10/25/on-being-a-senior-engineer/ 
  • My friend John Sandall talked about this at the meetup too. He talked more about ‘soft skills’ and has some links to some books etc.
  • Learning to write and communicate is incredibly valuable. I recommend the Pyramid Principle as a book for this.
  • For product delivery and de-risking projects, I recommend the book ‘The Lean Startup‘ – it can be really valuable regardless of the organization you’re in.
  • Modesty forbids me to recommend my own book but it has some conversations with data scientists about communication, delivery, and adding value throughout the data science process.
  • Editing and presenting results is really important in data science. In one project, I simplified a lot of complex modelling down to an if-statement, by focusing on the business deliverables and the most important results of the analysis. Getting an if-statement into production is trivial – a random forest model is a lot more complicated (see the sketch below). John Foreman has written about this too.
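
Here is a minimal sketch of that last point, using synthetic data rather than the project in question: a depth-one decision tree is literally a single if-statement, and comparing it against a random forest shows whether the extra complexity is buying anything.

```python
# A sketch on synthetic data: compare a single-threshold rule with a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# One informative feature among five; shuffle=False keeps it in column 0.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# A depth-1 decision tree ("stump") is a single if-statement on one feature.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
feature, threshold = stump.tree_.feature[0], stump.tree_.threshold[0]
print(f"rule: split on feature {feature} at threshold {threshold:.2f}")

print("stump accuracy: ", accuracy_score(y, stump.predict(X)))
print("forest accuracy:", accuracy_score(y, forest.predict(X)))
# If the gap is small, the if-statement is far easier to put into production.
```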

In short, we’re a new discipline – but we have a lot to learn from other consulting and engineering disciplines. Data science may be new – but people aren’t :)

 

Data Science Delivered


Quick note

  • At the excellent PyData London conference, there was a lot of food for thought.
  • One thing that came up was the concept of ‘data strategy’ – there’s a lot of discussion about how to align, write, or explain how data can help drive business transformation and be part of a business strategy.
  • A similar concept was product delivery, product management and the challenges of delivering value from data.

I bring this up because although it was a technical conference, there was a lot of discussion about non-technical aspects.

I’m sure at some point there’s a track or a training to be done on this sort of thing:)

 

Interview with a Data Scientist: Phillip Higgins


Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, among other roles, including some in Germany.

What project that you’ve worked on do you wish you could go back to and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area while at the same time broadening their knowledge, and to maintain this focus on both specialised and general subjects throughout their careers. Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?

Undoubtedly I wish I had known the importance of communication skills across the whole analytics life-cycle. It’s particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of that work. I don’t think it’s a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think it’s important never to lose sight of the business objectives that are the rationale for most data science projects. Although it is essential that businesses allow data science to disprove hypotheses, at the end of the day most evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses lies mostly in exploratory data analysis, combined with domain knowledge. It is important to communicate this uncertainty around framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job. I have dealt with stakeholders ranging from C-level executives to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

Why I joined Channel 4


On the first of this month I joined Channel 4 as a Senior Data Scientist. I’ve not had much time to do any Data Science, but I’ll speak a bit about my projects over the next few months.


Picture: My new workplace

As a data scientist I’ve spent time at Amazon and Vodafone, and I chatted to my friends in the community about where I’d go next. Ian mentioned that he was doing some coaching with the data science team at Channel 4.

Firstly, I didn’t know they had a team. Channel 4 is famous for innovation in the creative arts, and I wasn’t aware they were doing things with data.

I went through the process and found the team interesting to speak to, and after they gave me a few tricky interview questions, I was given an offer.

I was initially a bit scared. I was based in Luxembourg at the time, where I’d spent several years, and my life was there. And as we all know, any change or move is a hard decision to make.

After speaking to my friends and family, and my future colleagues, I eventually agreed to move cities and take the job.

So why work on data challenges in media? Well, firstly, as part of Channel 4’s strategy we have data on about 14 million 16-34 year olds in the UK. As a data geek, that’s tremendously exciting. Over the past few years the teams at Channel 4 have invested heavily in their data infrastructure, leveraging Spark, Hadoop, and all sorts of other tools. This is one of the better setups I’ve seen in a mature company. The tech stack will evolve, and I’ll be working on driving that too.

I’m fascinated by human behaviour, and helping an organization that brought me content I love, like the IT Crowd and Father Ted, become a more data-driven organization was too big an opportunity to miss.

My team has already worked on some powerful data-driven products, including a new show recommendation engine, customer classifiers for ad serving, and customer segmentations.

I’m looking forward to working on these projects, helping the team grow and seeing what other cool things there are in the media world. On my first day I was already being asked questions by my colleague Will on HiveQL, was listening to Thomas talk about topic modelling and participated in a standup meeting where I heard about the different projects ongoing.

We’re sponsoring the PyData conference in May, which makes me very proud as a data scientist that my employer is involved in such an amazing event. I’ll be speaking about Machine Learning and Statistical models, what their differences are and how to debug both frequentist and Bayesian models.

I’m extremely excited about my next steps, and I look forward to tackling those challenges, particularly in regard to personalisation and recommendations. I’ll undoubtedly be speaking about some of the cool stuff we get up to at Channel 4.

If tackling big data challenges in media interests you – we’re hiring, so reach out to me :). Here is an example job ad with the details.

What does a Data Scientist need to know about Data Governance?


One term that has surprised me on data projects is ‘governance’ or ‘data quality’ or ‘master data management’. It’s surprised me because I’m not an expert in this discipline and it’s quite different to my Machine Learning work.

The aim of this blog post is to just jot down some ideas on ‘data governance’ and what that means for practitioners like myself.

I chatted to a friend Friso who gave a talk on Dirty Data at Berlin Buzzwords.

In his talk he mentions ‘data governance’ and so I reached out to him to clarify.

I came to the following conclusions which I think are worth sharing, and are similar to some of the principles that Enda Ridge talks about when he speaks of ‘Guerilla Analytics‘.

  • Insight 1: Lots of MDM, Data Governance, etc solutions are just ‘buy our product’. None of these tools replace good process and good people. Technology is only ever an enabler.
  • Insight 2: Good process and good people are two hard things to get done right.
  • Insight 3: Often companies care about ‘fit for purpose’ data, which is much the same as any process – insights from statistical quality control or anomaly detection can be useful here (see the sketch below).
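
As a small illustration of that last insight, here is a sketch of a basic ‘fit for purpose’ check: simple quality-control statistics on an incoming table, with a robust outlier rule. The column name and tolerances are assumptions for illustration.

```python
# A minimal data-quality sketch: missing values, duplicates, and a robust
# outlier rule (median absolute deviation) in the spirit of statistical
# quality control. Column name and thresholds are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame, value_col: str = "amount") -> dict:
    values = df[value_col].dropna()
    median = values.median()
    mad = (values - median).abs().median()       # robust spread estimate
    outliers = df[(df[value_col] - median).abs() > 5 * mad]
    return {
        "rows": len(df),
        "missing_rate": df[value_col].isna().mean(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outlier_rows": len(outliers),
    }

df = pd.DataFrame({"amount": [10, 12, 11, 9, 10, 250, None, 11]})
print(quality_report(df))   # flags the 250 as suspicious
```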

Practically, make sure you have a map (or workflow capturing your data provenance) and some sort of documentation (metadata or whatever is necessary) describing how you go from the ‘raw data’ given to you by a stakeholder to the outputted data.
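
One possible shape for that documentation (an illustration of the idea, not any standard) is a lightweight provenance record kept next to the transformation code:

```python
# A hedged sketch of a lightweight provenance/metadata record; all field names
# and example values are hypothetical.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetRecord:
    name: str
    source: str              # where the raw data came from (stakeholder, system)
    received: date
    transform_script: str    # the code that produced the derived table
    output_path: str
    notes: list = field(default_factory=list)

record = DatasetRecord(
    name="invoices_clean",
    source="raw extract from stakeholder system X",
    received=date(2016, 5, 1),
    transform_script="clean_invoices.py",
    output_path="data/derived/invoices_clean.csv",
    notes=["dropped duplicate rows", "amounts converted to EUR"],
)
print(record)
```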

I think adding a huge operational overhead of lots of complicated products, vendors, meetings etc is a distraction, and can lead to a lot of pain.

Adopting some of the ‘infrastructure as code’ ideas is really useful, since code and reproducibility are really important in understanding ‘fit for purpose’ data.

Another good summary comes from Adam Drake on ‘Data Governance’.

If anyone has other views or critiques I’d love to hear about them.

Where does ‘Big Data’ fit into Procurement?


I spent about a year working as an Energy Analyst in Procurement at a large Telecommunications company. I’m by no means an expert but these are my own thoughts on where I feel ‘big data’ fits into procurement.

Firstly, for the sake of this argument, let us consider procurement as the purchase of goods for the rest of a large company – fundamentally it is a cost-control function for a business. These are some ideas of where ‘big data’ can fit in a procurement organization. The list is by no means exhaustive.

  1. Tools for supporting pricing information. I worked on tools like this in the past; good pricing information helps you benchmark your performance. This is really important if your prices are subject to markets, such as energy or commodity markets.
  2. Machine learning for recognizing contracts – a lot of procurement is about dealing with contracts, and one could apply natural language processing to find similar contracts or similar documents (see the sketch after this list). This could be invaluable for lowering costs in organizations.
  3. Total cost modelling – when you analyse a complex item in a supply chain, like a phone mast, you’ll find a number of component parts such as steel, batteries and so on. For services this gets even more complicated because of the nature, and lack of visibility, of the costs. One can leverage applied statistics and Monte Carlo simulations to better understand these variable costs and better model your total cost of ownership (a simulation sketch appears further below).
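
To make point 2 concrete, here is a minimal sketch of surfacing similar contracts with TF-IDF and cosine similarity; the contract texts are placeholders, and a real corpus would come from your document store.

```python
# A hedged sketch of contract similarity with TF-IDF; the texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contracts = [
    "Master services agreement for network maintenance, 24 month term",
    "Energy supply contract, fixed price per MWh, 12 month term",
    "Framework agreement for steel supply and tower construction",
]
query = "electricity supply agreement with fixed pricing"

vectorizer = TfidfVectorizer(stop_words="english")
contract_matrix = vectorizer.fit_transform(contracts)
query_vec = vectorizer.transform([query])

# Rank existing contracts by similarity to the query document
scores = cosine_similarity(query_vec, contract_matrix).ravel()
for text, score in sorted(zip(contracts, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {text}")
```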

 

Since traditional methods for reducing costs are fast evaporating, CPOs (Chief Procurement Officers) should increase the time and effort invested in total cost modelling. In doing so, they will not only inform internal decisions, but also deliver to procurement an opportunity to drive strategy, thereby developing the top line impact modern businesses desire from them.
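
A toy version of the Monte Carlo approach from point 3 above looks like the following; the cost components and their distributions are invented purely for illustration.

```python
# A toy Monte Carlo sketch of total cost of ownership: treat uncertain cost
# components as distributions and simulate the total. All parameters invented.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000

steel = rng.normal(loc=12_000, scale=1_500, size=n_sims)             # structure cost
energy = rng.lognormal(mean=np.log(3_000), sigma=0.3, size=n_sims)   # volatile market price
maintenance = rng.gamma(shape=4.0, scale=500.0, size=n_sims)         # service visits

total_cost = steel + energy + maintenance

print("expected total cost:", round(total_cost.mean()))
print("5th to 95th percentile:", np.percentile(total_cost, [5, 95]).round())
```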

When it comes to practicalities, building an analytics capability has to start with a definition of the problem and a clear understanding of the boundary conditions. Limiting procurement’s scope by simply working with the data that is easily available will also limit the outcomes. CPOs need to contemplate the relationships between data sources and data points and look for indications of likely trends without direct access to ‘proof’ data.

Of particular interest to procurement professionals will be the deluge of information from the ‘internet of things’. However, this data needs good governance (it needs to be fit for purpose) and good analysis to take advantage of it. We’ll talk more about such things in the future.