Talks and Workshops


I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I've given. During my Mathematics master's I regularly gave talks on technical topics, and before that I worked as a Tutor and Technician at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I'm giving a tutorial called 'Lies, damned lies and statistics' at PyData London 2016. I'll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim is to help those who know either Bayesian statistics or machine learning bridge the gap to the other.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup, which was a slightly adjusted version of my 'Map of the Stack' talk.

At PyData Amsterdam in March 2016 I gave the second keynote, on a 'Map of the Stack'.

PyCon Ireland: From the Lab to the Factory (Dublin, Ireland, October 2015) – I gave a talk on the business side of delivering data products; the trope I used was that it is like 'going from the lab to the factory'. Judging by the feedback the talk was well received, and I gave my audience a collection of tools they could use to tackle these challenges.

EuroSciPy 2015 (Cambridge, England Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData Berlin – the link is here.

The blurb for my PyData Berlin talk is mentioned here.
Abstract: "Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in Quantitative Finance.
I'll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I'll be applying these methods to the problem of 'rugby sports analytics', particularly how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert."

In May 2015 I gave a preview of my PyData Talk in Berlin at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics‘ – where I presented a case study and introduction to Bayesian Statistics to a technical audience. My case study was the problem of ‘how to predict the winner of the Six Nations’. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular Blog Post which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great method for presenting this technical material.
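
To give a flavour of what such a model looks like in code, here is a minimal, hypothetical PyMC3 sketch – a toy Poisson model of points scored, not the actual model from the talk, with invented team indices and scores:

import numpy as np
import pymc3 as pm

# Toy data: which team played at home in each match and the points they scored.
# These numbers are invented purely for illustration.
home_team = np.array([0, 1, 2, 3, 4, 5])
home_points = np.array([23, 16, 19, 30, 13, 26])
n_teams = 6

with pm.Model() as model:
    # Latent attacking strength for each team
    attack = pm.Normal('attack', mu=0, sd=1, shape=n_teams)
    intercept = pm.Normal('intercept', mu=3, sd=1)
    # Expected points scored by the home side
    theta = pm.math.exp(intercept + attack[home_team])
    points = pm.Poisson('points', mu=theta, observed=home_points)
    trace = pm.sample(1000)

The model in the talk is richer, but the overall shape – priors over latent team strengths, a likelihood over observed scores, then sampling – is the same.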

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and tech accelerator. This was an introductory talk to a business audience about 'Data Science and your business'. I talked about my experience at small and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg. This was on 'Data Science Models in Production', discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a 'data product'. The talk was highly successful, and I gave a version of it at PyCon Italy – held in Florence – in April 2015. The aim of the talk was to explain what a 'data product' is and to discuss some of the challenges of getting data science models into production code. I also talked about the tool choices I made in my own case study. It was well received, pitched at a high level, and got a great response from the audience. Edit: those interested can see my video here; it was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux (July 2014) I gave a private five-minute talk on Data Science in the games industry. Here are the slides.

My Mathematical research and talks as a Masters student are all here. I specialized in Statistics and Concentration of Measure. It was from this research that I became interested in Machine Learning and Bayesian Models.

Thesis

My Master's thesis, on 'Concentration Inequalities and some applications to Statistical Learning Theory', is an introduction to the world of Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.

AI in the Enterprise (the problem)


I was recently chatting to a friend who works as a Data Science consultant in the London area, and a topic dear to my heart came up: how to successfully do 'AI' (or Data Science) in the enterprise. Now I work for an Enterprise SaaS company in the recruitment space, so I've got a certain amount of professional interest in doing this successfully.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly, it's worth reflecting on the changes we've seen in consumer apps – Spotify, Google, Amazon and so on all offer personalised experiences enhanced by machine learning techniques that depend on the labelled data consumers provide.

I'll quote what Daniel Tunkelang (formerly of LinkedIn) said about the challenges of doing this in the enterprise:

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there's little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there's an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don't reward it.

I've personally seen this when working with enterprises and as a consultant. The data is often very noisy, and while there are techniques to overcome that, such as 'distant supervision', it does make things harder than, say, building ad-tech or customer churn models in the consumer space, where the problem is more explicitly solvable by supervised techniques.
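
To make 'distant supervision' concrete, here is a toy, entirely made-up Python sketch: rather than hand-labelling examples, you generate noisy labels from an existing knowledge base and then train an ordinary supervised model on top of them.

# Tiny, made-up knowledge base of (person, company) pairs we already trust.
known_ceo_pairs = {("Jane Doe", "Acme Corp"), ("John Smith", "Globex")}

def weak_label(person, company):
    # Distant supervision heuristic: assume any sentence mentioning a known
    # pair expresses the 'is CEO of' relation. Noisy, but cheap at scale.
    return int((person, company) in known_ceo_pairs)

candidates = [
    ("Jane Doe announced Acme Corp's results.", "Jane Doe", "Acme Corp"),
    ("John Smith toured a Globex factory.", "John Smith", "Globex"),
    ("Ada Lovelace wrote about the Analytical Engine.", "Ada Lovelace", "Acme Corp"),
]
labels = [weak_label(person, company) for _, person, company in candidates]
# labels == [1, 1, 0]; these noisy labels then feed a standard classifier.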

In my experience, and the experience of others, enterprises are much more likely to buy in off-the-shelf solutions, but (to be sweepingly general) they still don't have the expertise to understand, validate and train the models. There are often individuals in small teams here and there who have self-taught or done some formal education, but they're not supported. (My friend Martin Goodson highlights this here.) There needs to be a cultural shift. At a startup you might have a CTO who's willing to trust a bunch of relatively young data scientists to figure out an ML-based solution that does something useful for the company without breaking anything. It's also worth highlighting that there's a difference in risk aversion between enterprises (with their established practices) and the more exploratory R&D mindset of a startup.

The somewhat more experienced of us these days tend to have a reasonable idea of what can be done, what's feasible, and, furthermore, how to convince the CEO that it's doing something useful for the company's valuation.

Startups are far more willing to give things a go – there's an existential threat. And let's not forget that Venture Capitalists and the assorted machinery around them often expect Artificial Intelligence, and encourage it.

Increasingly, I suspect, established companies outsource their R&D to startups – hence recent acquisitions like the one by GE Digital.

So, roughly speaking, I see two solutions to this problem – two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking and approaches. I know of one large retailer who implemented this with 13-week agile projects: they'd run a consultation, then choose one team to build a solution for.

2) Put staff through training schemes similar to those offered by General Assembly (there are others), but do it whole teams at a time, so that the culture of code review and programmatic analysis comes back with them and gets implemented at work. Similarly, give team managers additional training in agile project management and the like.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I'd like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald and Mick Delaney – and apologies to anyone else I've forgotten.

 

2016: In Review

Standard

I'm mostly writing this for me, but maybe it will be interesting to you too! Here are some things that happened in 2016 (just to me personally, mostly about programming). This is based on the excellent post by Julia Evans.

Open Source

I continued being involved in PyMC3. This has taught me a lot about programming and the challenges of shipping software. The code reviews by Thomas Wiecki and the others have been amazing.

I helped pick the new logo and worked on PyMC3 becoming a fiscally sponsored project of NumFOCUS. For those of you who don't know, NumFOCUS is an organisation that supports open source projects, diversity in open source, and the conferences associated with Open Source. It largely focuses on the Python ecosystem but has branched out to other projects.

Learning about this has taught me a lot about the governance aspects of OSS – and our responsibilities to encourage more people into this ecosystem. I consider that an important part of my duties as a member of the Open Source world.

Talks

  • Spoke at PyData London, about statistics with Python.
  • Spoke at the Toulouse Data Science Meetup, about the PyData ecosystem.
  • Keynoted at PyData Amsterdam, on the current PyData ecosystem and what various tools like Dask, NumPy and Numba are all for.
  • Gave a talk at the Bayesian Mixer in London on the state of PyMC3. I spoke a bit about the new tools in Variational Inference, which has been a research topic of mine for the past year. I wish I had time to finally write some slides on that.

Doing the PyData keynote was kind of exciting/scary (me??? keynote???) and I think it turned out well and I’m happy I did it. I love the PyData community and I’m happy with the talk I gave.

It’s been fun to experience some of the other places that are doing Data Science and forming communities. At each of these events I’ve met a lot of cool people. It’s great to see our industry grow up!

In February 2017 I'll be keynoting at PyCon Colombia. I'm excited to give this talk. I also want to go to a conference like NIPS/KDD/ICLR/ICML to stay a bit closer to the improvements coming out of the Machine Learning world in academia and industry.

cool: Writing for Hakka Labs

  • I was honoured to be featured on Hakka Labs, who run the excellent Data Eng Conference and publish some awesome content on their blog. I wrote about Three Things I Learned About Machine Learning – an ongoing journey in which I keep realising how little I know.

cool: Blog

Some of my favourite posts this year have been:

  • A map of the PyData Stack – an idea I had floated with Thomas Wiecki before. I finally got around to doing it for my keynote; the aim was to give people a 'map of the PyData stack' and explain what the different tools are for.
  • I interviewed one of my heroes – Greg Linden, who devised the item-to-item collaborative filtering algorithm used in production at Amazon.
  • I did some other interviews too – I particularly liked this one with Masaaki Horikoshi, one of the most prolific contributors to the PyData ecosystem.

I’ll continue to do some interviews over the next year, and hopefully add them to a revised book.

cool: moving to London

I moved to London in late March. I've found it very exciting to be close to the Machine Learning and Data Science communities there. It was a hectic few months adjusting to new job(s); however, I'm glad I made the move.

I think everyone should spend some time in a major city when they're young.

I hope to blog a bit more about work in the next few months.

cool: Teaching Data Science

My friend John Sandall mentioned a Teaching Assistant gig at General Assembly.

I helped about 20 students, from various backgrounds, learn more about Data Science. Sharing my own experiences reminded me that 1) I do know stuff and 2) teaching is hard.

I recommend teaching to all Data Scientists and Engineers who can find the time. It's a great experience, and I learned a lot about what is easy and what is hard in Machine Learning.

conclusions?

some things that worked:

  • asking a lot of questions about how computers work (not a surprise)
  • working on a team of people who know more stuff than me, and listening to what they have to say
  • asking for advice from people who are more experienced than me.
  • at work, figuring out what’s important to do and then doing the work to get it done, especially if that work is boring / tedious
  • working on one thing at a time (or at least not too many things)
  • getting a bit better at software “process” things like design documents and project plans
  • learning how to mentor junior data scientists – this is something I’m continuing to do
  • learning more about leading teams in ML – which is hard. I probably won't be doing too much people stuff over the next few months.

Interview with a Data Scientist: Greg Linden

I caught up with Greg Linden via email recently.
Greg was one of the first people to work on data science in industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I'll quote his bio from LinkedIn:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”

Greg Linden (source: personal website)

1. What project that you have worked on do you wish you could go back to and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997. The term wasn't even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It's hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It's a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.

What happens when you import modules in Python


 

I’ve been using Python for a number of years now – but like most things I didn’t really understand this until I investigated it.

Firstly, let's introduce what a module is: it is one of Python's main abstraction layers, and probably the most natural one.

Abstraction layers allow a programmer to separate code into parts that hold related data and functionality.

In Python you use 'import' statements to use modules.

Importing modules

The `import modu` statement will look for the definition of modu in a file called `modu.py` in the same directory as the caller, if a file with that name exists.

If it is not found, the Python interpreter will search for modu.py in `Python’s search path`.
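
As a concrete (hypothetical) example, suppose `main.py` and `modu.py` sit side by side in the same directory:

# Layout:
#   project/
#       main.py
#       modu.py      <- defines, say, a function called fetch_data()
#
# Inside main.py:
import modu                # found because modu.py is in the same directory

data = modu.fetch_data()   # call the function through the module object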

Python's search path can be inspected really easily:

>>> import sys
>>> sys.path

Here is mine for a conda env.

['', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pymc3-3.0rc1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpydoc-0.6.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/nbsphinx-0.2.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Sphinx-1.5a1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/recommonmark-0.4.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/CommonMark-0.5.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/tqdm-4.8.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/joblib-0.10.3.dev0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Theano-0.8.2-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpy-1.11.2rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/imagesize-0.7.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/alabaster-0.7.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Babel-2.3.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/snowballstemmer-1.2.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python35.zip', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/plat-darwin', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/lib-dynload', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg']

What is a namespace?

We say that the module's variables, functions, and classes will be available to the caller through the module's `namespace`, a central concept in programming that is particularly helpful and powerful in Python. Namespaces provide a scope containing named attributes that are visible to each other but not directly accessible outside of the namespace.
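
To make that concrete, here is a small (hypothetical) example, again assuming a `modu.py` that defines a function `sqrt`:

import modu              # names stay inside the module's namespace
x = modu.sqrt(4)         # explicit: it is obvious where sqrt comes from

from modu import sqrt    # copies one name into the caller's namespace
x = sqrt(4)              # shorter, but the origin of sqrt is less obvious

from modu import *       # pulls in everything; generally considered bad practice,
                         # because it hides which namespace a name came from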

So there you have it: an explanation of what happens when you import a module, and of what a namespace is.

This is based on The Hitchhiker's Guide to Python, which is well worth a read 🙂

Are RNNs ready to replace journalists?


I was recently experimenting with RNNs in Keras. I used the character-level text generation example and edited it slightly.
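
For anyone who wants to reproduce something similar, here is a stripped-down sketch of a character-level LSTM in Keras (the real example trains on the full Nietzsche text and then samples at different 'diversity'/temperature settings, which is what produces the output below; the toy corpus here is just a stand-in):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Toy stand-in corpus; the real example downloads the full Nietzsche text.
text = "the superficial sense of the suffering of the soul " * 200
chars = sorted(set(text))
char_idx = {c: i for i, c in enumerate(chars)}

# Cut the text into overlapping windows of maxlen characters,
# each paired with the character that follows it.
maxlen, step = 40, 3
sentences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode the windows (x) and the next characters (y).
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, c in enumerate(sentence):
        x[i, t, char_idx[c]] = 1
    y[i, char_idx[next_chars[i]]] = 1

# A single LSTM layer predicting a distribution over the next character.
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(x, y, batch_size=128, epochs=1)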

This is what I got for Nietzsche – as you can see, the answer to the question in the title is no.

——– diversity: 0.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute and the sense of the superficial for the suffering of the sense of the things of the sayment of the conception of the fact of the suffering and an an and an animation and an art of the subject, the sense of the experience of the souls of the sense of the contrason of the soul” and as a pleasure of the things of the superficially and an anything the suffering of the souls of the senses of th

——– diversity: 0.5
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute that is to find ancient which is comparison that the belief in a soul in his own school of his love, and be a pulses of working to the reciantiating, morality and such a regnisoristic and impatiently
and an animation of the sayment of the actions and proudion of the conscience, the sensible and saint and incensed nowadays something of
the most terest to the superficial and decist of the sen

——– diversity: 1.0
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it able and moral fecth and thus, did alsopisible stinds of what virtuoth experiences–or another which is as still like dne conscience of any men this ethical musiates.

o8i xusted has
among the soul’ yet it is as we
pleasion to ones to you
more courage in the this thus, nexy what is certains by those deming an a myments only
“sight of expsequential time they do all things, that the sensible, for inte

——– diversity: 1.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it abcrude”.

142. can mutly, society, of the long, to beom an
yot. divystess–with theseful, his
poorness of asias and
tactless
life it!–” such one, through pucisomen, just merehonding
hastensce
an
him, old te, the profounded generals, seen fies
everygaing
bale because it
for meardy itsed upon
esprisf. how imvanemed, how he gives to soid of adierch) a pediorice simusreds has slee” in the pri
himse

Why Code Review? Or why should I care as a data scientist?


The insightful Data Scientist Trey Causey has written about Software Development Skills for Data Scientists. I'm going to write about my views on Code Review, as a Data Scientist with a few years' experience, and experience delivering Data Products at organizations of varying sizes. I'm not perfect, and I'm still maturing as an Engineer.

A good, thorough introduction to Code Review comes from the excellent team at Lyst – I suggest that as follow-up reading!

The fundamental nugget is that 'code reviews allow you to more effectively collaborate with your peers', and a lot of new Engineers and Data Scientists don't know how to do that. This is one reason why I wrote 'soft skills for data scientists'. This article talks about a technical skill, but I consider it a kind of 'technical communication'.

Here are some views on ‘why code review’ – I share them here as reference, largely to remind myself. I steal a lot of these from this video series.

  • Peer to peer quality engineering and training 

As a Data Science community we are still forming, and we come from various backgrounds, so there's a lot of invaluable knowledge to be gained from others on the team. Don't waste your chance to get it 🙂

  • Catches bugs easily

We all write bugs when we write code, and a reviewer will often catch them quickly.

  • Keeps team members on the same page

  • Domain knowledge
    How do we share knowledge about our domain with others if we don't share code?
  • Project style and architecture
    I'm a big believer in using structured projects like Cookiecutter Data Science, and I'm sure alternatives exist in other languages. Beforehand I had a messy workflow of hacked-together IPython notebooks and no idea what was what – refactoring code into modules is a good practice for a reason 🙂
  • Programming skills
    I learn a lot by reading other people's code – much of the value of being part of an open source project like PyMC3 comes from reading other people's code 🙂

Other good practices

  • PEP8 and Pylint (according to team standards)
  • Code review often, but by request of the author only

One good idea (I think Roland Swingler mentioned this to me) is not to obsess too much about style in review – having a linter do that is better. Otherwise code reviews can become overly critical and pedantic, which can stop people sharing code and leads to the kind of criticism that can shake Junior Engineers in particular, who need psychological safety. As I mature as an Engineer and a Data Scientist I'm aware of this more and more 🙂

Keep code reviews small

  • < 20 minutes, < 100 lines is best
  • Large code reviews make suggestions harder and can lead to bikeshedding

These are my own lessons so far and are based on experience writing code as a Data Scientist – I’d love to hear your views.

3 tips for successful Data Science Projects


I’ve been doing Data Science projects, delivering software and doing Mathematical modelling for nearly 7 years (if you include grad school).

I really don’t know everything, but these are a few things I’ve learned.

Consider this like a 'Joel Test' for Data Science.

  1. Use a reproducible framework like Cookiecutter Data Science. My workflow used to be: use an IPython notebook, forget to name things correctly, and later rediscover messy, badly written code 🙂 I've now turned to a project structure like Cookiecutter's – this has helped me write better, more maintainable code and reminded me to document things and make my work reproducible.
  2. Have a spec for the data science project – all projects should start with a spec agreed between the business stakeholders and the project team. This forces people to clarify what they really want. The project should have a goal – and to clarify, I mean a well-defined goal that is Specific, Measurable, Achievable, Realistic and Time-bounded (SMART).
  3. Make sure your stakeholders are realistic about the 'failure' aspect of R&D. One of the anti-patterns I've encountered in Data Science is stakeholders being immature and not realizing that, for example, 'this Bayesian model doesn't work for this kind of problem' isn't a statement of incompetence but a statement of fact about the world. If organizations can't accept that, they deserve suboptimal Data Science. R&D work is not engineering – failures teach us something too!

What are your views? I’d love to hear them 🙂