Talks and Workshops

workshop
Sticky

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. In my Mathematics master I regularly gave talks on technical topics, and previously I worked as a Tutor and Technician in a School in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I’m giving a tutorial called ‘Lies damned lies and statistics’ at PyData London 2016. I’ll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim will be to help those who know either Bayesian statistics or Machine Learning bridge the gap to others.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup which was a slightly adjusted version of  Map of the Stack‘.

At PyData Amsterdam in March 2016- I gave the second Keynote on a ‘Map of the Stack‘.

PyCon Ireland From the Lab to the Factory (Dublin, Ireland October 2015) – I gave a talk on the business side of delivering data products – a trope I used was it is like ‘going from the lab to the factory’. This was a well-received talk based on the feedback and I gave my audience a collection of tools they could use to solve these challenges.

EuroSciPy 2015 (Cambridge, England Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData in Berlin.
The link is here

The blurb for my PyData Berlin talk is mentioned here.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications such as to Quantitative Finance.
I’ll discuss what probabilistic programming is, why should you care and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to studying the problem of ‘rugby sports analytics’ particularly how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Talk in Berlin at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics‘ – where I presented a case study and introduction to Bayesian Statistics to a technical audience. My case study was the problem of ‘how to predict the winner of the Six Nations’. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular Blog Post which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great method for presenting this technical material.

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and Tech Accelerator. This was an introductory talk to a business audience about ‘Data Science and your business‘. I talked about my experience at different small firms, and large firms and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg. This was on ‘Data Science Models in Production‘ discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. This talk was highly successful and I gave a version of this talk at PyCon Italy – held in Florence – in April 2015. The aim of this talk was to explain what a ‘data product’ was, and discuss some of the challenges of getting data science models into production code. I also talked about the tool choices I made in my own case study. It was well-received, high level and got a great response from the audience. Edit: Those interested can see my video here, it was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux I gave a private 5 minute talk on Data Science in the Game industry. Here are the slides. – This is from July 2014

My Mathematical research and talks as a Masters student are all here. I specialized in Statistics and Concentration of Measure. It was from this research that I became interested in Machine Learning and Bayesian Models.

Thesis

My Masters Thesis on ‘Concentration Inequalities and some applications to Statistical Learning Theory‘ is an introduction to the world of Concentration of Measure, VC Theory and I used this to apply to understanding the generalization error of Econometric Forecasting Models.

Interview with a Data Scientist: Juan Pablo Isaza AristizĂĄbal

Standard
I recently gave a keynote at www.pycon.co the first PyCon conference in Colombia. I spoke on Data Science Models in Production, lessons learned and the cultural aspects.
I interviewed a Colombian Data Scientist – Juan Pablo Isaza AristizĂĄbal
1. What project have you worked on do you wish you could go back to, and do better?
Back in 2015 I was working for Tappsi, a popular app to call taxis. They have a huge problem with fulfilling cab demand on peak hours, because BogotĂĄ has a horrible traffic congestion problem, that is only getting worse. So we were using algorithms to try to fulfil as much demand as we could. One of the projects was to predict demand on a 10 minutes future window for each neighbourhood, so that drivers could head to neighbourhoods with the highest odds. I did the real time data ingestion and machine learning, then we published an MVP of the feature, but the algorithm was too slow. In the end the project was never completed because developers were doing other stuff and I ended up optimising other algorithms that increased metrics with half the hassle and complexity. Afterwards I realised that the lack of experience led me to write the program in the wrong language and I made wrong assumptions that led to low performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
I think academia is a great place to learn abstract and complex subjects, while industry is a great place to learn practical and social skills. Its easier to succeed at a working environment if you find a balance between being academic enough without forgetting practical aspects and communication skills with non-tech people. At the university there are brilliant people at cutting edge topics, although they might not know how to deal with more concrete aspects. While in industry you can see fast practical developers that know the latest tools but fall short trying to optimise a SQL query because they don’t know how an indexed query works behind the scenes in a database.
So I would advise you to study as much as you can, but always try to think in the possible applications of what you are learning. Also, communications skills with non tech people is extremely important, I have seen a couple of guys having meetings with sales people and try to explain statistical tests and p-values with no success.
3. What do you wish you knew earlier about being a data scientist?
I wish I had an earlier chance of working on startups. Being in Colombia made it difficult to work in the tech industry; before I had a couple of jobs not related to data science or software development.
4. How do you respond when you hear the phrase ‘big data‘?
I find the term a little misleading and simplistic. Its popular because it’s easy to understand for the general public, while algorithms are not. Although technically the term might just be a synonym with a couple of tools such Amazon redshift and Hadoop. Big data  has enabled new and exciting applications but by no means it’s the only or biggest factor contributing to current advances in the field; new algorithms such as deep neural networks, reinforcement learning and a strong open source community has enabled a lot of improvement over the last few years.
5. What is the most exciting thing about your field?
For me is the excitement of science, which I have always embraced since I was a little kid, and the development speed and practicality of engineering. Being able to take an idea and transforming it to a working prototype in a few days is an amazing feeling; specially machine learning applications are really exciting to work with.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Usually I take an iterative approach, starting with the most obvious relations and the easiest data to handle. Trying to get to the answer in a series of incremental steps as I refine each input and expand the data set. Its similar to how you would build a MVP, first it is simple, then it becomes better with each version, finally the customer or user says: that’s good enough!
7. Can you talk a bit about the state of the tech industry and data science in Colombia? What would you change? What gives you hope?
Data science is coming as a byproduct of software development; there isn’t much, but still we are improving at giant steps. In the last few years startups such as Tappsi, Domicilios, Mercadoni, Rappi or Bunny have become more commonplace than in the past. What gives me hope is the peace deal with FARC guerrilla group, this crucial event will make many more foreigners come here as well as investment that can power new ideas.
I would shift the local focus of many startups for a more global one. This is difficult because the market is small and the economy is far behind other nations, making our problems and solutions different from the more advances economies. Still there are Colombian startups with global focus such as Bunny or VOIQ.
Thanks and best regards!

Cookies

Standard

Based on http://larahogan.me/donuts/

Taken from her blog

Years ago, I found that whenever something awesome happened in my career – maybe I got published, or promoted, or launched a project – I wouldn’t take the time to celebrate the achievement. I’m an achiever by nature, the kind who feels like every day starts at zero. Not deliberately marking these moments left me feeling like I wasn’t actually accomplishing anything. “Oh cool, that A List Apart article went up,” I would think, then move on with my day.

Once I realized that this was happening, I decided to be deliberate about marking achievements by eating one donut. Well, sometimes more than one, if it’s a really big deal. The act of donut-eating has actually helped me feel like I’m accomplishing my career goals. As I started to share this idea with more people, I found that it resonated with others, especially young career-driven women who are routinely achieving goals and furthering their career but don’t take the time to note their own success.

I decided to start celebrating in a public way so that more people may be inspired to find their own ways of marking their career achievements. These are those donuts.

I’ve decided to use chocolate chip cookies instead.

cookie_1.jpg

Cookie 13/2/17 to celebrate giving the keynote at PyCon Colombia

AI in the Enterprise (the problem)

Standard

I was recently chatting to a friend who works as a Data Science consultant in the London Area – and a topic dear to my heart came up. How to successfully do ‘AI’ (or Data Science) in the enterprise. Now I work for an Enterprise SaaS company in the recruitment space, so I’ve got a certain amount of professional interest in doing this successfully.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly it’s worth reflecting on the changes we’ve seen in Consumer apps – Spotify, Google, Amazon, etc – all of these apps have personalised experiences which are enhanced by machine learning techniques depending on the labelled data that consumers provide.

I’ll quote what Daniel Tuckelang (formerly of Linkedin) said about the challenges of doing this in the enterprise.

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there’s little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there’s an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don’t reward it

I’ve personally seen this when working with enterprises, and being a consultant. The data is often very noisy, and while there are techniques to overcome that such as ‘distant supervision‘ it does make things harder than say building Ad-Tech models in the consumer space or customer churn models. Where the problem is more explicitly solvable by supervised techniques.

In my experience and the experience of others. Enterprises are much more likely to try buy in off-the-shelf solutions, but (to be sweepingly general) they still don’t have the expertise to understand/validate/train the models.There are often individuals in small teams here & there who’ve self-taught or done some formal education, but they’re not supported. (My friend Martin Goodson highlights this here)  There needs to be a cultural shift. At a startup you might have a CTO who’s willing to trust a bunch of relatively young data science chaps to try figure out an ML-based solution that does something useful for the company without breaking anything. And it’s also worth highlighting that there’s a difference in risk aversion between enterprises (with established practices  etc) and the more exploratory or R and D mindset of a startup.

The somewhat more experienced of us these days tend to have a reasonable idea of what can be done, what’s feasible, and furthermore how to convince the CEO that it’s doing something useful for his valuation.

Startups are far more willing to give things a go, there’s an existential threat. And not to forget that often Venture Capitalists and the assorted machinery expect Artificial Intelligence, and this is encouraged.

Increasingly I speculate that established companies now outsource their R and D to startups, hence the recent acquisitions like the one by GE Digital.

So I see roughly speaking two solutions to this problem. Two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking & approaches. I know of one large retailer who implemented this by doing 13 week agile projects, they’d do a consultation, then choose one team to build a solution for.

2) Start putting staff through training schemes similar to what is offered by General Assembly (there are others), but do it whole teams at a time, the culture of code review and programmatic analysis has to come back and be implemented at work. Similarly, give the team managers additional training in agile project management etc.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I’d like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald, Mick Delaney and I’m sorry about anyone else I’ve forgotten.

 

2016: In Review

Standard

I’m mostly writing this for me, but maybe it will be interesting to you too! Here’s are some things that happened in 2016. (just to me personally, mostly about programming) This is based on the excellent post by Julia Evans.

Open Source

I continued being involved in PyMC3. This has taught me a lot about programming, the challenges of shipping software. The code reviews by Thomas Wiecki and the others have been amazing.

I helped pick the new logo, worked on becoming a fiscally sponsored project by NumFOCUS. For those of you who don’t know NumFOCUS is an organisation that supports diversity in open source, open source projects and the conferences associated with Open Source. It largely focuses on the Python ecosystem but has branched out to other projects.

Learning about this has taught me a lot about the governance aspects of OSS – and our responsibilities to encourage more people into this ecosystem. I consider that an important part of my duties as a member of the Open Source world.

Talks

  • spoke at PyData London – About Statistics with Python.
  • spoke at the Toulouse Data Science Meetup – I spoke about the PyData ecosystem
  • I keynoted at PyData Amsterdam – I spoke about the current PyData ecosystem and what various tools like Dask, NumPy, Numba, etc are all for.
  • Gave a talk at the Bayesian Mixer in London on the state of PyMC3 I spoke a bit about the new tools in Variational Inference, which has been a research topic of mine for the past year. I wish I had time to finally write some slides on that.

Doing the PyData keynote was kind of exciting/scary (me??? keynote???) and I think it turned out well and I’m happy I did it. I love the PyData community and I’m happy with the talk I gave.

It’s been fun to experience some of the other places that are doing Data Science and forming communities. At each of these events I’ve met a lot of cool people. It’s great to see our industry grow up!

In 2017 I’ll be keynoting in Colombia in Feb at their PyCon Colombia conference. I’m excited to give this talk. I want to goto a conference like NIPS/KDD/ICLR/ICML to stay a bit closer to some of the improvements in the Machine Learning world from Academia/ Industry.

cool: Writing for Hakka Labs

  • I was honoured to be featured on Hakka Labs, Hakka Labs run the excellent Data Eng Conference and some awesome content on their blog. I wrote about Three Things I learned about Machine Learning, this is an ongoing journey where I realise how little I know.

cool: Blog

Some of my favourite posts this year have been.

  • A map of the PyData Stack  – This was an idea floated with Thomas Wiecki before. I finally got around to doing this for my keynote, the aim was to give some people a ‘map of the pydata stack’ and what different tools were for.
  • I interviewed one of my heroes – Greg Linden who helped devise the first Collaborative Filtering algorithm in production at Amazon.
  • I did some other interviews – I liked this one too with Masaaki Horikoshi one of the most prolific contributors to the PyData ecosystem.

I’ll continue to do some interviews over the next year, and hopefully add them to a revised book.

cool: moving to London

I moved to London in late March. I’ve found it very exciting to be close to the Machine Learning community and Data Science community out there. It was a hectic few months adjusting to new job(s) however I’m glad I made the move.

I think everyone should spend sometime in a major city when they’re young.

I hope to blog a bit more about work in the next few months.

cool: Teaching Data Science

My friend John Sandall  mentioned a Teaching Assistant gig at General Assembly.

I helped about 20 students learn more about Data Science, they came from various backgrounds and sharing my own experiences – reminded me that 1) I knew stuff and 2) teaching is hard.

I recommend to all Data Scientists and Engineers if they get the time to teach. It’s a great experience and I learned a lot about what was easy and hard in Machine Learning.

conclusions?

some things that worked:

  • asking a lot of questions about how computers work (not a surprise)
  • working on a team of people who know more stuff than me, and listening to what they have to say
  • asking for advice from people who are more experienced than me.
  • at work, figuring out what’s important to do and then doing the work to get it done, especially if that work is boring / tedious
  • working on one thing at a time (or at least not too many things)
  • getting a bit better at software “process” things like design documents and project plans
  • learning how to mentor junior data scientists – this is something I’m continuing to do
  • learning more about leading teams in ML – which is hard. I’ll not probably be doing too much people stuff over the next few months.

Interview with a Data Scientist: Greg Linden

pexels-photo-90807
Standard
I caught up with Greg Linden via email recently
Greg was one of the first people to work on data science in Industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I’ll quote his bio from Linkedin:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”
img_6330

Greg Linden: Source Personal Website

1. What project have you worked on do you wish you could go back to, and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997.The term wasn’t even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It’s hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It’s a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.

What happens when you import modules in Python

Standard

 

I’ve been using Python for a number of years now – but like most things I didn’t really understand this until I investigated it.

Firstly let’s introduce what a module is, this is one of Python’s main abstraction layers, and probably the most natural one.

Abstraction layers allow a programmer to separate code into
parts that hold related data and functionality.

In python you use ‘import’ statements to use modules.

Importing modules

The

import modu

statement will look for the definition
of modu in a file called `modu.py` in the same directory as the caller
if a file with that name exists.

If it is not found, the Python interpreter will search for modu.py in `Python’s search path`.

Python search path can be inspected really easily

import sys
`>>> sys.path`

Here is mine for a conda env.

['', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pymc3-3.0rc1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpydoc-0.6.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/nbsphinx-0.2.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Sphinx-1.5a1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/recommonmark-0.4.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/CommonMark-0.5.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/tqdm-4.8.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/joblib-0.10.3.dev0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Theano-0.8.2-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpy-1.11.2rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/imagesize-0.7.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/alabaster-0.7.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Babel-2.3.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/snowballstemmer-1.2.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python35.zip', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/plat-darwin', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/lib-dynload', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg']

What is a namespace?

We say that the modules variables, functions, and classes will be available
to the caller through the modules `namespace`, a central concept in programming that
is particularly helpful and powerful in Python. Namespaces provide a scope containing
named attributes that are visible to each other but not directly accessible outside of the namespace.

So there you have it this is an explanation of what happens when you import, and what a namespace is.

This is based on the Hitchikers guide which is well worth a read 🙂

Are RNN’s ready to replace journalists?

Standard

I recently was experimenting with RNN’s in Keras. I used the example and edited it slightly.

This is what I got for Nietzsche – as you can see the answer above to my question is No.

——– diversity: 0.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute and the sense of the superficial for the suffering of the sense of the things of the sayment of the conception of the fact of the suffering and an an and an animation and an art of the subject, the sense of the experience of the souls of the sense of the contrason of the soul” and as a pleasure of the things of the superficially and an anything the suffering of the souls of the senses of th

——– diversity: 0.5
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute that is to find ancient which is comparison that the belief in a soul in his own school of his love, and be a pulses of working to the reciantiating, morality and such a regnisoristic and impatiently
and an animation of the sayment of the actions and proudion of the conscience, the sensible and saint and incensed nowadays something of
the most terest to the superficial and decist of the sen

——– diversity: 1.0
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it able and moral fecth and thus, did alsopisible stinds of what virtuoth experiences–or another which is as still like dne conscience of any men this ethical musiates.

o8i xusted has
among the soul’ yet it is as we
pleasion to ones to you
more courage in the this thus, nexy what is certains by those deming an a myments only
“sight of expsequential time they do all things, that the sensible, for inte

——– diversity: 1.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it abcrude”.

142. can mutly, society, of the long, to beom an
yot. divystess–with theseful, his
poorness of asias and
tactless
life it!–” such one, through pucisomen, just merehonding
hastensce
an
him, old te, the profounded generals, seen fies
everygaing
bale because it
for meardy itsed upon
esprisf. how imvanemed, how he gives to soid of adierch) a pediorice simusreds has slee” in the pri
himse