Talks and Workshops

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I’ve given. During my Mathematics Master’s I regularly gave talks on technical topics, and before that I worked as a tutor and technician in a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

I’m giving a tutorial called ‘Lies, damned lies and statistics’ at PyData London 2016. I’ll be discussing different statistical and machine learning approaches to the same kinds of problems. The aim is to help those who know either Bayesian statistics or machine learning bridge the gap to the other.

Slides and Videos from Past Events

In April 2016 I gave an invited talk at the Toulouse Data Science meetup, which was a slightly adjusted version of ‘Map of the Stack’.

At PyData Amsterdam in March 2016 I gave the second keynote, on a ‘Map of the Stack’.

PyCon Ireland: ‘From the Lab to the Factory’ (Dublin, Ireland, October 2015) – I gave a talk on the business side of delivering data products; the framing I used was that it is like ‘going from the lab to the factory’. Judging by the feedback the talk was well received, and I gave my audience a collection of tools they could use to tackle these challenges.

EuroSciPy 2015 (Cambridge, England, Summer 2015): I gave a talk on Probabilistic Programming applied to Sports Analytics – the slides are here.

My PyData London tutorial was an extended version of the above talk.

I spoke at PyData Berlin – the link is here.

The blurb for my PyData Berlin talk is below.
Abstract: “Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as to Quantitative Finance.
I’ll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I’ll be applying these methods to the problem of ‘rugby sports analytics’, particularly how to model the winning team in the recent Six Nations rugby tournament. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.”

In May 2015 I gave a preview of my PyData Berlin talk at the Data Science Meetup in Luxembourg on ‘Probabilistic Programming and Rugby Analytics’ – where I presented a case study and an introduction to Bayesian statistics to a technical audience. My case study was the problem of ‘how to predict the winner of the Six Nations’. I used the PyMC library in Python to build up statistical models as part of the Probabilistic Programming paradigm. This was based on my popular blog post, which I later submitted to the acclaimed open source textbook Probabilistic Programming and Bayesian Methods for Hackers. I gave this talk using an IPython notebook, which proved to be a great method for presenting this technical material.
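For readers curious what such a model looks like in code, here is a heavily simplified sketch (not the exact model from the talk) of the hierarchical attack/defence idea in PyMC3; the match arrays are made-up placeholders, and a real model would add details such as sum-to-zero constraints on the team strengths.

```python
# A simplified sketch of a Six-Nations-style points model in PyMC3.
# The data arrays are hypothetical placeholders, not real match results.
import numpy as np
import pymc3 as pm

n_teams = 6
home_team = np.array([0, 1, 2, 3, 4, 5])      # index of the home team per match
away_team = np.array([5, 4, 3, 2, 1, 0])      # index of the away team per match
home_points = np.array([23, 16, 19, 26, 40, 13])
away_points = np.array([13, 21, 9, 16, 10, 29])

with pm.Model() as model:
    # Latent attacking and defensive strength for each team, plus home advantage
    atts = pm.Normal('atts', mu=0.0, sd=1.0, shape=n_teams)
    defs = pm.Normal('defs', mu=0.0, sd=1.0, shape=n_teams)
    home_adv = pm.Normal('home_adv', mu=0.0, sd=1.0)
    intercept = pm.Normal('intercept', mu=3.0, sd=1.0)

    # Expected points for each side of each match
    home_theta = pm.math.exp(intercept + home_adv + atts[home_team] + defs[away_team])
    away_theta = pm.math.exp(intercept + atts[away_team] + defs[home_team])

    pm.Poisson('home_obs', mu=home_theta, observed=home_points)
    pm.Poisson('away_obs', mu=away_theta, observed=away_points)

    trace = pm.sample(2000, tune=1000)
```

Sampling the posterior over the attack and defence strengths is what lets you simulate the tournament and put a probability on each team winning.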

In October 2014 I gave a talk at Impactory in Luxembourg – a co-working space and tech accelerator. This was an introductory talk to a business audience on ‘Data Science and your business’. I talked about my experience at both small and large firms, and the opportunities for Data Science in various industries.

In October 2014 I also gave a talk at the Data Science Meetup in Luxembourg, on ‘Data Science Models in Production’, discussing my work with a small company on developing a mathematical modelling engine that was the backbone of a ‘data product’. The talk went down well, and I gave a version of it at PyCon Italy – held in Florence – in April 2015. The aim was to explain what a ‘data product’ is and to discuss some of the challenges of getting data science models into production code, as well as the tool choices I made in my own case study. It was well received, pitched at a high level, and got a great response from the audience. Edit: those interested can see my video here – it was a really interesting talk to give, and the questions were fascinating.

When I was a freelance consultant in the Benelux (July 2014) I gave a private five-minute talk on Data Science in the games industry – the slides are here.

My mathematical research and talks as a Master’s student are all here. I specialized in Statistics and Concentration of Measure, and it was from this research that I became interested in Machine Learning and Bayesian models.

Thesis

My Master’s thesis, ‘Concentration Inequalities and some applications to Statistical Learning Theory’, is an introduction to Concentration of Measure and VC Theory, which I applied to understanding the generalization error of econometric forecasting models.
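To give a flavour of the kind of result this builds on (a standard textbook statement, not a result from the thesis itself): Hoeffding’s inequality bounds how far an empirical average can stray from its expectation, and uniform versions of such bounds over a hypothesis class, controlled by its VC dimension, are what yield generalization-error guarantees.

```latex
% Hoeffding's inequality: for independent X_1, ..., X_n with X_i \in [a_i, b_i],
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} \left( X_i - \mathbb{E}[X_i] \right) \right| \ge t \right)
\le 2 \exp\left( \frac{-2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
```

Taking each X_i to be the loss of a fixed model on the i-th sample, the bound controls the gap between training error and true error; VC theory extends this uniformly over the whole model class.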

Are RNNs ready to replace journalists?

I was recently experimenting with RNNs in Keras. I used the character-level text-generation example and edited it slightly.
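For context, here is a heavily condensed sketch of that kind of script, following the public Keras character-level LSTM text-generation example rather than my exact edits (the corpus path and epoch count are placeholders); the ‘diversity’ values in the output below are the sampling temperature.

```python
# A condensed sketch in the spirit of the Keras character-level LSTM example.
# 'nietzsche.txt' is assumed to be a local copy of the corpus.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM

text = open('nietzsche.txt').read().lower()
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

maxlen, step = 40, 3
sentences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode fixed-length character windows and the character following each one
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(X, y, batch_size=128, epochs=10)


def sample(preds, temperature=1.0):
    """Sample the next character index, reweighted by the 'diversity' temperature."""
    preds = np.log(np.asarray(preds, dtype='float64')) / temperature
    probs = np.exp(preds) / np.sum(np.exp(preds))
    return np.argmax(np.random.multinomial(1, probs, 1))
```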

This is what I got for Nietzsche – as you can see, the answer to the question in the title is no.

——– diversity: 0.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute and the sense of the superficial for the suffering of the sense of the things of the sayment of the conception of the fact of the suffering and an an and an animation and an art of the subject, the sense of the experience of the souls of the sense of the contrason of the soul” and as a pleasure of the things of the superficially and an anything the suffering of the souls of the senses of th

——– diversity: 0.5
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute that is to find ancient which is comparison that the belief in a soul in his own school of his love, and be a pulses of working to the reciantiating, morality and such a regnisoristic and impatiently
and an animation of the sayment of the actions and proudion of the conscience, the sensible and saint and incensed nowadays something of
the most terest to the superficial and decist of the sen

——– diversity: 1.0
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it able and moral fecth and thus, did alsopisible stinds of what virtuoth experiences–or another which is as still like dne conscience of any men this ethical musiates.

o8i xusted has
among the soul’ yet it is as we
pleasion to ones to you
more courage in the this thus, nexy what is certains by those deming an a myments only
“sight of expsequential time they do all things, that the sensible, for inte

——– diversity: 1.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it abcrude”.

142. can mutly, society, of the long, to beom an
yot. divystess–with theseful, his
poorness of asias and
tactless
life it!–” such one, through pucisomen, just merehonding
hastensce
an
him, old te, the profounded generals, seen fies
everygaing
bale because it
for meardy itsed upon
esprisf. how imvanemed, how he gives to soid of adierch) a pediorice simusreds has slee” in the pri
himse

Why code review? Or why should I care as a data scientist?

The insightful data scientist Trey Causey talks about Software Development Skills for Data Scientists. I’m going to write about my views on code review – as a data scientist with a few years’ experience, including delivering data products at organizations of varying sizes. I’m not perfect, and I’m still maturing as an engineer.

A good, thorough introduction to code review comes from the excellent team at Lyst – I suggest that as follow-up reading!

The fundamental nugget is that ‘code reviews allow you to more effectively collaborate with your peers’, and a lot of new engineers and data scientists don’t know how to do that. This is one reason why I wrote ‘soft skills for data scientists’. This article talks about a technical skill, but I consider it a kind of ‘technical communication’.

Here are some views on ‘why code review’ – I share them here as a reference, largely to remind myself. I steal a lot of these from this video series.

  • Peer-to-peer quality engineering and training

As a Data Science community we are still forming – and since we come from various backgrounds, there’s a lot of invaluable knowledge held by others in the team. Don’t waste your chance at getting that 🙂

  • Catches bugs easily

We all write bugs when we write code, and a reviewer will often spot them before they ship.

  • Keeps team members on the same page

  • Domain knowledge 
    How do we share knowledge about our domain with others without sharing code?
  • Project style and architecture
    I’m a big believer in using structured projects like Cookiecutter Data Science, and I’m sure alternatives exist in other languages. Beforehand I had a messy workflow of hacked-together IPython notebooks and no idea what was what – refactoring code into modules is a good practice for a reason (see the small sketch after this list) 🙂
  • Programming skills
    I learn a lot by reading other people’s code – a lot of the value of being part of an open source project like PyMC3 is that I learn from reading other contributors’ code 🙂
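As a small illustration of the refactoring point above (the module name and columns are invented for the example), this is the kind of unit I now try to pull out of a notebook – a function like this is far easier to review than a long notebook diff:

```python
# features.py -- a small module extracted from an exploratory notebook.
# The dataframe columns here are hypothetical, purely for illustration.
import pandas as pd


def add_session_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-user session features consumed by the model downstream."""
    out = df.copy()
    out['sessions_per_week'] = out['n_sessions'] / out['n_weeks_active']
    out['is_recent_user'] = out['days_since_last_session'] < 7
    return out
```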

Other good practices

  • PEP8 and Pylint (according to team standards)
  • Code review often, but by request of the author only

I think it’s a good idea (Roland Swingler mentioned this to me, I believe) not to obsess too much about style – having a linter do that is better; otherwise code reviews can become overly critical and pedantic. This can stop people sharing code and leads to the kind of criticism that can shake junior engineers in particular – who need psychological safety. As I mature as an engineer and a data scientist I’m aware of this more and more 🙂

Keep code reviews small

  • < 20 minutes, < 100 lines is best
  • Large code reviews make suggestions harder and can lead to bikeshedding

These are my own lessons so far and are based on experience writing code as a Data Scientist – I’d love to hear your views.

3 tips for successful Data Science Projects

Standard

I’ve been doing Data Science projects, delivering software and doing Mathematical modelling for nearly 7 years (if you include grad school).

I really don’t know everything, but these are a few things I’ve learned.

Consider this like a ‘Joel Test’ for Data Science.

  1. Use a reproducible framework like Cookiecutter Data Science. My workflow used to be: use an IPython notebook, forget to name things correctly, and later rediscover messy, badly written code 🙂 I’ve now turned to a project structure like Cookiecutter – this has helped me write better, more maintainable code, and reminded me to document things and make my work reproducible.
  2. Have a spec for a data science project – all projects should start with a spec agreed between the business stakeholder and the project team. This forces people to clarify what they really want. The project should have a goal – and to clarify, I mean a well-defined goal that is Specific, Measurable, Achievable, Realistic and Time-bound (SMART).
  3. Make sure your stakeholders are realistic about the ‘failure’ aspect of R&D. One of the anti-patterns I’ve encountered in Data Science is stakeholders being immature and not realizing that, for example, ‘this Bayesian model doesn’t work for this kind of problem’ isn’t an admission of incompetence but a statement of fact about the world. If organizations can’t accept that, they deserve suboptimal Data Science. R&D work is not engineering – failures teach us something too!

What are your views? I’d love to hear them🙂

Interview with a Data Scientist – Jessica Graves

Jessica Graves is a Data Scientist who currently works on fashion problems in New York City. She’s worked with Hilary Mason at Fast Forward Labs and keeps in regular contact with the London startup scene. After many months of asking her for an interview she finally gave in – and she shares her unique perspective on the datafication of Fashion. She comes from a background in visual and performing arts, as well as fashion design. In her spare time you’ll find her reading a stack of papers or studying dance.

Cover image: unsplash.com CC0

  1. What project that you have worked on do you wish you could go back to and do better?

I worked with Dr. Laurens Mets on an iteration of the technology behind Electrochaea, a device where microbes convert waste electricity to clean natural gas. My job was to translate models from electrochemistry journals into code, to help simulate, measure and optimize the parameters of the device. We needed to facilitate electron transport and keep the microbes happy. Read papers, write code, and design alternative energy technology with math + data?! I would hand my past self How to Design Programs as a guide and learn to re-implement from scratch in an open source language. 

  2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Listen! If you are a data scientist, listen carefully to the business problems of your industry, and see the problems for what they are, rather than putting the technical beauty of and personal interest in the solution first and foremost. You may find it’s more important to you to work with a certain type of problem than it is to work at a certain type of company, or vice versa. Watch very carefully when your team expresses frustration in general – articulate problems that no one knows they should be asking you to solve. At the same time, it can be tempting to work on a solution that has no problem. If you’re most interested in a specific machine learning technique, can you justify its use over another, or will high technical debt be a serious liability? Will a project be leveragable (legally, financially, technically, operationally)? Can you quantify the risk of not doing a project? 

  3. What do you wish you knew earlier about being a data scientist?

I wish I realized that data science is classical realist painting.

Classical realists train to accurately represent a 3D observation as a 2D image. In the strictest cases, you might not be allowed to use color for 1-3 years, working only with a stick of graphite, graduating to charcoal and pencils, eventually monotone paintings. Only after mastering the basics of form, line, value, shade, tone, are you allowed a more impactful weapon, color. With oil painting in particular, it matters immensely in what order at what layer you add which colors, which chemicals compose each color, of which quality pigment, at what thickness, with what ratio of which medium, with which shape of brush, at what angle, after what period of drying. Your primary objective is to continuously correct your mistakes of translating what you observe and suspending your preconception of what an object should look like.

There are many parallels with data science. At no point as a classical realist painter should you say, ‘well it’s a face, so I’m going to draw the same lines as last time’ just like as a data scientist, you should look carefully at the data before applying algorithm x, even if that’s what every blog post Google surfaces to the top of your results says to do in that situation. You have to be really true to what you observe and not what you know – sometimes a hand looks more like a potato than a hand, and obsessing over anatomical details because you know it’s a hand is a mistake. Does it produce desirable results in the domain of problems that you’re in? Are you assuming Gaussian distributions on skewed data? Did you go directly to deep learning when logistic regression would have sufficed? I wish I knew how often data science course offerings are paint by numbers. You won’t get very far once the lines are removed, the data is too big to extract on your laptop, and an out-of-memory error pops up running what you thought was a pretty standard algorithm on the subset you used instead. Let alone that you have to create or harvest the data set in the first place – or sweet talk someone into letting you have access to it.  

In addition, Nulla dies sine linea – it’s true for drawing, ballet, writing. It’s true for data science. No day without a line. It’s very difficult to achieve sophistication without crossing off days and days of working through code or theoretical examples (I think this is why Recurse Center is so special for programmers). Sets of bland but well-executed tiny pieces of software. Unspectacular, careful work in high volumes raises the quality of all subsequent complex works. Bigger, slower projects benefit from myriads of partially explored pathways you already know not to take.

Also side notes to my past self: Linux. RAM. Thunderbolt ports. 

  4. How do you respond when you hear the phrase ‘big data’?

Big data? Like in the cloud? Or are we in the fog now? Honestly the first thing I see in my mind is PETABYTES. I think of petabytes of selfies raining from the sky and flowing into a data lake. Stagnant. Data-efficient AI is all the rage — less data, more primitives, smarter agents. In the meantime, optimizing hardware and code to work with large datasets is pretty fun. Fetishizing the size of the data works well …as long as you don’t care about robustness to diverse inputs. Can your algorithm do well with really niche patterns? What can you do with the bare minimum amount of data? 

  5. What is the most exciting thing about your field?

Fashion is visual. It’s inescapable. Every culture has garb or adornment, however minimal. A few trillion dollars of apparel, textiles, and accessories across the globe. The problems of the industry are very diverse and largely unsolved. A biologist might come to fashion to grow better silk. An AI researcher might turn to deep learning to sift through the massive semi-structured set of apparel images available online. So many problems that may have a tech solution are unsolved. Garment manufacturing is one of the most neglected areas of open source software development. LVMH and Richemont don’t fight over who provided the most sophisticated open-source tools to researchers the way that Amazon and Google do. You can start a deep learning company on a couple grand and use state-of-the-art software tools for cheap or free. You cannot start an apparel manufacturing vertical using state-of-the-art tools without serious investment, because the climate is still extremely unfavorable to support a true ecosystem of small-scale independent designers. The smartest software tools for the most innovative hardware are excessively expensive, closed-source, and barely marketed — or simply not talked about in publicly accessible ways. Sewing has resisted automation for decades, although it is finally now at a place where the joining of fabrics into a seam is robot-automatable, with computer vision used on a thread-by-thread basis to determine the location of the next stitch. 

High end, low end, or somewhere in between, the apparel side of fashion’s output is a physical object that has to be brought to life from scratch, or delivered seamlessly, to a human, who will put the object on their body. Many people participate in apparel by default, but the fashion crowd is largely self-selected and passionate, so it’s exciting (and difficult) to build for such an engaged group that don’t fit standard applications of standard machine learning algorithms.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations, etc.? How do you know what is good enough?

Artists learn this eventually: volume of works produced trumps perfectionism. Even to match something in classical realism, you start with ridiculous abstractions. Cubes and cylinders to approximate heads and arms. Break it down into the smallest possible unit. Listen to Polya, “If you can’t solve a problem, then there is an easier problem you can solve: find it.”

As for when to finish? Nothing is ever good enough. The thing that is implemented is better than the abstract, possibly better thing, for now, and will probably outlive its original intentions. But make sure that solution correlates thoroughly with the problem, as described in the words of the stakeholder. Otherwise, for a consumer-facing product or feature, your users will usually give you clues as to what’s working. 

  7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Be open. Fashion has a lot of space for innovation if you understand and quantify your impact on problems that are actually occurring and costing money or time, and show that you can solve them fast enough. “We built this new thing” has absolutely nothing to do with “We built this useful thing” and certainly not “We built this backwards-compatible thing”. You might be tempted to recommend a “new thing” and then complain that fashion isn’t sophisticated enough or “data” enough for it. As an industry that in some cases has largely ignored data for gut feelings with a serious payoff, I think the attitude should be more of pure respect than of condescension, and of transitioning rather than scrapping. That or build your own fashion thing instead of updating existing ones.  

  8. You have worked in fashion. Can you talk about the biggest opportunities for data in the fashion industry? Are there cultural challenges with datafication in such a ‘creative industry’?

Fashion needs ‘datafication’ that clearly benefits fashion. If you apply off-the-shelf collaborative filtering to fashion items with a fixed seasonal shelf life, for users who never really interact with them, you’re going to get poor results. Algorithms that work badly in other domains might work really well in fashion with a few tweaks. NIPS had an ecommerce workshop last year, and KDD has a fashion-specific workshop this year, which is exciting to see, although I’ll point out that researchers have been trying to solve textile manufacturing problems with neural networks since the 90s.

A fashion creative might very well LOVE artificial intelligence, machine learning, and data science if you tailor your language into what makes their lives easier. Louis Vuitton uses an algorithm to arrange handbag pattern pieces advantageously on a piece of leather (not all surfaces of the leather are appropriate for all pattern pieces of the handbag) and marks the lines with lasers before artisans hand-cut the pieces. The artisans didn’t seem particularly upset about this. 

The two main problems I still see right now are the doorman problem and fit. Use data and software to make it simple for designers of all scales to adjust garments to fit their real markets instead of their imagined muses. And, use as little input as possible to help online shoppers know which existing items will fit. Once they buy, make sure they get their packages on time, securely, discreetly. 

Data Science Delivered – Consulting skills

I recently gave a lightning talk at the PyData London Meetup where I talked about ‘Consulting skills for Data Scientists’.

The slides are here:

https://speakerdeck.com/springcoil/consulting-skills-for-data-scientists 

My thoughts

Some thoughts – these are not just related to ‘consulting skills’ but something more nuanced – general soft skills and business skills – which are essential for those of us working in a commercial environment. I’m still improving these skills, but they’re important to me and I take them seriously. I present some bullet points that are worth further thought – I’ll try to tackle them in more detail in future blog posts.

  • Business skills become more necessary as you gain experience as a data scientist – you are taking part in a commercial environment.
  • All projects involve risk, and this needs to be communicated clearly to clients – whether they’re internal or external.
  • Negotiation is a useful skill to pick up too.
  • Maturing as an engineer involves being able to make estimates, stick to them, and take part in a joint activity with other people.
  • Leadership of technical projects is something I’m exploring lately – a great post is by John Allspaw (current CTO of Etsy). http://www.kitchensoap.com/2012/10/25/on-being-a-senior-engineer/ 
  • My friend John Sandall talked about this at the meetup too. He talked more about ‘soft skills’ and has some links to some books etc.
  • Learning to write and communicate is incredibly valuable. I recommend the Pyramid Principle as a book for this.
  • For product delivery and de-risking projects, I recommend the book ‘The Lean Startup‘ – it can be really good regardless of the organization you’re in.
  • Modesty forbids me to recommend my own book but it has some conversations with data scientists about communication, delivery, and adding value throughout the data science process.
  • Editing and presenting results is really important in Data Science. In one project, I simplified a lot of complex modelling down to a single if-statement, by focusing on the business deliverables and the most important results of the analysis. Getting an if-statement into production is trivial – a random forest model is a lot more complicated (a toy illustration follows below). John Foreman has written about this too.
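As a toy illustration of that last point (the feature and threshold are invented for the example), the production artefact from that project looked more like this than like a serialized model:

```python
# Hypothetical example: the value of a complex model captured as one decision rule.
def should_contact_customer(days_since_last_order: int) -> bool:
    # Threshold chosen from the offline analysis; the number here is made up.
    return days_since_last_order > 60
```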

In short, we’re a new discipline, but we have a lot to learn from other consulting and engineering disciplines. Data science may be new – but people aren’t 🙂

 

Data Science Delivered

Quick note

  • At the excellent PyData London conference, there was a lot of food for thought.
  • One thing that came up was the concept of ‘data strategy’ – there’s a lot of discussion about how to align, write, or explain how data can help drive business transformation and be part of a business strategy.
  • A similar concept was product delivery, product management and the challenges of delivering value from data.

I bring this up because although it was a technical conference, there was a lot of discussion about non-technical aspects.

I’m sure at some point there’s a track or a training to be done on this sort of thing🙂

 

Interview with a Data Scientist: Phillip Higgins

Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, among other roles, including some in Germany.

What project that you have worked on do you wish you could go back to and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area and, at the same time, to broaden their knowledge, maintaining this focus on learning both specialised and general subjects throughout their careers. Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?

Undoubtedly I wish I knew the importance of communication skills in the whole analytics life-cycle. It’s particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of their work. I don’t think it’s a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time that Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think it’s important to never lose sight of the business objectives that are the rationale for most data-science projects. Although it is essential that businesses allow for data science to disprove hypotheses, at the end of the day most evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses lies mostly in exploratory data analysis (combined with domain knowledge). It is important to communicate this uncertainty around framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job. I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.