Talks and Workshops

I enjoy giving talks and workshops on Data Analytics. Here is a list of some of the talks I've given. During my Master's in Mathematics I regularly gave talks on technical topics, and before that I worked as a tutor and technician at a school in Northern Ireland. I consider the evangelism of data and analytics to be an important part of my job as a professional data scientist!

Upcoming

None planned so far

Slides and Videos from Past Events

Keynote at PyCon Colombia, February 2017. The slides are here.

I gave a tutorial called 'Lies, damned lies and statistics' at PyData London 2016, discussing different statistical and machine learning approaches to the same kinds of problems.

Some hacking of lifelines

I've recently been looking at survival analysis at work. It's an incredibly powerful technique; however, one challenge I had was getting confidence intervals (or something similar to confidence intervals) to work with the Aalen additive model.

So, largely to help my future self remember this, I share the code here 🙂
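Here's a minimal sketch of the bootstrap approach: refit the model on resampled data and take pointwise quantiles of the predicted curves. It assumes lifelines' AalenAdditiveFitter and its bundled toy regression dataset; the bootstrap_survival helper and the coef_penalizer value are just for illustration, not lifelines API.

```python
# A minimal bootstrap sketch: refit on resampled data and take pointwise
# quantiles of the predicted survival curves. bootstrap_survival is an
# illustrative helper, not part of lifelines.
import numpy as np
import pandas as pd
from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_regression_dataset

df = load_regression_dataset()  # columns: var1..var3, T (duration), E (event)

def bootstrap_survival(data, subject, n_boot=50, timeline=None):
    """Collect one predicted survival curve per bootstrap refit."""
    curves = []
    for _ in range(n_boot):
        sample = data.sample(n=len(data), replace=True)
        aaf = AalenAdditiveFitter(coef_penalizer=1.0)
        aaf.fit(sample, duration_col="T", event_col="E")
        curve = aaf.predict_survival_function(subject)
        # Align each curve on a common time grid; S(t) = 1 before the
        # first observed event time, hence the fillna(1.0).
        curves.append(curve.reindex(timeline, method="ffill").fillna(1.0))
    return pd.concat(curves, axis=1)

subject = df.drop(columns=["T", "E"]).iloc[[0]]
timeline = np.linspace(0, df["T"].max(), 50)
curves = bootstrap_survival(df, subject, timeline=timeline)

# Pointwise 95% bands from the bootstrap distribution of curves.
lower = curves.quantile(0.025, axis=1)
upper = curves.quantile(0.975, axis=1)
```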

It should also work with Cox PH and other models.

Avoiding the ML hipster trap

Machine Learning hipster effect

Machine Learning is very much in vogue at the moment. Some junior data scientists and engineers feel pressure to do ML just to be a cool hipster, or to fall into what a friend of mine calls 'the ML hipster trap'.

What is the ML hipster trap?

It's simply using ML where you don't need it, or obsessing over model performance without realising that your job is to add value to a business or a governmental organisation. A great discussion of when Machine Learning matters comes from Erik here.

Tip: Ask yourself: is there a simpler solution than ML? Does ML justify the investment for this project? Do you have a product without your ML?

Have you exhausted all the low-hanging fruit of analytics?

In a project a few years ago, I simplified what originally looked like an ML problem to a SQL query with some basic counting. It was 10x more reliable, I could hand it over easily, and it integrated well with the reporting pipeline we had. It's worth asking first whether you have a counting problem, and how to invent and simplify.

Tip: Don't be scared to reach for a SQL query as a solution. If you add value, you're doing well.
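For flavour, here's a hypothetical sketch of the kind of counting query I mean, run from Python. The events table, its columns, and the toy rows are all invented for illustration; in practice you'd point this at your warehouse.

```python
# A hypothetical counting query of the sort that can replace an ML model.
# The events table and its contents are invented for illustration.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for your warehouse connection
conn.execute("CREATE TABLE events (customer_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, DATE('now'))", [(1,), (1,), (2,)]
)

counts = pd.read_sql(
    """
    SELECT customer_id, COUNT(*) AS n_events
    FROM events
    WHERE event_date >= DATE('now', '-30 days')
    GROUP BY customer_id
    ORDER BY n_events DESC
    """,
    conn,
)
print(counts)
```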

Data Science is more than just supervised learning: have you considered other approaches?

Currently at work I'm using survival analysis; in my career so far I've built models with ordinary differential equations (some would say this isn't Data Science; I think they're being misled by ML hipsters), genetic algorithms, and many other kinds of statistical and algorithmic approaches. I don't feel this makes me a worse Data Scientist just because I'm not using Deep Learning.

Tip: Use the right tool for the job, and don’t feel ashamed if you’re not using Deep Learning.

In short: Don't let the current AI hype make you use the wrong tool for the job. A key part of growing as a Data Scientist or Engineer is appreciating that it's about adding value to those around you, not just using the fanciest toys.

Three pitfalls for non-technical managers managing Data Science teams

This is a fairly opinionated post. It doesn't represent the views of anyone other than myself.

I recently came across 'Pitfalls of a non-technical manager', and it reminded me of some of the things I was talking about in 'Trophy Data Scientist'.

I recommend the post above, and I'll give my take on it for non-technical managers managing or leading Data Science teams. There's considerable overlap, since a lot of modern-day Data Science work is software focused.

Firstly, why is this a problem for your company or organisation? Anecdotally, in the world of Data Science there's a feeling that projects and teams aren't delivering the value that was expected. Some of this is a function of hype, some of it a function of a fast-changing technology ecosystem, but as I've experienced first-hand, one of the problems is poor leadership or poor management.

Like Deepak in the software article, I think there's a considerable communication disconnect between technical specialists and management on Machine Learning projects. I feel this advice extends to data science leaders (I am a Senior Data Scientist), scrum masters, and product managers. By no means do I want to discourage people from various backgrounds from managing Data Science teams, but I'd like to explain why it's harder without levelling up your technical skills.

As Deepak says:

Firstly, it benefits you – the non-technical manager. There are two important benefits you get if you are a technically aware manager.

Better management skills – You will always know what you are doing.

Better communication – You will always know what you are talking about.

As one anecdote, I was once in a room with non-technical managers, brainstorming for a project. Their responsibility was to 'represent' or 'translate' requirements from the business. However, they were frankly poor at this, because they didn't understand what was possible: what various classes of algorithms could do, the interpretability of algorithms (would a decision tree work well, or do you need high accuracy/precision/recall?), or the complexity of connecting to various data sources. It was an incredibly frustrating and not very fruitful conversation, especially because when I discussed the technical aspects they were unable to tell whether I was 'bullshitting' or not.

Unfortunately, that environment tended to reward the 'talkers' rather than the builders, and projects were often poorly managed. I even heard one manager say something like 'we need to move past machine learning into deep learning', which is absolute nonsense: Deep Learning is a subfield of Machine Learning 🙂

So here are three things that I think non-technical managers get wrong when managing a technical project.

Believing process will fix everything.

It's definitely been my experience that non-technical managers, faced with a poorly performing team (or misunderstanding its performance), will implement 'process'. Unfortunately this is a good way to signal a lack of trust in and appreciation for your team, and the added overhead often leads to meetings, planning poker, and various other things that don't work well for R&D projects, since R&D projects are fundamentally creative and non-linear. I've been building machine learning pipelines and doing machine learning and technology work for a number of years now, and I'm still astonished at how complicated building a product can be.

Process changes are the right solution all the time only for manual labour work, not for creative work. – Deepak Karanth

Not understanding the nuance of the work

Machine Learning is complicated. Statistics is complicated. There are all sorts of problems you can run into: have you violated a linearity assumption in your model, have you correctly implemented cross-validation on a time series, have you run into Simpson's paradox, is there selection bias? If you're doing Survival Analysis, are you violating the proportional hazards assumption? You need to understand this complexity. R&D is hard!
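To make the time-series cross-validation point concrete, here's a small sketch using scikit-learn's TimeSeriesSplit, with a toy stand-in for real time-ordered data. Unlike shuffled K-fold, each training window strictly precedes its test window, so the model never trains on the future.

```python
# A small illustration of cross-validation on time-ordered data;
# the data here is a toy stand-in.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # features, assumed ordered by time
y = np.sin(np.arange(100) / 10.0)   # toy target

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every training index precedes every test index, so no future
    # observation leaks into the training window.
    assert train_idx.max() < test_idx.min()
```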

And as my friend Martin Goodson put it to me:

This is not a Data Scientist or Data Science leader!

Job description: The director/manager/VP of BI has primary responsibility for setting the strategy and vision and for managing the day-to-day tactical operations of the BI teams. He/she will be responsible for all strategic, tactical, operational, financial, human, and technical resource managerial responsibilities associated with the following BI and BI-related functional areas:

  • Data preparation (sourcing, acquisition, integration)
  • Data warehousing (Forrester often recommend that the first two functional areas are managed separately by data management / data preparation team(s))
  • BI governance (may be same or separate from Data Governance)
  • Reporting, analytics, data exploration

This is a Chief Data Scientist or Head of Data Science job:

We are looking for someone with:

  • An advanced Degree (Master’s or PhD) in Computer Science, Statistics, Engineering, Mathematics, Physics, or a related quantitative field
  • Experience working in the field of machine learning and data science
  • Proven track record of working with large data sets to develop innovative data products and capabilities and extract actionable insights
  • Expert knowledge of statistical modelling methods for supervised and unsupervised learning
  • Commercial experience working with either Python or R (we use Python)
  • Knowledge of databases and related languages/tools such as SQL and NoSQL
  • Experience with cloud computing platforms (AWS is desirable)
  • Strong knowledge of the mathematical foundations of statistical inference and forecasting such as time series analysis, multivariate analysis, cluster analysis, and optimization
  • Ability to lead and manage a team of junior Data Scientists
  • Effective communication skills and ability to explain complex data products in simple terms

That's it, end of. Companies, please stop mixing up these two types of people: they're two different jobs, both have value, and both are valuable at different stages of your company's evolution.

Not understanding the messiness of data sources

One of the biggest things that irritates me as a Data Scientist is the pervasive 'magic, quickly' attitude. Some things are complicated, and often it's a function of the data. One friend, a good data science consultant, says that if no one has looked at the data before, add six months to the project. That's often because the 'data exhaust' is a sludge, and for specific projects you need to understand the context of the problem you're trying to solve.

Neil Lawrence of Amazon wrote a good framework for thinking about this; it offers a good set of principles to consider when you look at the data readiness of a given area.

In Summary

Unless you have hands-on experience building data products and extracting value from data, you run the risk of missing a ton of crucial nuance and detail. I'm not discounting the value that a data strategist can bring to the table, but there are trade-offs to lacking the nuance of hands-on experience with data. If you're prepared to learn as you go along, this might merely slow the project, but at worst it can derail it.

What’s next

In a future post I'll write about how to avoid these pitfalls, and what you as a non-technical manager can do to better manage the developers and Data Scientists you interact with and are charged with managing.

What I learned and accomplished in 2017

This is a slightly self-involved post. The only point is to document for myself what I did in 2017, and to force myself to blog a bit more.

In no particular order:

  • I proposed to my long-term partner, and since then I've learned a lot about the various decisions one has to make when planning a wedding; it's really a funky non-linear optimization problem.
  • I thought and wrote a lot more about Data Science and the barriers to implementing it in companies, both as part of shipping data products at my last job and through advisory work for various companies. I blogged some of my thoughts on this.
  • I helped ship some releases of PyMC3. I'd love to spend more time writing OSS code for it and be more involved, but I've just not had the time. This year I wrote some benchmarks, updated docs, and reviewed others' pull requests.
  • I keynoted at the first ever PyCon Colombia in the beautiful country of Colombia, where I talked about shipping data products.
  • I delved deeper into Variational Inference this year, and produced a talk on it at PyData London 2017
  • I did an introduction to Bayesian Analysis at ODSC London, based on internal talks I've given at Zopa.
  • I learned a ton about NLP, and deploying ML models.
  • I’ve recently been learning about Fintech and delving deeper in Survival Analysis.
  • I’ve been trying harder to implement a rigorous statistical workflow
  • I published posts on two companies' blogs (I moved jobs this year), which I'm quite proud of: Elevate and Zopa.
  • I've been delving a lot more into Machine Learning systems: things like interpretability, reliability, monitoring, and auditability. I think there are still opportunities for learning there.
  • I learned a lot more about testing and data pipelines. In particular I learned a lot more about Hypothesis and pytest, added them to my workflow, and taught others about them (a small example of the style follows this list).
  • I’ve developed deeper relationships, and learned a lot more about maintaining the ones you have.
  • Joining a gym at work has helped a lot – and don’t mock cardio.
  • I made a small commit in Hypothesis – I’m quite proud of this 🙂
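As promised above, here's a tiny example of the pytest-plus-Hypothesis style of property-based testing; the property itself is just an illustration. Hypothesis generates the inputs, pytest collects and runs the test.

```python
# A tiny property-based test: Hypothesis generates lists of integers,
# and the assertion must hold for every generated input.
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # A simple property: sorting twice is the same as sorting once.
    assert sorted(sorted(xs)) == sorted(xs)
```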

Why Zalando’s tech radar sucks as a stack

I've been programming professionally for about five years now, so I don't consider myself a brilliant expert.

Nevertheless, one thing I think I've learned is that it isn't about the 'technical problems'; it's about whether or not you solve a real-world, meaningful problem.

Some people call these 'business problems', but they apply equally in the academic and not-for-profit worlds.

Another thing I've learned is that choosing new technology has costs; there are always tradeoffs.

Some of my thinking on this is informed by 'Choose Boring Technology'.

It's good to discuss the social and operational costs of a new technology. I've learned, for instance, that although Kubernetes can be a great tool, there is a non-trivial cost to training developers in it, and we as technology leaders should be careful when we bring these things in.

I think that if you don't carefully manage technological choices you can end up with something like Zalando's tech radar, and while it's good to have that as a radar, you don't want it to be your entire stack.

How do we deliver Data Science in the Enterprise?

I’ve worked on Data Science projects and delivered Machine Learning models both in production code and more research type work at a few companies now. Some of these companies were around the Seed stage/ Series A stage and some are established companies listed on stock exchanges. The aim of this article is to simply share what I’ve learned — I don’t think I know everything. I think my audience consists of both managers and technical specialists who’ve just started working in the corporate world — perhaps after some years in Academia or in a Startup. My aim is to simply articulate some of the problems, and propose some solutions — and highlight the importance of culture in enabling data science.

I've been reflecting over the years, as a practitioner, on why some of this 'big data' stuff is hard to do. The take I'll present in this article is similar to some other commentary on the internet, so it won't be unusual.

My views are inspired by http://mattturck.com/2016/02/01/big-data-landscape/, in which Matt says:

Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes. You need to capture data, store data, clean data, query data, analyse data, visualise data. Some of this will be done by products, and some of it will be done by humans. Everything needs to be integrated seamlessly. Ultimately, for all of this to work, the entire company, starting from senior management, needs to commit to building a data-driven culture, where Big Data is not “a” thing, but “the” thing.

Often, when speaking about our nascent profession with friends working in other companies, we talk about 'change management'. Change is very hard, particularly for established, non-digital-native companies: companies that don't produce e-commerce websites, social networks, or search engines. These companies often have legacy infrastructure and don't necessarily have technical product managers or technical cultures. Also, traditional Business Intelligence systems work quite well for them: reporting is done correctly, and it's hard to make a case for machine learning in risk-averse environments like that.

One weird tip to improve the success of Data Science projects

I was recently speaking to some data science friends on Slack, and we were discussing projects and war stories. Something that came up was that 'data science' projects aren't always successful.

Somewhere in this discussion a lightbulb went off in my head about some of the problems we have when embarking on data science projects. There's a certain amount of cargo-cult Data Science about, and so collectively we as a community of business people, technologists, and executives don't think deeply enough about the risks and opportunities of projects.

So I had my lightbulb moment and now I share it with everyone.

The one weird trick is to write down risks before embarking on a project.

Here are some questions you should ask before you start a project, preferably with all stakeholders gathered together.

  • What happens if we don't do this project? What is the worst-case scenario?
  • What legal, ethical, or reputational risks are involved if we successfully deliver results with this project?
  • What engineering risks are there in the project? Is it possible this could turn into a two-year engineering project as opposed to a quick win?
  • What data risks are there? What kinds of data do we have, and what are we not sure we have? What risks are there in terms of privacy, legality, and ethics?

I've found that gathering stakeholders together helps a lot with this: you hear different perspectives, and it can help you figure out what the key risks in your project are. I've found, for instance, that 'lack of data' has killed certain projects in the past. It's good to clarify that before you spend three months on a project.

Try this out and let me know how it works for you! Share your stories with me at myfullname[at]google[dot]com.