Interview with a Data Scientist: Mick Cooney


I’m delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in Finance and, more recently, in Insurance. He co-ran the very successful Dublin R meetup, which helped foster a data science community in Dublin, and lately he has been working in London at an Actuarial Consultancy – building out a data science practice.

q1. What project that you have worked on do you wish you could go
back to and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the
forecasts, but it does so by hand-creating LaTeX files that are then
compiled into PDF. One of the first things I would do is switch all
of that over to either ‘knitr’ or ‘rmarkdown’. I would also use more
‘reproducible research’ concepts.

That said, I had worked on the modelling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modelling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem that I expect I will be
working on it again in the near future. Hopefully I can apply those
lessons then.
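For readers curious what survival-analysis tooling for a
persistency/churn problem can look like, here is a minimal
Kaplan-Meier sketch in Python using the lifelines package. The data
and column names are invented for illustration; this is not the
model discussed above.

import pandas as pd
from lifelines import KaplanMeierFitter

# Toy data: observed tenure per customer, and whether they lapsed
# (1 = churned, 0 = still active, i.e. censored).
df = pd.DataFrame({
    "tenure_months": [3, 14, 25, 7, 31, 12],
    "lapsed":        [1,  1,  0, 1,  0,  1],
})

kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure_months"], event_observed=df["lapsed"])

# Estimated probability a customer is still active at each tenure.
print(kmf.survival_function_)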

***
q2. What advice do you have for younger analytics professionals, and
in particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career, not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to
Meetups, reading about new techniques, watching videos on YouTube and
looking to strengthen areas where you are weak. This is why a natural
interest and curiosity is so invaluable – it makes these necessary
tasks much less of a burden as they are things you would want to do
anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes, and why I
reread the books and re-watch the lectures I find useful. I want
these things to be second nature, and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

There are two main things I wish I had learned early in my career,
and both are connected philosophically. First, I wish I had learned
about probabilistic thinking, risk management, economics and
statistics – you can never learn enough about these fundamental
topics. Second, I wish I had learned that it is okay to start working
with a bad model that you know is wrong but simple.

To that first point, I spent a long time fighting my natural desire
for a clean, elegant and correct answer to a problem. I would work on
a problem, get to a point where I was confident it pointed us in the
right direction, but then realise that ‘proving’ it was right would
involve a huge amount of time and effort, assuming it was possible at
all.

I attributed my natural reluctance to pursue this ‘answer’ to
laziness, and felt guilty. I felt I was being unprofessional and
sloppy. But working on forecasting models for trading taught me that
this was not the case. Models are so imperfect, with so many
compromises, that it is often better to think about other things
first – what are the limitations of the model in practice, what is it
saying, how are you going to use it. Answer those questions first,
THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.
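As a minimal sketch of this ‘start stupid’ approach in Python with
scikit-learn – the dataset and models here are purely illustrative –
fit a majority-class baseline first, and only then judge whether a
fancier model earns its keep:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic stand-in for whatever problem you are actually facing.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# If the real model barely beats the dumb one, it is worth knowing early.
print("baseline accuracy:", baseline.score(X_te, y_te))
print("logistic accuracy:", model.score(X_te, y_te))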

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size –
in-memory, on-disk and, finally, the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriately sampling your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.
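A minimal sketch of that sample-first workflow in Python with pandas
– the file name, chunk size and sampling fraction are all
hypothetical:

import pandas as pd

# Read in chunks so the full file never has to fit in memory at once,
# keeping a small random sample from each chunk.
pieces = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000):
    pieces.append(chunk.sample(frac=0.01, random_state=42))

subset = pd.concat(pieces, ignore_index=True)
print(subset.describe())  # analyse the manageable sample instead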

Once you have decided on a solution, putting the model into
production and scaling it for your business is a major issue, but it
is a problem belonging more to the realm of network and software
engineering. That said, it is important that people with a solid
understanding of the concepts stay involved, just in case some
‘optimisations’ ruin the output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in ‘The Fog of War’ mentioned that you should never
answer the question asked but instead answer the question you wanted
to be asked, so with your forbearance I will first answer a liberal
interpretation of that question: what work gets me excited?

The short answer is that all sorts of things do, but they are often
small things related to work I am doing. In the last few months, I
was excited to try out dataexpks (a data exploration package I am
co-creating) on a brand new data set to see what it showed me and how
well my code worked. I love thinking of ways to use Monte Carlo
simulation to test the output of various regression models, and over
Christmas I was fascinated by a short project trying out methods of
investigating differences between a subpopulation and the larger
population that contains it.

I am fascinated by new ways to learn the fundamentals – there are a
few excellent resources out there and I read them all the time. I can
never learn enough, as in my experience reality tends to present us
with basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended it, I think
the advances in reinforcement learning techniques probably have the
biggest potential – some of the Atari game-playing from DeepMind was
eye-opening. Sadly, if history is any guide, much of it will prove to
be hype, but I expect some very interesting results to come from the
work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what
I do or how to articulate it. I have had the good fortune to help a
lot of people with their projects, which has exposed me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs and articles and subscribe to mailing
lists. I rarely have the time to read all of it, but often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the
problem: what is being asked? Do we have any data? What does it look
like? Are there other data available that we can use to enrich ours,
or use as a substitute?

Going through that process will suggest approaches to use, and at
that point I draw upon previous experience, however tangential to the
problem.

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making there: I think there is a lot of
implicit knowledge in this field, and people looking for help have
told me a number of times that they feel overwhelmed by the sheer
amount of knowledge they think they need.

I do not think this is true, but I understand its origin: there are
so many different aspects to working with data that it is tough to
know where to start. I always start very simple, but as I mentioned
earlier, it took a lot of time, thought and effort to get to that
point, and it is not easy to explain these ideas in theory – you have
to work on a number of different datasets to get a feel for how to do
this.

As a result, I believe approaches such as mentoring or
apprenticeships are an effective way to teach people – more
experienced analysts can guide junior members around the various
pitfalls and traps that are easy to fall into. They allow us to
illustrate that fancy and sophisticated techniques and algorithms are
not needed to do interesting work – some of the most interesting work
I have seen involved little more than summary statistics along with
basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest
book I have read that talks about this is “Data Analysis Using
Regression and Multilevel/Hierarchical Models” by Gelman and Hill,
which stresses the importance of starting from simple models. I would
love to know if there are more.

That said, I could only appreciate the point because I was already
experienced – a younger version of myself would have missed it. It
would not have occurred to me that the right way to do something is
to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.

Building Full-Stack Vertical Data Products


I’ve been in the Data Science space for a number of years now. I first got interested in AI/Machine Learning in 2009, and I have a background typical of a number of people in my field – I come from Physics and Mathematics.

One trend I’ve run into both at Corporates and Startups is that there are many challenges to deploying Data Science in a bureaucratic organisation – or delivering Enterprise Intelligence. Running into this problem led me to be interested in building data products.

One of the first people I saw building AI startups was Bradford Cross – and he’s been writing lately about his predictions for 2017 in the Machine Learning startup space.

I agree with his thesis that we’ll begin to see successful vertically-oriented AI startups solving full-stack industry problems that require subject matter expertise, unique data, and a product that uses AI to deliver its core value proposition.

At Elevate Direct we’re working on exactly this – the problem of sourcing and hiring contractors, one of the fundamental problems that companies have: hiring the best contractor talent out there.

So what are some of the reasons that it can be hard to deploy Data Science internally at a corporate organisation? I think a number of these patterns are related to patterns we see in software more generally.

  1. Not being capable of building consumer-facing software – Large (non-tech) organisations sometimes struggle to build and deliver software internally – I’ve seen a number of organisations fail to do this – their build process can take 6 months.
  2. Organisational anti-patterns – I’ve seen some organisations whose structures rapidly inhibit the ability to deploy product. Some of these anti-patterns are driven by concerns about the risk of deploying software, and they often end up with diffuse ownership – where an R and D team can blame the operations team and vice versa.
  3. Building Data Products is risky – Building data products is hard and risky – I think you really need to approach data products in a lean-startup kind of way. Deploy often; if it works it works, if not cut it. Sometimes the middle management of large corporates is risk-averse and so finds these kinds of projects scary. It also needs a lot of expertise – subject-matter expertise, software expertise, machine learning expertise.
  4. Not allowing talented technical practitioners to use Open Source or pick their tools – I once worked at a FTSE 100 company where it took me about 6 weeks to be able to install Open Source software tools such as R and Python. It severely restricted my productivity: in that time at a startup my team probably deployed about 1000 changes into production on a customer-facing app. This reminds me of number 3 here. Don’t restrict the ability of your talented and well-trained people to deliver value. It makes no sense from a business point of view. Data Science produces value only when it produces products or insights for the business or the customers.
  5. Not having a Data Strategy – Data Science is most valuable when it aligns with the business strategy. Too often I’ve seen companies hiring data scientists before they have actual problems for them to work on. I’ve written about this before.
  6. Long-term outsourcing deals – This is an insidious one, and one that came from a period when “IT didn’t matter”, before big Tech companies proved the value of, for example, e-commerce in the consumer space. It’s impossible to predict what will be the key tech for the next 10 years, so don’t lock yourself to a vendor for that period of time. Luckily this trend is reversing – we’re seeing the rise of agile, MVP, cloud computing, design thinking, and getting closer to the customer. A great article on this re-shoring is here.

I think fundamentally a lot of these anti-patterns come from not knowing how to handle risk correctly. I like the idea in that RedMonk article that big outsourcing is a bit like CDOs in finance. Bundling the risk into one big lump doesn’t make the risk go away.

I learn this day after day working on building data products and tools at Elevate. Being honest about the risks, and working hard to de-risk projects in an agile way, is the best we can do.

Finally, I think we’re just getting started building Data Products and deploying data science. It’ll be interesting to see what other anti-patterns emerge as we grow up as an industry. This is also one of the reasons I’ve joined a startup, and why I’m very excited to work on an end-to-end Data Product that solves a real business problem.

What happens when you import modules in Python


I’ve been using Python for a number of years now – but like most things I didn’t really understand this until I investigated it.

Firstly, let’s introduce what a module is: it is one of Python’s main abstraction layers, and probably the most natural one.

Abstraction layers allow a programmer to separate code into
parts that hold related data and functionality.

In Python you use ‘import’ statements to use modules.

Importing modules

The statement

import modu

will look for the definition of modu in a file called `modu.py` in
the same directory as the caller, if a file with that name exists.

If it is not found, the Python interpreter will search for `modu.py` in Python’s search path.

Python’s search path can be inspected really easily:

>>> import sys
>>> sys.path

Here is mine for a conda env.

['', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pymc3-3.0rc1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpydoc-0.6.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/nbsphinx-0.2.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Sphinx-1.5a1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/recommonmark-0.4.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/CommonMark-0.5.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/tqdm-4.8.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/joblib-0.10.3.dev0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Theano-0.8.2-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpy-1.11.2rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/imagesize-0.7.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/alabaster-0.7.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Babel-2.3.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/snowballstemmer-1.2.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python35.zip', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/plat-darwin', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/lib-dynload', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg']

What is a namespace?

We say that the module’s variables, functions, and classes are available
to the caller through the module’s `namespace`, a central concept in
programming that is particularly helpful and powerful in Python.
Namespaces provide a scope containing named attributes that are visible
to each other but not directly accessible outside of the namespace.
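To make this concrete, here is a tiny example – the module name and
its contents are hypothetical:

# math_utils.py -- a made-up module sitting next to the calling script
PI = 3.14159

def circle_area(radius):
    return PI * radius * radius

# caller.py -- importing binds the module object to the name
# 'math_utils'; its contents are reached through that namespace.
import math_utils

print(math_utils.PI)              # 3.14159
print(math_utils.circle_area(2))  # 12.56636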

So there you have it: that is what happens when you import a module, and what a namespace is.

This is based on The Hitchhiker’s Guide to Python, which is well worth a read 🙂

Are RNNs ready to replace journalists?


I was recently experimenting with RNNs in Keras. I used the example and edited it slightly.

This is what I got for Nietzsche – as you can see, the answer to the question above is No.
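For context, the ‘diversity’ figures below are the sampling
temperature used when drawing each character from the network’s
softmax output: higher values give more adventurous (and more
garbled) text. A sketch of that sampling helper, close to the one in
the Keras example (variable names assumed):

import numpy as np

def sample(preds, diversity=1.0):
    # Reweight the softmax output by a temperature ('diversity'),
    # then draw a single character index from the new distribution.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / diversity
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))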

——– diversity: 0.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute and the sense of the superficial for the suffering of the sense of the things of the sayment of the conception of the fact of the suffering and an an and an animation and an art of the subject, the sense of the experience of the souls of the sense of the contrason of the soul” and as a pleasure of the things of the superficially and an anything the suffering of the souls of the senses of th

——– diversity: 0.5
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it absolute that is to find ancient which is comparison that the belief in a soul in his own school of his love, and be a pulses of working to the reciantiating, morality and such a regnisoristic and impatiently
and an animation of the sayment of the actions and proudion of the conscience, the sensible and saint and incensed nowadays something of
the most terest to the superficial and decist of the sen

——– diversity: 1.0
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it able and moral fecth and thus, did alsopisible stinds of what virtuoth experiences–or another which is as still like dne conscience of any men this ethical musiates.

o8i xusted has
among the soul’ yet it is as we
pleasion to ones to you
more courage in the this thus, nexy what is certains by those deming an a myments only
“sight of expsequential time they do all things, that the sensible, for inte

——– diversity: 1.2
——- Generating with seed: “iginal text, homo natura; to bring it ab”
iginal text, homo natura; to bring it abcrude”.

142. can mutly, society, of the long, to beom an
yot. divystess–with theseful, his
poorness of asias and
tactless
life it!–” such one, through pucisomen, just merehonding
hastensce
an
him, old te, the profounded generals, seen fies
everygaing
bale because it
for meardy itsed upon
esprisf. how imvanemed, how he gives to soid of adierch) a pediorice simusreds has slee” in the pri
himse

Why Code Review? Or why should I care as a data scientist?


The insightful Data Scientist Trey Causey talks about Software Development Skills for Data Scientists. I’m going to write about my views on Code Review, as a Data Scientist with a few years’ experience and experience delivering Data Products at organizations of varying sizes. I’m not perfect and I’m still maturing as an Engineer.

A good, thorough introduction to Code Review comes from the excellent team at Lyst – I suggest that as follow-up reading!

The fundamental nugget is that ‘code reviews allow you to more effectively collaborate with your peers‘, and a lot of new Engineers and Data Scientists don’t know how to do that. This is one reason why I wrote ‘soft skills for data scientists‘. This article talks about a technical skill, but I consider it a kind of ‘technical communication’.

Here are some views on ‘why code review’ – I share them here as reference, largely to remind myself. I steal a lot of these from this video series.

  • Peer to peer quality engineering and training 

As a Data Science community that is still forming – with us coming from various backgrounds – there’s a lot of invaluable knowledge to be had from others in the team. Don’t waste your chance at getting it 🙂

  • Catches bugs easily

There are many bugs that we all write when we write code.

  • Keeps team members on the same page

  • Domain knowledge 
    How do we share knowledge about our domain to others without sharing code?
  • Project style and architecture
    I’m a big believer in using structured projects like Cookiecutter Data Science, and I’m sure alternatives exist in other languages. Beforehand I had a messy workflow – hacked-together IPython notebooks and no idea what was what. Refactoring code into modules is a good practice for a reason 🙂
  • Programming skills
    I learn a lot myself by reading other people’s code – a lot of the value of being part of an open source project like PyMC3 is that I learn from reading other people’s code 🙂

Other good practices

  • PEP8 and Pylint (according to team standards)
  • Code review often, but by request of the author only

I think it’s a good idea (I think Roland Swingler mentioned this to me) not to obsess too much about style – having a linter do that is better; otherwise code reviews can become overly critical and pedantic. This can stop people sharing code, and it leads to criticism that can shake Junior Engineers in particular – who need psychological safety. As I mature as an Engineer and a Data Scientist I’m aware of this more and more 🙂

Keep code small

  • < 20 minutes, < 100 lines is best
  • Large code reviews make suggestions harder and can lead to bikeshedding

These are my own lessons so far and are based on experience writing code as a Data Scientist – I’d love to hear your views.

3 tips for successful Data Science Projects


I’ve been doing Data Science projects, delivering software and doing Mathematical modelling for nearly 7 years (if you include grad school).

I really don’t know everything, but these are a few things I’ve learned.

Consider this like a ‘Joel Test‘ for Data Science.

  1. Use a reproducible framework like Cookiecutter Data Science. My workflow used to be: use an IPython notebook, forget to name things correctly, and later discover messy, badly written code 🙂 I’ve now turned to a project structure like Cookiecutter’s (see the abridged layout after this list) – this has helped me write better, more maintainable code and reminded me to document things and make my work reproducible.
  2. Have a spec for a data science project – all projects should start with an agreed spec between the business stakeholder and the project team. This forces people to clarify what they really want. The project should have a goal – and to clarify, I mean a well-defined goal that is Specific, Measurable, Achievable, Realistic and Time-bounded: SMART.
  3. Make sure your stakeholders are realistic about the ‘failure’ aspect of R and D. One of the anti-patterns I’ve encountered in Data Science is stakeholders being immature and not realizing that, for example, ‘this Bayesian model doesn’t work for this kind of problem’ isn’t a statement of incompetence but a statement of fact about the world. If organizations can’t accept that, they deserve suboptimal Data Science. R and D work is not engineering – failures teach us something too!
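For reference, here is an abridged sketch of the kind of layout Cookiecutter Data Science generates – the annotations are mine, and the exact folder names may differ between versions:

├── data
│   ├── raw            <- the original, immutable input data
│   ├── interim        <- intermediate, transformed data
│   └── processed      <- final datasets ready for modelling
├── notebooks          <- exploratory IPython/Jupyter notebooks
├── reports            <- generated analysis and figures
└── src                <- reusable source code (data prep, features, models)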

What are your views? I’d love to hear them 🙂

Interview with a Data Scientist – Jessica Graves


Jessica Graves is a Data Scientist who currently works on fashion problems in New York City. She’s worked with Hilary Mason at Fast Forward Labs and keeps in regular contact with the London startup scene. After many months of asking her for an interview she finally gave in – and she shares her unique perspective on the datafication of Fashion. She comes from a background in visual and performing arts, as well as fashion design. In her spare time you’ll find her reading a stack of papers or studying dance.

Cover image: unsplash.com CC0


  1. What project that you have worked on do you wish you could go back to and do better?

I worked with Dr. Laurens Mets on an iteration of the technology behind Electrochaea, a device where microbes convert waste electricity to clean natural gas. My job was to translate models from electrochemistry journals into code, to help simulate, measure and optimize the parameters of the device. We needed to facilitate electron transport and keep the microbes happy. Read papers, write code, and design alternative energy technology with math + data?! I would hand my past self How to Design Programs as a guide and learn to re-implement from scratch in an open source language. 

  2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Listen! If you are a data scientist, listen carefully to the business problems of your industry, and see the problems for what they are, rather than putting the technical beauty of and personal interest in the solution first and foremost. You may find it’s more important to you to work with a certain type of problem than it is to work at a certain type of company, or vice versa. Watch very carefully when your team expresses frustration in general – articulate problems that no one knows they should be asking you to solve. At the same time, it can be tempting to work on a solution that has no problem. If you’re most interested in a specific machine learning technique, can you justify its use over another, or will high technical debt be a serious liability? Will a project be leveragable (legally, financially, technically, operationally)? Can you quantify the risk of not doing a project? 

  3. What do you wish you knew earlier about being a data scientist?

I wish I realized that data science is classical realist painting.

Classical realists train to accurately represent a 3D observation as a 2D image. In the strictest cases, you might not be allowed to use color for 1-3 years, working only with a stick of graphite, graduating to charcoal and pencils, eventually monotone paintings. Only after mastering the basics of form, line, value, shade, tone, are you allowed a more impactful weapon, color. With oil painting in particular, it matters immensely in what order at what layer you add which colors, which chemicals compose each color, of which quality pigment, at what thickness, with what ratio of which medium, with which shape of brush, at what angle, after what period of drying. Your primary objective is to continuously correct your mistakes of translating what you observe and suspending your preconception of what an object should look like.

There are many parallels with data science. At no point as a classical realist painter should you say, ‘well, it’s a face, so I’m going to draw the same lines as last time’ – just as, as a data scientist, you should look carefully at the data before applying algorithm x, even if that’s what every blog post Google surfaces to the top of your results says to do in that situation. You have to be really true to what you observe and not what you know – sometimes a hand looks more like a potato than a hand, and obsessing over anatomical details because you know it’s a hand is a mistake. Does it produce desirable results in the domain of problems that you’re in? Are you assuming Gaussian distributions on skewed data? Did you go directly to deep learning when logistic regression would have sufficed? I wish I knew how often data science course offerings are paint-by-numbers. You won’t get very far once the lines are removed, the data is too big to extract on your laptop, and an out-of-memory error pops up running what you thought was a pretty standard algorithm on the subset you used instead. Let alone that you have to create or harvest the data set in the first place – or sweet-talk someone into letting you have access to it.

In addition, Nulla dies sine linea – it’s true for drawing, ballet, writing. It’s true for data science. No day without a line. It’s very difficult to achieve sophistication without crossing off days and days of working through code or theoretical examples (I think this is why Recurse Center is so special for programmers). Sets of bland but well-executed tiny pieces of software. Unspectacular, careful work in high volumes raises the quality of all subsequent complex works. Bigger, slower projects benefit from the myriad partially explored pathways you already know not to take.

Also side notes to my past self: Linux. RAM. Thunderbolt ports. 

  4. How do you respond when you hear the phrase ‘big data’?

Big data? Like in the cloud? Or are we in the fog now? Honestly the first thing I see in my mind is PETABYTES. I think of petabytes of selfies raining from the sky and flowing into a data lake. Stagnant. Data-efficient AI is all the rage — less data, more primitives, smarter agents. In the meantime, optimizing hardware and code to work with large datasets is pretty fun. Fetishizing the size of the data works well …as long as you don’t care about robustness to diverse inputs. Can your algorithm do well with really niche patterns? What can you do with the bare minimum amount of data? 

  5. What is the most exciting thing about your field?

Fashion is visual. It’s inescapable. Every culture has garb or adornment, however minimal. A few trillion dollars of apparel, textiles, and accessories across the globe. The problems of the industry are very diverse and largely unsolved. A biologist might come to fashion to grow better silk. An AI researcher might turn to deep learning to sift through the massive semi-structured set of apparel images available online. So many problems that may have a tech solution are unsolved. Garment manufacturing is one of the most neglected areas of open source software development. LVMH and Richemont don’t fight over who provided the most sophisticated open-source tools to researchers the way that Amazon and Google do. You can start a deep learning company on a couple grand and use state-of-the-art software tools for cheap or free. You cannot start an apparel manufacturing vertical using state-of-the-art tools without serious investment, because the climate is still extremely unfavorable to support a true ecosystem of small-scale independent designers. The smartest software tools for the most innovative hardware are excessively expensive, closed-source, and barely marketed – or simply not talked about in publicly accessible ways. Sewing has resisted automation for decades, although it is finally at a place where the joining of fabrics into a seam is robot-automatable, with computer vision used on a thread-by-thread basis to determine the location of the next stitch.

High end, low end, or somewhere in between, the apparel side of fashion’s output is a physical object that has to be brought to life from scratch, or delivered seamlessly, to a human, who will put the object on their body. Many people participate in apparel by default, but the fashion crowd is largely self-selected and passionate, so it’s exciting (and difficult) to build for such an engaged group that doesn’t fit standard applications of standard machine learning algorithms.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc.? How do you know what is good enough?

Artists learn this eventually: volume of works produced trumps perfectionism. Even to match something in classical realism, you start with ridiculous abstractions. Cubes and cylinders to approximate heads and arms. Break it down into the smallest possible unit. Listen to Polya, “If you can’t solve a problem, then there is an easier problem you can solve: find it.”

As for when to finish? Nothing is ever good enough. The thing that is implemented is better than the abstract, possibly better thing, for now, and will probably outlive its original intentions. But make sure the solution correlates thoroughly with the problem, as described in the words of the stakeholder. Otherwise, for a consumer-facing product or feature, your users will usually give you clues as to what’s working.

  7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Be open. Fashion has a lot of space for innovation if you understand and quantify your impact on problems that are actually occurring and costing money or time, and show that you can solve them fast enough. “We built this new thing” has absolutely nothing to do with “We built this useful thing” and certainly not “We built this backwards-compatible thing”. You might be tempted to recommend a “new thing” and then complain that fashion isn’t sophisticated enough or “data” enough for it. As an industry that in some cases has largely ignored data for gut feelings with a serious payoff, I think the attitude should be more of pure respect than of condescension, and of transitioning rather than scrapping. That or build your own fashion thing instead of updating existing ones.  

  8. You have worked in fashion. Can you talk about the biggest opportunities for data in the fashion industry? Are there cultural challenges with datafication in such a ‘creative industry’?

Fashion needs ‘datafication’ that clearly benefits fashion. If you apply off-the-shelf collaborative filtering to fashion items that have a fixed seasonal shelf life and that users never really interact with, you’re going to get poor results. Algorithms that work badly in other domains might work really well in fashion with a few tweaks. NIPS had an ecommerce workshop last year, and KDD has a fashion-specific workshop this year, which is exciting to see, although I’ll point out that researchers have been trying to solve textile manufacturing problems with neural networks since the 90s.

A fashion creative might very well LOVE artificial intelligence, machine learning, and data science if you tailor your language into what makes their lives easier. Louis Vuitton uses an algorithm to arrange handbag pattern pieces advantageously on a piece of leather (not all surfaces of the leather are appropriate for all pattern pieces of the handbag) and marks the lines with lasers before artisans hand-cut the pieces. The artisans didn’t seem particularly upset about this. 

The two main problems I still see right now are the doorman problem and fit. Use data and software to make it simple for designers of all scales to adjust garments to fit their real markets instead of their imagined muses. And, use as little input as possible to help online shoppers know which existing items will fit. Once they buy, make sure they get their packages on time, securely, discreetly.