Interview with a Data Scientist: Mick Cooney


I'm delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in Finance and, more recently, in Insurance. He co-ran the very successful Dublin R meetup, which helped foster a data science community in Dublin, and he is now working over in London at an actuarial consultancy, building out a data science practice.

q1. What project that you have worked on do you wish you could go back to, and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the forecasts, but it does so by creating LaTeX files by hand and compiling them into PDF. One of the first things I would do is switch all of that over to use either 'knitr' or 'rmarkdown'. I would also use more 'reproducible research' concepts.

That said, I had worked on the modeling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modeling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem I expect I will be working
on it again in the near future. Hopefully I can apply those lessons
then.
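As a hedged illustration of the kind of 'simpler approach' Mick describes for persistency and churn, here is a minimal Python sketch using a Kaplan-Meier estimator from the lifelines package. The data, column meanings and numbers are invented for the example and are not from Mick's actual project or tooling.

```python
import numpy as np
from lifelines import KaplanMeierFitter

# Hypothetical persistency data: how many months each customer stayed,
# and whether we actually observed them lapsing (True) or they are still active (False).
rng = np.random.default_rng(42)
months_observed = rng.exponential(scale=24, size=500).round(1)
lapsed = rng.random(500) < 0.7

# A simple, assumption-light starting point: the Kaplan-Meier retention curve.
kmf = KaplanMeierFitter()
kmf.fit(durations=months_observed, event_observed=lapsed)

print(kmf.median_survival_time_)      # typical time to lapse
print(kmf.survival_function_.head())  # estimated retention curve over time
```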

***
q2. What advice do you have to younger analytics professionals and in
particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to Meetups, reading about new techniques, watching videos on YouTube and looking to strengthen areas where we are weak. This is why a natural interest and curiosity is so invaluable – it makes these necessary tasks much less of a burden, as they are things you would want to do anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes, and why I re-read the books and re-watch the lectures I find useful. I want these things to be second nature, and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

I have two main things I wish I had learned early on in my career, and both are connected philosophically. First, I wish I had learned about probabilistic thinking, risk management, economics and statistics – you can never learn enough about these fundamental topics. Secondly, I wish I had learned that it is okay to start working with a bad model that you know is wrong but simple.

To that first point, I spent a long time fighting my natural desire for a clean, elegant and correct answer to a problem. I would work on a problem, get to a point that I was confident pointed us in the right direction, but then realise that 'proving' this was right involved a huge amount of time and effort, assuming it was possible.

I attributed my natural reluctance to pursue this 'answer' to laziness, and felt guilty. I felt I was being unprofessional and sloppy. But working on forecasting models for trading taught me that this was not the case. Models are so imperfect, and involve so many compromises, that it is often better to think about other things first – what are the limitations of the model in practice, what is it saying, and how are you going to use it? Answer those questions first, THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.
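A minimal sketch of what 'starting with a simple, stupid model' might look like in Python with scikit-learn: fit a trivial baseline and a plain linear regression first, and only reach for something fancier if these are clearly inadequate. The data is synthetic and purely illustrative, not from any of Mick's projects.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# "Stupid" baseline: always predict the mean of y.
baseline = DummyRegressor(strategy="mean")
# Simple first model: ordinary linear regression.
simple = LinearRegression()

for name, model in [("baseline", baseline), ("linear", simple)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean().round(3))
```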

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size,
in-memory, on-disk and finally the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriate sampling of your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.
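As a rough sketch of the 'sample it down to a manageable size' idea, in Python with pandas you might stream a large file in chunks and keep a small random fraction of each chunk. The file name and sampling fraction here are placeholders.

```python
import pandas as pd

# Stream a large CSV in chunks and keep a ~1% random sample of each chunk,
# so the working data set fits comfortably in memory.
samples = []
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    samples.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(samples, ignore_index=True)
print(sample.describe())  # analyse the manageable subset as usual
```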

Once you have decided on a solution, putting the model into production and scaling it for your business is a major issue, but it is a problem belonging more to the realm of network and software engineering. That said, it is important that people with a solid understanding of the concepts stay involved, just in case some 'optimisations' ruin the output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in 'The Fog of War' mentioned that you should never answer the question asked but instead answer the question you wanted to be asked, so with your forbearance I will first answer a liberal interpretation of that question: what work gets me excited?

The short answer to that question is all sorts of things do, but they
are often small things related to work I am doing. In the last few
months, I was excited to try out dataexpks (a data exploration package
I am co-creating) on a brand new data set to see what it showed me and
how well my code worked. I love thinking of ways to use Monte Carlo simulation to test the output of various regression models, and over Christmas I was fascinated by a short project trying out methods for investigating differences between a subpopulation and the larger population it belongs to.
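A minimal sketch, in Python, of the Monte Carlo idea mentioned above: simulate data from a model whose coefficients are known, refit the regression many times, and check that the fitted coefficients recover the truth. Everything here is synthetic and illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(123)
true_beta = np.array([2.0, -1.0])

estimates = []
for _ in range(500):
    # Simulate data from a model whose coefficients we know.
    X = rng.normal(size=(200, 2))
    y = X @ true_beta + rng.normal(scale=1.0, size=200)
    # Refit the regression and record the estimated coefficients.
    estimates.append(LinearRegression().fit(X, y).coef_)

estimates = np.array(estimates)
print(estimates.mean(axis=0))  # should sit close to true_beta
print(estimates.std(axis=0))   # sampling variability of the estimator
```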

I am fascinated by new ways to learn the fundamentals – there are a few excellent resources out there and I read them all the time. I can never learn enough, as in my experience reality tends to present us with basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended it, I think the advances in reinforcement learning techniques probably have the biggest potential – some of the Atari game-playing from DeepMind was eye-opening. Sadly, if history is any guide, much of it will prove to be hype, but I expect some very interesting results to come from the work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what I
do or how to articulate it. I have had the good fortune to help a lot
of people with their projects and problems, exposing me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs, articles and subscribe to mailing
lists. While rarely having the time to read all this, often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the problem: what is being asked? Do we have any data? What does it look like? Are there other data available that we can use to enrich it, or to use as a substitute?

Going through that process will suggest approaches to use, and at that point I draw upon previous experience, however tangential it may be to the problem.

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making there: I think there is a lot of implicit knowledge in this field, and people looking for help have told me a number of times that they feel overwhelmed by the sheer amount of knowledge they think they need to know.

I do not think this is true, but I understand its origin: there are so many different aspects to working with data that it is tough to know where to start. I always start very simple, but as I mentioned earlier, it took a lot of time, thought and effort to get to that point, and it is not easy to explain these ideas in theory – you have to work on a number of different datasets to get a feel for how to do this.

As a result, I believe an approach such as mentoring or apprenticeships is an effective way to teach people – more experienced analysts can guide junior members around the various pitfalls and traps that are easy to fall into. It also allows us to illustrate that fancy, sophisticated techniques and algorithms are not needed to do interesting work – some of the most interesting work I have seen involved little more than summary statistics along with basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest book I have read that talks about this is "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill, which stresses the importance of starting from simple models. I would love to know if there are more.

That said, I could only appreciate the point because I was already experienced; a younger version of myself would have missed it. It would not have occurred to me that the right way to do something is to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.

Interview with a Data Scientist: Phillip Higgins


Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, amongst other roles, including some in Germany.

What project that you have worked on do you wish you could go back to and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area while at the same time broadening their knowledge, and to maintain this focus on both specialised and general subjects throughout their careers. Secondly, it's important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?
Undoubtedly I wish I had known the importance of communication skills in the whole analytics life-cycle. It's particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of that work. I don't think it's a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time that Big Data has become mainstream – for Big Data to yield insights, "Big Analytics" needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think it's important never to lose sight of the business objectives that are the rationale for most data-science projects. Although it is essential that businesses allow data science to disprove hypotheses, at the end of the day most of the evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses lies mostly in exploratory data analysis, combined with domain knowledge. It is important to communicate this uncertainty about framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that's certainly an enjoyable aspect of the job. I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

Interview with a Data Scientist: Trey Causey

Trey Causey is a blogger with experience as a professional data scientist in sports analytics and e-commerce. He’s got some fantastic views about the state of the industry, and I was privileged to read this.
1. What project that you have worked on do you wish you could go back to, and do better?
The easy and honest answer would be to say all of them. More concretely, I’d love
to have had more time to work on my current project, the NYT 4th Down Bot before
going live. The mission of the bot is to show fans that there is an analytical
way to go about deciding what to do on 4th down (in American football), and that
the conventional wisdom is often too conservative. Doing this means you have to
really get the “obvious” calls correct as close to 100% of the time as possible,
but we all know how easy it is to wander down the path to overfitting in these
circumstances…
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?
Students should take as many methods classes as possible. They’re far more generalizable
than substantive classes in your discipline. Additionally, you’ll probably meet
students from other disciplines and that’s how constructive intellectual cross-fertilization
happens. Additionally, learn a little bit about software engineering (as distinct
from learning to code). You’ll never have as much time as you do right now for things
like learning new skills, languages, and methods.
For young professionals, seek out someone more senior than yourself, either at your
job or elsewhere, and try to learn from their experience. A word of warning, though,
it’s hard work and a big obligation to mentor someone, so don’t feel too bad if
you have a hard time finding someone willing to do this at first. Make it worth
their while and don’t treat it as your “right” that they spend their valuable
time on you. I wish this didn’t even have to be said.
3. What do you wish you knew earlier about being a data scientist?
 
It’s cliche to say it now, but how much of my time would be spent getting data,
cleaning data, fixing bugs, trying to get pieces of code to run across multiple
environments, etc. The “nuts and bolts” aspect takes up so much of your time but
it’s what you’re probably least prepared for coming out of school.
4. How do you respond when you hear the phrase ‘big data’?
Indifference.
5. What is the most exciting thing about your field?
Probably that it’s just beginning to even be ‘a field.’ I suspect in five years
or so, the generalist ‘data scientist’ may not exist as we see more differentiation
into ‘data engineer’ or ‘experimentalist’ and so on. I’m excited about the
prospect of data scientists moving out of tech and into more traditional
companies. We’ve only really scratched the surface of what’s possible or,
amazingly, not located in San Francisco.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
A difficult question along the lines of “how long is a piece of string?” I think
the key is to communicate early and often, define success metrics as much as
possible at the *beginning* of a project, not at the end of a project. I’ve found
that “spending too long” / navel-gazing is a trope that many like to level at data
scientists, especially former academics, but as often as not, it’s a result of
goalpost-moving and requirement-changing from management. It’s important to manage
up, aggressively setting expectations, especially if you’re the only data scientist
at your company.
7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?
Honestly, I don’t believe I’ve met any executives who were dubious about the
value of data or data science. The challenge is often either a) to temper
unrealistic expectations about what is possible in a given time frame (we data
scientists mostly have ourselves to blame for this) or b) to convince them to
stay the course when the data reveal something unpleasant or unwelcome.
8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.
I’m about to start a new position as the first data scientist at ChefSteps, which
I’m very excited about, but I can’t tell you about what I’ve been working on there
as I haven’t started yet. Otherwise, the 4th Down Bot has been a really fun
project to work on. The NYT Graphics team is the best in the business and is
full of extremely smart and innovative people. It’s been amazing to see the
thought and time that they put into projects.
9. What is the biggest challenge of leading a data science team?
I’ve written a lot about unrealistic expectations that all data scientists
be “unicorns” and be experts in every possible field, so for me the hardest
part of building a team is finding the right people with complementary skills
that can work together amicably and constructively. That’s not special to
data science, though.

Interview with a Data Scientist: Thomas Wiecki


I interviewed Thomas Wiecki recently – Thomas is Data Science Lead at Quantopian Inc., a crowd-sourced hedge fund and algorithmic trading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year, which I found so fascinating that I decided to learn some PyMC3 🙂

1. What project that you have worked on do you wish you could go back to, and do better?
While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon (https://code.google.com/p/pynopticon/wiki/Introduction), even though it never gained any traction. I spent a lot of time developing a streamed data piping mechanism that was pretty general and flexible. This was in anticipation of the large size of data sets. In retrospect, though, it was overkill, and I should have spent less time coming up with the best solution and instead spent the time improving usability! Resources are limited and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Spend time learning the basics. This will make more advanced concepts much easier to understand as it’s merely an extension of core principles and integrates much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive in deep. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So if possible, I would advise people to study these things while still in school as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and have a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s Doing Bayesian Analysis, for fundamentals “The Elements of Statistical Learning” is a classic).
3. What do you wish you knew earlier about being a data scientist?
How important non-technical skills are. Communication is key, but so is understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that you communicate to an expert audience. This certainly will not be the case in industry, where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.
As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.
4. How do you respond when you hear the phrase ‘big data’?
As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.
Then there's the more technical interpretation where it means that data increases in size and some data sets do not fit into RAM anymore. I'm still undecided on whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like Hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of the time, as a data scientist, I think you can get by by sub-sampling the data (Andreas Müller has a great talk on how to do ML on large data sets: https://www.youtube.com/watch?v=l43VIw5xhTg).
Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. However, with “big data”, the advanced inference algorithms fail to scale so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt: https://www.youtube.com/watch?v=pHsuIaPbNbY
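Since Thomas leads PyMC3 development, a tiny PyMC3 example may help make "Probabilistic Programming on medium data" concrete. This is a generic, hedged sketch of a Bayesian mean-and-spread model on synthetic data, not code from Quantopian or from Thomas.

```python
import numpy as np
import pymc3 as pm

# Medium-sized synthetic data, e.g. daily returns.
rng = np.random.default_rng(1)
data = rng.normal(loc=0.05, scale=1.0, size=5_000)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)    # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=1.0)  # prior on the spread
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1_000, tune=1_000, chains=2)  # MCMC inference

print(pm.summary(trace))
```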
5. What is the most exciting thing about your field?
The fast pace at which the field is moving. It seems like every week there is another cool tool announced. Personally I'm very excited about the blaze ecosystem, including dask, which takes a very elegant approach to distributed analytics by relying on existing functionality in well-established packages like pandas instead of trying to reinvent the wheel. Data visualization is also coming along quite nicely, where the current frontier seems to be interactive web-based plots and dashboards, as worked on by bokeh, plotly and pyxley.
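For the dask point, here is a minimal sketch of the pandas-like API it exposes; the file pattern and column names are invented for illustration.

```python
import dask.dataframe as dd

# Looks like pandas, but lazily partitions the work across many CSV files
# (or across a cluster) and only computes when asked.
df = dd.read_csv("events-2015-*.csv")  # hypothetical file pattern
daily_totals = df.groupby("date")["amount"].sum()

print(daily_totals.compute().head())   # .compute() triggers the actual work
```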
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
I try to keep the loop between data analysis and communication to consumers very tight. This also extends to any software to perform certain analyses, which I try to place into the hands of others even if it's not perfect yet. That way there is little chance of veering off track too far, and there is a clearer sense of how usable something is. I suppose it's borrowing from the agile approach and applying it to data science.

Interview with a Data Scientist: Erin Shellman

I recently caught up with Erin for an interview. Her interview is full of nice pieces of hard-earned advice and her final answer on Data Governance is gold!
Erin does some great blog posts at her blog, which I recommend. Erin is a programmer + statistician working as a research scientist at Amazon Web Services. Before that she was a Data Scientist in the Nordstrom Data Lab, where she primarily built product recommendations for Nordstrom.com. She mostly codes in Scala, Python and R, but dabbles in Javascript to put data on the internet. Erin loves to teach and speak, and does both often through talks, as co-organizer of PyLadies-Seattle, and as an instructor at the University of Washington’s Professional and Continuing Education program.
 
1. What project that you have worked on do you wish you could go back to, and do better?
Often the goal of data science projects is to automate processes with data–I worked on a lot of projects at Nordstrom with that goal. I think we were pretty naive in those pursuits, often approaching the problems with low empathy and EQ (Emotional Quotient). We built tools, expecting that the teams we were trying to automate would immediately see the value and jump to use them, but we didn't spend a lot of time listening and trying to understand why some might be hesitant to adopt our tools. Eventually, I started training people and specifically asking them to send bug reports or feature requests. The trainings opened up dialog about our plans and made the other teams more invested, because they could see when their bugs were fixed and their features implemented. I learned that doing the data work is only half (or less) of the challenge; the other half is advocating for your work in such a way that others are similarly compelled.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
If you’re in school right now, use this time to master a programming language (you have more time than you ever will again despite what you may believe). For data science, I’d recommend Python, R or Scala (and if you had to choose one, Python). You absolutely need to be able to produce high-quality code before you walk in the door because chances are you’ll be asked to code early in the interview process.
I also think you shouldn’t spend too much time “training” and learning in your free time, it’s nearly impossible to retain knowledge that way. Instead, spend all your time shoring up the essentials and work on getting a job immediately. You’ll learn so much more on the job than you could ever hope to on your own, plus you’ll be paid. Don’t wait for postings for junior data scientists (I don’t know that I’ve ever even seen one), contact employers you’re interested in working with directly and ask them to make that role for you. You should look for places where you know there’s a solid data team already so you have plenty of people to learn from. Academics tend to have a sort of learned helplessness because they’re so often not in control of their work or careers. This is not the case in industry, if you want something, don’t wait for it to come to you (it won’t). Be an active participant in your future.
3. What do you wish you knew earlier about being a data scientist?
I wish I had spent more time in grad school learning computer science. Often DS (Data Science) jobs end up being almost the same as CS (Computer Science) jobs, and in my case I had to pick up a lot of CS skills on the job.
4. How do you respond when you hear the phrase ‘big data’?
Usually by rolling my eyes so far into the back of my head that they get stuck. I think the return on investment of Any Data is still higher than that of Big Data. Most shops who’re convinced that they need big data technology don’t make use of the data they have already, and adding more data to the pile won’t help the cause.
5. What is the most exciting thing about your field?
The most exciting thing is that I get to learn for a living. Every time I switch jobs or work on something new I have to learn a ton, different technologies and languages, different domains, and different businesses. I especially love that data science is often so close to the business. I love learning about what makes a business successful and providing knowledge to help businesses make better decisions.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
When I'm approaching a new problem I focus really hard on the inputs and outputs, particularly the output. What exactly are you trying to produce, or trying to answer? This is often a question I pose to business stakeholders to encourage them to think critically about what they really want to know, how it will be applied, and how to articulate it formally. Basically what I encourage them to do is state a formal hypothesis and the observations required to test that hypothesis. Once we've all agreed on the output, what are the inputs? I try to make this as specific as possible, so no "customer data"-level descriptions. Tell me exactly what the inputs are, e.g. annual customer spend, age, and zip code. The more you can reason through the solution in terms of inputs and outputs before you set out to solve the problem, the less likely it will be that you're halfway to answering a question that was ill-posed (I promise, this is 90% of requests), or that you don't have data to support (this is probably another 5% of requests). It's also a good way to prevent "stakeholder punting", which is a phrase I made up just now to describe when stakeholders make half-baked requests and then leave them for you to sort out. Data science and research are highly collaborative, and the data scientist shouldn't be the only one invested in the work.
Once the inputs and outputs are defined, I like to draw flowcharts of the path to completion, and it's usually easier to start from the bottom. Here's an example I created for the students in my data mining course. They were working on prediction of a continuous outcome with various regression methods. First we decided on a criterion for model selection, which in this case was the model with the lowest root mean squared error. You can see that the input is a data file, and the output is whichever model had the best predictive accuracy as measured by the lowest RMSE (Root Mean Square Error). For me, diagramming your work like this makes your goal completely concrete.
[Flowchart: a data file feeds several candidate regression models, and the model with the lowest RMSE is selected as the output.]
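A minimal Python sketch of the selection step that flowchart describes: fit several regression methods and keep whichever gives the lowest RMSE on held-out data. The data and models here are placeholders, not Erin's course materials.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the course data file.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.8, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=7),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse[name] = np.sqrt(mean_squared_error(y_test, preds))

best = min(rmse, key=rmse.get)  # the model with the lowest RMSE wins
print(rmse, "-> selected:", best)
```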
The other really great thing about framing problems this way is that it makes it very easy to estimate effort and communicate to others what is required to complete the projects. For whatever reason, people often assume that while software engineers need 2 weeks to add a minor feature, data scientists need about 6 hours to do complete analyses and make beautiful visualizations. Communicating the amount of work required to complete projects to the requesters is crucial in data science, because most people just don’t know. It’s not something software engineers typically have to do, but providing guidance on the components of a data science project to your stakeholders will reduce your stress in the long-run.
7. What does data governance or data quality mean to you as a data scientist?
Data governance is the collection of processes and protocols to which an organization conforms to ensure data accuracy and integrity. Most of the time I'm a data consumer, so I depend on a mature data infrastructure team to create the pipelines I use to collect and analyze data. When I was working on recommendations at Nordstrom, I was both a consumer and a provider. I provided data in the sense that the output of my recommendation algorithms was data consumed by the web team. Data governance in that context meant writing lots of unit tests to make sure the results of my computations produced correctly formatted entries. It also meant applying business rules, for example removing entries for products out of stock, or applying brand restrictions.
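A hedged sketch of what "unit tests plus business rules" for recommendation output might look like in Python. The field names, the in-stock rule and the helper function are all hypothetical, not Nordstrom's actual pipeline.

```python
REQUIRED_FIELDS = {"customer_id", "sku", "score"}

def apply_business_rules(recommendations, in_stock_skus):
    """Drop recommendations for products that are out of stock (hypothetical rule)."""
    return [r for r in recommendations if r["sku"] in in_stock_skus]

def test_output_is_correctly_formatted():
    recs = [{"customer_id": 1, "sku": "A123", "score": 0.9}]
    for rec in recs:
        assert REQUIRED_FIELDS <= rec.keys()  # every entry has the agreed fields
        assert 0.0 <= rec["score"] <= 1.0     # scores stay in the expected range

def test_out_of_stock_items_are_removed():
    recs = [{"customer_id": 1, "sku": "A123", "score": 0.9},
            {"customer_id": 1, "sku": "B456", "score": 0.8}]
    filtered = apply_business_rules(recs, in_stock_skus={"A123"})
    assert [r["sku"] for r in filtered] == ["A123"]
```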

Interview with a Data Scientist: Peadar Coyle


Peadar Coyle is a Data Analytics professional based in Luxembourg. His intellectual background is in Mathematics and Physics, and he currently works for Vodafone in one of their Supply Chain teams.

He is passionate about data science and the lead author of this project. He also contributes to Open Source projects and speaks at EuroSciPy, PyData and PyCon.

His expertise is largely in the statistical side of Data Science.

Peadar was asked by several of his interviewees to share his own interview, so he humbly does so here.

  1. What project that you have worked on do you wish you could go back to, and do better?

I agree that it is better to look forward rather than look backward. And my skills have frankly improved since I first started doing what we could call professional data analysis (which was probably just before starting my Masters a few years ago).

One project I did which springs to mind (and not naming names) is one where there was a huge breakdown in communication and misaligned incentives. There needed to be more communication on that project, and it overran its initially allotted time. I also did not spend enough time communicating the risks and opportunities up front with the stakeholders.

The data was a lot messier than expected, and management had committed to delivering results in two weeks. This was impossible; the data cleaning and exploration phase took too long. Now I would focus on quicker wins. I also rushed to the 'modelling' phase without really understanding the data. I think terms such as 'understanding the data' sound a bit academic to some stakeholders, but you need to explain clearly and articulately how important the data generation process is, and the uncertainty in that data.

Some of this comes from experience – now I focus on adding value as quickly as possible and keeping things simple. There I fell for the siren call of 'do more analysis' rather than thinking about how the analysis is conveyed.

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD but I have recently been giving advice to people in that situation.

My advice is that having a portfolio of work, if possible, is great – or at least move towards doing an online course on Machine Learning or something cool like that.

The PyData videos are also a good place to start. I'd recommend, if you can, doing some outreach or communication-skills courses. There are many such courses at universities around the world, and they will help you understand the needs of others.

Frankly, I think that the most important skill for a Data Scientist is the 'tactical application of empathy', and that is something that working in a team really helps you develop. One thing I feel my Masters let me down on – as is common in Pure Mathematics – was a shortage of experience working in a team.

  3. What do you wish you knew earlier about being a data scientist?

The focus on communication skills, and the need to add value every day. The fact that a budget or a project can be terminated at any moment.

Adding value every day means showing results and sharing them, talking to people about stuff. Share visualizations, and share results – a lot of data science is about relationships and empathy. In fact I think that the tactical application of empathy is the greatest skill of our times.

You need to get out there and speak to the domain specialist, and understand what they understand. I believe that the best algorithms incorporate human as well as machine intelligence.

  4. How do you respond when you hear the phrase 'big data'?

I too like the distinction between small, medium and big data. I don't worry so much about the terminology, and I focus on understanding exactly what my stakeholder wants from it.

I think, though, that it is often a distraction. I did one proof of concept as a consultant that was an operational disaster. We didn't have the resources to support a DevOps culture, nor did we have the capabilities to support a Hadoop cluster. Even worse, the problem could really have been solved more intelligently by keeping the data in RAM. But I got excited by the new tools without understanding what they were really for.

I think this is a challenge; part of my maturing as an engineer/data scientist has been appreciating the limits of tools and avoiding the hype. Most companies don't need a cluster, and the mean size of a cluster will remain one for a long time. Don't believe the salesmen, and ask the experts in your community about what is needed.

In short: I do feel it is strongly misleading but it is certainly here to stay.

  5. How did you end up being a data analyst? What is the most exciting thing about your field?

My academic and professional career has taken a bit of a weird path. I started at Bristol in a Physics and Philosophy program. It was a really exciting time, and I learned a lot (some of it non-academic). I went into that program because I wanted to learn everything. At various points – especially in 2009-2010 – the terminology of 'data science' began to pick up, and when I went into grad school in 2010, I was 'aware' of the discipline. I took a lot of financial maths classes at Luxembourg, just to keep that option open, yet I still in my heart wanted to be an academic.

Eventually, after some soul searching, I realized that academic opportunities were going to be too difficult to get and that I could earn more in industry. So I did a few industrial internships, including one at import.io, and towards the end of my Masters I did a 6-month internship at a 'small' e-commerce company called Amazon.com.

I learned a lot at Amazon.com, and it was there that I realized I needed to work a lot harder on my software engineering skills. I've been working on them in my day-to-day work, through contributing to open source software and through my various speaking engagements. I strongly recommend that any wannabe data geeks come to these and share their own knowledge 🙂

The most exciting thing about my field relates to the first statement about physics and philosophy – we truly are drowning in data, and with the computational resources we now have, we really do have the ability to answer or simulate certain questions in a business context. The web is a microscope, and your ERP system tells you more about your business than you can imagine – I'm very excited to help companies exploit their data.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I like the OSEMIC framework (which I developed myself) and the CoNVO framework (which comes from Thinking with Data by Max Shron – I recommend his introductory video as well as the book itself).

Let me explain – at the beginning of an 'engagement' I look for the Context, Need, Vision and Outcome of the project. Outcome means the delivery, and asking these questions in a conversation with stakeholders is a really good way to get to solving the 'business problem'.

A lot of this after a few years in the business still feels like an art rather than a science.

I like explaining to people the Data Science process – obtain data, scrub data, explore, model, interpret and communicate.

I think a lot of people get these kinds of notions, and a lot of my conversations at work recently have been about data quality – and data quality really needs domain knowledge. It is amazing how easy it is to misinterpret a number, especially around things like unit conversions.

  7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

I would echo a lot of the points above. One challenge is that some places aren't ready for a data scientist, nor do they know how to use one. I would avoid such places and look for work elsewhere.

Some of this is a lack of vision, and one reason I do a lot of talks is to do ‘educated selling’ about the gospel of data-informed decision making and how the new tools such as the PyData stack and R are helping us extract more and more value out of data.

I’ve also found that visualizations help a lot, humans react to stories and pictures more than to numbers.

My advice to new starters is to over-communicate and to learn some soft skills. The frameworks I mentioned help a bit in structuring and explaining a project to stakeholders. I also recommend reading this interview series – I learned a lot from it too 🙂

Interview with a Data Scientist: Ian Ozsvald


Ian Ozsvald is a Data Scientist based in London. He’s a friend and an inspiration to all us data geeks. He’s a co-organizer of PyData in London and speaks a lot on the data science circuit. He’s also very tall 🙂

I include a bio at the bottom.

1. What project that you have worked on do you wish you could go back to, and do better?
My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project and the client provided representative data, according to the specification that I'd laid out. I built a set of classifiers that performed as well as a human and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn't solve the task on this new, very messy data – I couldn't imagine how the machine would solve it. The client explained that they wanted the first task to succeed, so they gave me the best data they could find, and since we'd solved that problem, now I could work on the harder stuff. I tried my best to explain the requirements of the derisking project but fear I didn't give a deep enough explanation of why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the needs for a derisking phase.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years experience in each industrial domain you’ll work in. None of this however is realistic. Instead focus on some areas that interest you and that pay well-enough and deepen your skills so that you’re valuable. Next go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high quality free tools. For me I speak, teach and keynote at conferences like PyDatas, PyCons, EuroSciPys and EuroPythons around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

3. What do you wish you knew earlier about being a data scientist?
I wish I had known how much I'd regret not paying attention in my statistics and linear algebra classes! I also wish I'd appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually; they don't work well from lists of numbers.
4. How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine and probably you can represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
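A minimal illustration, in Python, of the sparse-array point made above: the same matrix stored densely versus in SciPy's compressed sparse row format. The matrix size and density are arbitrary and only serve to show the order-of-magnitude saving.

```python
from scipy import sparse

# A mostly-empty interaction matrix: 100k "users" x 10k "items", 0.01% filled.
mat = sparse.random(100_000, 10_000, density=0.0001, format="csr", random_state=0)

# What a dense float64 array of the same shape would need, versus the CSR storage.
dense_bytes = mat.shape[0] * mat.shape[1] * 8
sparse_bytes = mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```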

5. What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity, I strongly suspect that we can make this task machine-powered using some supervised approaches so less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough but first you have to derisk it and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value, this helps everyone stay confident when you hit the inevitable problems.

7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you're doing. Make bits of it easy to understand, explain why it is valuable, and people will buy into it. Don't hide behind your models; instead, speak to domain experts, learn about their expertise, and use your models to back up and automate their judgement – you'll want them on your side.
8. You have a cool startup can you comment on how important it is as a CEO to make a company such as that data-driven or data-informed?

My consultancy (ModelInsight.io) helps companies to exploit their data so we’re entirely data-driven! If a company has figured out that it has a lot of data and it could steal a march on its competitors by exploiting this data, that’s where we step in. A part of the reason I speak internationally is to help companies think about the value in their data based on the projects we’ve worked on previously.

Bio: 

My name is Ian Ozsvald. I’m an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

I take on work in my Artificial Intelligence consultancy (Mor Consulting Ltd.) and I also author The Artificial Intelligence Cookbook – learn how to add clever algorithms to your software to make it smarter! One of my mobile products is SocialTies (built with RadicalRobot).

I co-founded ShowMeDo.com in 2005; it is all about tutorial screencasts that teach you programming – see About ShowMeDo for more info. This was my second company and I'm rather proud to say that it is financially self-sufficient, growing and full of very useful user-generated (and us-generated) content. 100,000 users and 1TB of data served per month say that we built something very useful indeed. In 5 years ShowMeDo has educated over 3 million people about open source tools.

I’m also co-founder of the £5 Apps Meetup, OpenCoffee Sussex and the BrightonDigital mail list (RIP).

Previously I've worked as Senior Programmer at Algorithmix (now Corpora) and the MASA Group, and these jobs came via my MSc in Artificial Intelligence at Sussex University. See my LinkedIn profile.