Interview with a Data Scientist: Mick Cooney


I’m delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in finance and, more recently, in insurance. He co-ran the very successful Dublin R meetup, which helped foster a data science community in Dublin, and he has lately been working in London at an actuarial consultancy, building out a data science practice.

q1. What project that you have worked on do you wish you could go back
to and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the
forecasts, but it does so by creating LaTeX files by hand and compiling
them into PDF. One of the first things I would do is switch all of that
over to either ‘knitr’ or ‘rmarkdown’. I would also apply more
‘reproducible research’ concepts.

That said, I had worked on the modeling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modeling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem I expect I will be working
on it again in the near future. Hopefully I can apply those lessons
then.
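
As an illustration of where such a persistency or churn analysis might start, here is a minimal Python sketch, assuming the lifelines survival-analysis library and entirely made-up tenure data (not Mick's actual model):

    import pandas as pd
    from lifelines import KaplanMeierFitter

    # Hypothetical customer tenures: months observed and whether the
    # customer lapsed (1) or is still active, i.e. censored (0).
    data = pd.DataFrame({
        "tenure_months": [3, 14, 7, 24, 11, 36, 5, 18, 29, 9],
        "lapsed":        [1,  1, 1,  0,  1,  0, 1,  0,  0, 1],
    })

    kmf = KaplanMeierFitter()
    kmf.fit(durations=data["tenure_months"], event_observed=data["lapsed"])

    # Estimated probability that a customer is still around after each month.
    print(kmf.survival_function_.head(10))
    # Estimated probability of surviving past 12 months.
    print(kmf.predict(12))

Even a simple non-parametric curve like this makes the assumptions visible before anything fancier is attempted.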

***
q2. What advice do you have to younger analytics professionals and in
particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career, not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to
Meetups, reading about new techniques, watching videos on YouTube and
looking to strengthen areas where you are weak. This is why a natural
interest and curiosity is so invaluable – it makes these necessary
tasks much less of a burden as they are things you would want to do
anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes and reread and
re-watch the books and lectures I find useful. I want these things to
be second nature and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

There are two main things I wish I had learned early on in my career,
and both are connected philosophically. First, I wish I had learned
about probabilistic thinking, risk management, economics and
statistics – you can never learn enough about these fundamental
topics. Secondly, I wish I had learned that it is okay to start
working with a bad model that you know is wrong but simple.

To that first point, I spent a long time fighting my natural desire
for a clean, elegant and correct answer to a problem. I would work on
a problem, get to a point that I was confident pointed us in the right
direction, but then realise that ‘proving’ this was right would
involve a huge amount of time and effort, assuming it was possible at
all.

I attributed my natural reluctance to pursue this ‘answer’ to
laziness, and felt guilty. I felt I was being unprofessional and
sloppy. But working on forecasting models for trading taught me that
this was not the case. Models are so imperfect, with so many
compromises, that it is often better to think about other things first
– what are the limitations of the model in practice, what is it
saying, how are you going to use it? Answer those questions first,
THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.
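
To make that concrete, here is a minimal sketch of the kind of deliberately simple starting point he describes – a mean-only baseline next to a plain linear regression on made-up data (my illustration, not Mick's code):

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 3))                       # three made-up features
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=2.0, size=500)

    # "Stupid" baseline: always predict the mean of y.
    # Slightly less stupid: an ordinary linear regression.
    models = [("mean baseline", DummyRegressor(strategy="mean")),
              ("linear regression", LinearRegression())]

    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5,
                                 scoring="neg_mean_absolute_error")
        print(f"{name}: MAE ≈ {-scores.mean():.2f}")

If a fancier model built later cannot beat numbers like these, that is worth knowing before investing in it.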

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size,
in-memory, on-disk and finally the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriate sampling of your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.
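
A tiny sketch of that idea, with a simulated ‘population’ standing in for an on-disk dataset (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for a large on-disk column: ten million skewed values.
    population = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

    # A modest random sample is often enough to answer the question.
    sample = rng.choice(population, size=10_000, replace=False)
    estimate = sample.mean()
    std_err = sample.std(ddof=1) / np.sqrt(len(sample))

    print(f"sample estimate : {estimate:.2f} ± {1.96 * std_err:.2f}")
    print(f"population mean : {population.mean():.2f}")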

Once you have decided on a solution, putting the model into production
and scaling it for your business is a major issue, but that is a
problem belonging more to the realm of network and software
engineering. That said, it is important that people with a solid
understanding of the concepts stay involved, just in case some
‘optimisations’ ruin the output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in ‘The Fog of War’ mentioned that you should never
answer the question asked but instead answer the question you wanted
to be asked, so with your forbearance I will first answer a liberal
interpretation of that question: what work gets me excited?

The short answer to that question is that all sorts of things do, but
they are often small things related to work I am doing. In the last
few months, I was excited to try out dataexpks (a data exploration
package I am co-creating) on a brand new data set to see what it
showed me and how well my code worked. I love thinking of ways to use
Monte Carlo simulation to test the output of various regression
models, and over Christmas I was fascinated by a short project trying
out methods of investigating differences between a subpopulation and
the larger population it sits within.
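
As a sketch of what ‘using Monte Carlo simulation to test the output of a regression model’ can look like (my illustration, with made-up coefficients): simulate data where the true parameters are known, refit the model many times, and check how well the procedure recovers them.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    true_beta = np.array([1.5, -0.8])
    n_sims, n_obs = 500, 200
    estimates = []

    for _ in range(n_sims):
        X = rng.normal(size=(n_obs, 2))
        y = X @ true_beta + rng.normal(scale=1.0, size=n_obs)
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        estimates.append(fit.params[1:])    # drop the intercept

    estimates = np.array(estimates)
    print("true coefficients :", true_beta)
    print("mean of estimates :", estimates.mean(axis=0).round(3))
    print("spread (std dev)  :", estimates.std(axis=0).round(3))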

I am fascinated by new ways to learn the fundamentals – there are a
few excellent resources out there and I read them all the time. I can
never learn enough, as in my experience reality tends to present us
with basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended it, I think
the advances in reinforcement learning techniques probably have the
biggest potential – some of the Atari game-playing from DeepMind was
eye-opening. Sadly, if history is any guide, much of it will prove to
be hype, but I expect some very interesting results to come from the
work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what I
do or how to articulate it. I have had the good fortune to help a lot
of people with their projects and problems, exposing me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs and articles and subscribe to mailing
lists. While I rarely have the time to read all of this, often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the problem:
what is being asked? Do we have any data? What does it look like?
Is there other data available that we can use to enrich it, or use as
a substitute?

Going through that process will suggest approaches to use, and at that
point I draw upon previous experience, however tangential to the
problem.

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making on that note, I think there is a lot
of implicit knowledge in this field, and I have been told a number of
times by people looking for help that they feel overwhelmed by the
sheer amount of knowledge they feel they need to know.

I do not think this is true, but I understand its origin: there are so
many different aspects to working with data that it is tough to know
where to start. I always start very simple, but as I mentioned
earlier, it took a lot of time, thought and effort to get to that
point, and it is not easy to explain these ideas in theory – you have
to work on a number of different datasets to get a feel for how to do
this.

As a result, I believe an approach such as mentoring or
apprenticeship is an effective way to teach people – more
experienced analysts can guide junior members around the various
pitfalls and traps that are easy to fall into. It allows us to
illustrate that fancy and sophisticated techniques and algorithms are
not needed to do interesting work – some of the most interesting work
I have seen involved little more than summary statistics along with
basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest
book I read that talks about this is “Data Analysis Using Regression
and Multilevel/Hierarchical Models” by Gelman and Hill, stressing the
importance of starting from simple models. I would love to know if
there are more.

That said, I could only appreciate the point because I was already
experienced, a younger version of myself would have missed the
point. It would not have occurred to me that the right way to do
something is to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.


Interview with a Data Scientist: Phillip Higgins


Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, among other roles, including some time spent in Germany.

What project that you have worked on do you wish you could go back to and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area and, at the same time, to broaden their knowledge, maintaining this focus on both specialised and general subjects throughout their careers. Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?

Undoubtedly, I wish I had known the importance of communication skills in the whole analytics life-cycle. It’s particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of their work. I don’t think it’s a coincidence that the importance of, and demand for, data scientists has risen sharply right at the time that Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” needs to be performed; they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think it’s important to never lose sight of the business objectives that are the rationale for most data science projects. Although it is essential that businesses allow for data science to disprove hypotheses, at the end of the day most evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses obviously lies mostly in exploratory data analysis (combined with domain knowledge). It is important to communicate this uncertainty about framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job. I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

Interview with a Data Scientist: Ian Ozsvald


Ian Ozsvald is a Data Scientist based in London. He’s a friend and an inspiration to all us data geeks. He’s a co-organizer of PyData in London and speaks a lot on the data science circuit. He’s also very tall 🙂

I include a bio at the bottom.

1. What project that you have worked on do you wish you could go back to and do better?
My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project and the client provided representative data, according to the specification that I’d laid out. I built a set of classifiers that performed as well as a human and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn’t solve the task on this new, very messy data – I couldn’t imagine how the machine would solve it. The client explained that they wanted the first task to succeed so they gave me the best data they could find and since we’d solved that problem, now I could work on the harder stuff. I tried my best to explain the requirements of the derisking project but fear I didn’t give a deep enough explanation of why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the need for a derisking phase.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years experience in each industrial domain you’ll work in. None of this however is realistic. Instead focus on some areas that interest you and that pay well-enough and deepen your skills so that you’re valuable. Next go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high quality free tools. For me I speak, teach and keynote at conferences like PyDatas, PyCons, EuroSciPys and EuroPythons around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

3. What do you wish you knew earlier about being a data scientist?
I wish I had known how much I’d regret not paying attention in classes on statistics and linear algebra! I also wish I’d appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually; they don’t work well from lists of numbers.
4. How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine and probably you can represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
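
To illustrate the ‘fit it in RAM with a better representation’ point, here is a small sketch (the sizes are invented) comparing a dense matrix with a sparse one for the kind of mostly-empty interaction data many companies actually have:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(7)

    # A mostly-empty user x product interaction matrix (hypothetical sizes).
    n_users, n_products, n_events = 200_000, 50_000, 1_000_000
    rows = rng.integers(0, n_users, n_events)
    cols = rng.integers(0, n_products, n_events)
    vals = np.ones(n_events, dtype=np.float32)

    interactions = sparse.csr_matrix((vals, (rows, cols)),
                                     shape=(n_users, n_products))

    dense_bytes = n_users * n_products * 4        # float32 dense storage
    sparse_bytes = (interactions.data.nbytes
                    + interactions.indices.nbytes
                    + interactions.indptr.nbytes)

    print(f"dense  : ~{dense_bytes / 1e9:.0f} GB")
    print(f"sparse : ~{sparse_bytes / 1e6:.0f} MB")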

5. What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity, I strongly suspect that we can make this task machine-powered using some supervised approaches so less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough but first you have to derisk it and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value, this helps everyone stay confident when you hit the inevitable problems.

7. You spent some time as a consultant in data analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you’re doing. Make bits of it easy to understand, explain why it is valuable, and people will buy into it. Don’t hide behind your models; instead speak to domain experts, learn about their expertise and use your models to back up and automate their judgement – you’ll want them on your side.
8. You have a cool startup. Can you comment on how important it is, as a CEO, to make a company such as that data-driven or data-informed?

My consultancy (ModelInsight.io) helps companies to exploit their data so we’re entirely data-driven! If a company has figured out that it has a lot of data and it could steal a march on its competitors by exploiting this data, that’s where we step in. A part of the reason I speak internationally is to help companies think about the value in their data based on the projects we’ve worked on previously.

Bio: 

My name is Ian Ozsvald. I’m an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

I take on work in my Artificial Intelligence consultancy (Mor Consulting Ltd.) and I also author The Artificial Intelligence Cookbook – learn how to add clever algorithms to your software to make it smarter! One of my mobile products is SocialTies (built with RadicalRobot).

I co-founded ShowMeDo.com in 2005; it is all about tutorial screencasts that teach you programming – see About ShowMeDo for more info. This was my second company and I’m rather proud to say that it is financially self-sufficient, growing and full of very useful user-generated (and us-generated) content. 100,000 users and 1TB of data served per month say that we built something very useful indeed. In 5 years ShowMeDo has educated over 3 million people about open source tools.

I’m also co-founder of the £5 Apps Meetup, OpenCoffee Sussex and the BrightonDigital mail list (RIP).

Previously I’ve worked as Senior Programmer at Algorithmix (now Corpora) and the MASA Group, and these jobs came via my MSc in Artificial Intelligence at Sussex University. See my LinkedIn profile.

Interview with a Data Scientist: Cameron Davidson-Pilon

Cameron is an open source contributor, a pythonista and a data geek –  he’s developed various cool libraries. His blog is worth a read, and I personally recommend his screencasts.
He has a strong mathematical background, like myself, and he is currently Lead Data Analyst at Shopify. He’s possibly most famous in the Python community for his excellent Bayesian Methods for Hackers. I also had the honour of contributing to that project.
1. What project that you have worked on do you wish you could go back to and do better?
1. For sure, it was my projects during 2012 when I first started to enter Kaggle competitions. The two in particular I wish I could redo were the Twitter Psychopaths challenge and the US Census Return Rate challenge. In both challenges I made some serious high-level errors (but that’s the point of these challenges, to discover mistakes before they happen when it really matters!). I’ve detailed my mistake in the US Census challenge in my latest PyData presentation, “Mistakes I’ve Made”. Basically I ignored population variance and replaced it with machine learning egotism. Oh, I also remembered another project I would really love to go back to. In 2011, when I was doing research into stochastic processes, I started my first Python library (if you could even call it that) called PyProcess. You can still see it here: https://github.com/camdavidsonpilon/pyprocess.
Notice that it is, embarrassingly, one large file filled with Python classes. The first iteration didn’t even use NumPy! I would love to go back and redo the entire thing, but two things hold me back: 1) it was a lot of work to test each stochastic process and make sure they were doing the right thing, and 2) I’m too far out of the field now.
(Editor note: I personally used PyProcess during some of my Financial Mathematics coursework and always meant to try to add to the project, but never did)
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
2. If you’re not already learning and using Python or Scala, do that. Similarly, if you’re not already learning some software engineering, do that. What are some examples of data science software engineering?

  • writing (close to) professional-level code – thinking about proper abstractions, writing testable pieces, thinking about reusability
  • having code reviewed, and reviewing code yourself
  • writing tests

Why do I emphasize programming and software development so much? At a high level, data science is about using computers to do statistics for you. If you can’t properly use the former, then your most important tool in your toolbox is missing.
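
As an illustration of the ‘writing tests’ point, here is a minimal sketch of the kind of test worth having around data code (the helper function and file name are hypothetical; run with pytest):

    # test_cleaning.py
    import pandas as pd

    def drop_incomplete_rows(df, required):
        """Keep only rows where every required column is non-null."""
        return df.dropna(subset=required).reset_index(drop=True)

    def test_drop_incomplete_rows():
        df = pd.DataFrame({"age": [34, None, 51],
                           "income": [40_000, 52_000, None]})
        cleaned = drop_incomplete_rows(df, required=["age", "income"])
        assert len(cleaned) == 1
        assert cleaned.loc[0, "age"] == 34
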
3. What do you wish you knew earlier about being a data scientist?
3. I wish I, and the rest of the field, knew about data cleaning. This is an important part of the whole data story and is glossed over. Specifically, the ETL pipeline (extract-transform-load). What I used to do was use SQL for the T part, but this caused too many problems (untestable, unmaintainable, unscalable). Now that is done prior to me even using the data for anything remotely complicated. This saves me time later, and allows the entire team to scale and benefit from my work (yes, I am still writing ETLs – I expect all my team members to, too). The problem is, you can’t really teach ETLs until you have the data problem. Small companies (I mean really small companies) and tutorials online can assume data is fine. Not until one is submerged in changing data does the ETL process start to make sense. So, though I wish I knew this earlier, I probably couldn’t have learned it anyway!
4. How do you respond when you hear the phrase ‘big data’?
4. Sure, “Big Data” is a buzzword, but I think the issue with the name “Big Data” comes down to two camps: are you seeing “Big Data” as a solution (probably wrong) or as a problem (probably right)? For example, two common questions an organization might have are 1) find the number of unique visitors to our site in the past month, and 2) find me the median of this dataset. If your data is simply too big for memory, which is a good definition of big data, then we can’t solve either of these problems naively. What is really interesting about big data as a problem is the abundance of cool new algorithms and data structures being invented to solve these problems. For example, HyperLogLog estimates the number of unique values in a set of data too big for memory. And TDigest estimates the percentiles of data too big for memory (and hence can’t be sorted).
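
For a feel of what that looks like in practice, here is a minimal sketch using one third-party implementation (the datasketch package) on a simulated visitor stream; the stream and numbers are made up:

    import random
    from datasketch import HyperLogLog   # pip install datasketch

    random.seed(0)
    # Simulated event stream: two million page views from ~300k distinct visitors.
    stream = (f"user-{random.randint(0, 299_999)}" for _ in range(2_000_000))

    hll = HyperLogLog(p=14)              # 2**14 registers, roughly 1% error
    for visitor in stream:
        hll.update(visitor.encode("utf8"))

    print("approximate unique visitors:", int(hll.count()))
    # An exact answer needs a set of every id in memory; the sketch uses a
    # few kilobytes no matter how long the stream is.
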
5. What is the most exciting thing about your field?
5. I’ve already mentioned the interesting new algorithms for big data problems, so I won’t go over them again, but I do think they are very exciting. Another exciting thing is the new problems being discovered, and the solutions being used. For example, the problem of what to recommend to visitors of a site is a new problem that has massive impact, and is being solved with data. I can’t imagine Fisher or Pearson ever asking the question “what should I recommend next to this user?”. In a similar vein, we *are* seeing the re-emergence of classical statistics. Classical techniques like survival analysis, clinical trials, and logistic regression are seeing a major comeback because new problems have been identified.
6. How do you go about framing a data problem? 
6. Honestly, I try to turn it into a binomial problem. I use the beta-binomial model as a large crutch far too often, but it’s a really good initial model of a problem. If I can turn the problem into a binomial problem, then I have lots of tools I can work with: Bayesian analysis, sample-size appropriate ranking techniques, Bayesian Bandits, etc. If I can’t turn it into a binomial problem, I go through the rest of my toolbox: survival analysis, lifetime value, Bayesian modeling, classification, association analysis, etc. If I still can’t find an appropriate solution, then I have to expand my scope (and often learn a new tool while doing that).
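
As a sketch of the beta-binomial workhorse he describes – comparing two conversion rates with a conjugate Beta prior – here is a minimal example with invented counts:

    import numpy as np
    from scipy import stats

    # Hypothetical conversion counts for two site variants.
    trials_a, successes_a = 1_200, 90
    trials_b, successes_b = 1_150, 110

    # Beta(1, 1) prior + binomial likelihood -> Beta posterior (conjugacy).
    post_a = stats.beta(1 + successes_a, 1 + trials_a - successes_a)
    post_b = stats.beta(1 + successes_b, 1 + trials_b - successes_b)

    # Monte Carlo estimate of P(rate_B > rate_A).
    draws = 100_000
    prob_b_better = np.mean(post_b.rvs(draws) > post_a.rvs(draws))
    print(f"P(variant B converts better than A) ≈ {prob_b_better:.3f}")

Once a problem is in this form, ranking, Bayesian bandits and similar tools follow naturally.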

Interview with a Data Scientist: Jon Sedar

As part of my hugely successful Interviews with a Data Scientist feature, I interviewed Jon recently. Jon runs his own niche consultancy called Applied AI, which specialises in the insurance industry. He is involved in the data science meetup world in Dublin and London.
And I recommend his insights and blog.
1. What project that you have worked on do you wish you could go back to and do better?

I won’t name names, but throughout my career I’ve encountered projects – and indeed full-time jobs – where major issues have popped up not due to technologies or analysis, but due to ineffective communication, either institutional or interpersonal. Just to pick an example, one particular job was an analyst’s nightmare due to overbearing senior management and too-rapid engineering – the task was to produce KPIs of the company’s health, but the entire software and hardware stack changed so frequently that getting even the most basic information out was extremely hard work. That could have been fixed by stronger communication and pushback on my part – but my opinions weren’t accepted and it wasn’t to be. Another large project (of which I was only a very minor part) was scuppered due to mishandled client expectations and caused no end of overwork for the consulting team. Every project needs better communication, always.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
I’ll deal with these separately, since there are (or should be) different reasons why people are in each group.

To PhD candidates here I simply hope that they truly love their subject and are careful to gain commercially-useful skills along the way. I’ve friends who have completed PhDs, some who’ve quit midway, and some like me who considered it but instead returned to industry after an MSc. You might not plan to go into industry, but gaining the following skills is vital for academia too:

  • reproducible research (version control, data management, robust / testable / actually maintainable code)
  • lightweight programming (learn Python, it’s easy, able to do most things, always available, the packages are very well maintained and the community is very strong)
  • statistics (Bayesian, frequentist, whatever – make sure you have a really solid grasp of the fundamentals)
  • finally ensure you have proven capability in high-quality communication – and a dead-tree LaTeX publication doesn’t count. Get yourself blogging, answering questions on Stack Overflow, presenting at meetups and conferences, working with others, consulting in industry etc. As you improve upon this you’ll really distinguish yourself from the herd.

Also some flamebait: whilst I love the idea of improving humanity’s body of knowledge in the hard sciences, I’m not convinced that a PhD in the soft sciences is worthwhile nowadays, at least not straight out of school. If you want to research the humanities just take your degree and go work for a giant search engine / social network / online retailer; you’ll get real-world issues and massive study sizes from day one.

To the younger analytics professionals: regardless of the company or industry in which you find yourself, build up your skills as per the PhD advice above, polish your external profile (blogs, talks, research papers etc.) and don’t ever be afraid to jump ship and try a few things out. Try to have three months’ pay in your savings account, maintain your friendships local and international, and set up a basic vehicle for you to do independent contracting / consulting work.

Over the years I’ve tried a lot of different jobs in a few different locations. I felt happiest once I’d set up my own company and knew that I would always have a method to market my skills independent of anyone else. Data science skills are likely to be important for a good few years yet, so if you’re well-connected, well-respected and mobile, you can try a lot of things, find what you love, and will never be out of work for long.

3. What do you wish you knew earlier about being a data scientist?
Lots to unpack in that question! If I can call myself a scientist at all, then it’s an empiricist rather than a theoretician. As such I consider data to be the record of things that happen(ed) and science as the formalisation & generalisation of our understanding of those things. ‘Data scientist’ is thus a useful shorthand term for someone who specialises in learning from data, communicating insights and taking/recommending reasoned actions accordingly.

With that in mind, I’d advise my younger self to never forget that it’s that final step that matters most – allowing decision makers to take reasoned actions according to your well-communicated insights. That decision maker may be your client, your boss or even simply yourself, but without an effective application ‘data science’ is actually research & development – and chances are you’re not being paid to do R&D.
4. How do you respond when you hear the phrase ‘big data’?

I think we’re far enough along the hype cycle now that nearly all data science practitioners recognise both the possibilities and the constraints of performing large-scale analyses. Proper problem-definition and product-market fit are the most important to get right, and hopefully even your typical non-technical business leader is no longer bedazzled by the term and instead wants to see actionable insights that don’t require a major engineering project.

That said, I’m still happy to see experts in the field continue to preach that whilst gathering reams of ‘big’ data (which I take here to be primarily commercially-related data including interface interactions, system log files, audio, images, video feeds, positional info, live market movements etc.) can lead to something immensely powerful, it can easily become a giant waste of everyone’s time and resources.

Truly understanding the behaviour of a system/process, and properly cleaning, reducing and sub-sampling datasets are practices long-understood by the statistics community. A reasoned hypothesis tested with ‘small-medium’ data on a modest desktop machine beats blind number crunching any day.
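
To make the ‘reasoned hypothesis on small-medium data’ point concrete, here is a minimal sketch of a permutation test that runs comfortably on a laptop; the scenario and numbers are invented:

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical sub-sampled checkout times (seconds) for two site flows.
    flow_a = rng.exponential(scale=42.0, size=2_000)
    flow_b = rng.exponential(scale=39.0, size=2_000)
    observed_diff = flow_a.mean() - flow_b.mean()

    # Permutation test: is the observed gap bigger than random shuffles produce?
    pooled = np.concatenate([flow_a, flow_b])
    n_perm = 10_000
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)
        diffs[i] = pooled[:2_000].mean() - pooled[2_000:].mean()

    p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
    print(f"observed difference: {observed_diff:.2f}s, p-value: {p_value:.3f}")
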
5. What is the most exciting thing about your field?

Well, the tools for applying the analysis techniques, and the techniques themselves are certainly moving at a hell of a pace, but science & technology always does. I really enjoy having the opportunity to research and apply novel techniques to client problems.
More widely I’m excited to see the principles of gathering, maintaining and learning from data permeate all aspects of businesses and organizations. There are well-developed data science platforms popping up every day, new software packages to use, heavily over-subscribed meetup groups and conferences everywhere, and it’s great to see the formalisation and commoditization of certain technical aspects. Just as it’s unlikely that anyone would try today to run an enterprise without a website, a telephone or even an accountant, I expect that a data science capability will be at the core of most businesses in future.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I assume you mean an analytical problem rather than a data management problem or something else.
I think it’s quite simple really, and just common sense to ensure that you define well the analytical problem, and the inputs and outputs of your work. What question are we trying to answer? How should the answer be presented and how will it be used? What analysis and what data will let us provide insights based on that question? What data do we have and what analysis is possible / acceptable within our organisational and technical constraints? Then prototype, develop, communicate and iterate until baked.

7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?

As above, I think that in future the practice of gathering, maintaining and learning from data will be core to nearly all commercial and social enterprises. Bringing academic research to bear on real-world problems is just too useful, and those who rely on gut instinct or trivial analyses will be out-competed.
That said, I think we’re already seeing a definite split between data science (statistics, experimentation, prediction), data processing (large-scale systems development), and data engineering (acquiring, maintaining and making available high-quality data sources), and no doubt in future there will be more spin-out skills that take on a life of their own. The veritable zoo of job titles spawned from web development is a good example: UI designers, UX designers, javascript engineers, mobile app engineers, hosting and replication engineers etc etc.
Finally I’d just like to thank you for putting this series of interviews / blogposts together, it’s a really interesting resource, particularly as the data science industry is maturing.

Jon Sedar in his own words:

I’m currently calling myself a consulting data scientist, trained in physics and machine learning, with 10 years professional background in data analysis and management consulting. I co-manage a niche data science consultancy called Applied AI, operating primarily in the insurance sector throughout UK, Ireland and Europe. I’m also an organiser and volunteer within data-for-good social movements, and occasional speaker at tech and industry events.

More generally, I love science, technology, electronic music, visual arts, excessive coffee, and will never know enough maths. Hundreds of sci-fi stories have me convinced I was born years too early.

Interview with a Data ‘Scientist’: Matt Hall

Now it is time for another interview with a Data Scientist.
Today I decided to be a bit different and I interviewed a professional geophysicist, who runs a consultancy in Nova Scotia, Canada.
Let me introduce the interview candidate – Matt Hall has a PhD in sedimentology from the University of Manchester, UK, and 15 years’ experience in the hydrocarbon industry. He has worked for Landmark as a volume interpretation specialist, Statoil as an explorationist, and ConocoPhillips as a geophysical advisor. Matt has a broad range of interests, from signal processing to facies analysis, and from uncertainty modelling to knowledge sharing.
Personally, as someone with a Physics degree, I think it is really important that we in the data science community learn a lot from the scientific modelling community. There is a lot of shared knowledge out there, and it is one reason I am attending EuroSciPy in Cambridge this year.
1. What project that you have worked on do you wish you could go back to and do better?
1. If it wasn’t for the fact that revisiting old projects would mean missing out on future ones, I would say ‘All of them’. I learn something new on almost everything I work on, and on the downtime in between. I occasionally do go back and re-do things with new tools or insights, and often projects come back around naturally, in one form or another, but I leave a lot untouched. I can’t even imagine, for example, what I could do with almost every aspect of my postgraduate research with what I have today.
Oddly, and worryingly, the reverse is true too. Sometimes I wish I had the same facility with mathematics now as I did at the age of 20. There’s always stuff you forget — perhaps learned but did not actually need at the time, so you never applied it. I guess this means that learning is an x-steps-forward-y-steps-back situation. I just hope x > y.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
2. At the risk of giving a rather tired answer, the number one thing for anyone in science is: learn to program. While you’re doing that, learn a bit about the Linux shell (yes, this means dumping Windows as soon as possible) and version control. These are fundamental skills and it could not be easier to get started today. As for the language, pick Python or R, or maybe even JavaScript, or actually just anything.
Here’s something I wrote for geoscientists about learning to program: http://www.agilegeoscience.com/blog/2011/9/28/learn-to-program.html
3. What do you wish you knew earlier about being a data scientist?
3. I’m not a data scientist as such, but a scientist. Data science seems like a fun place to play with code and data, and I wish it had been a thing when I was finishing my education. I must say though, possibly horribly tangentially, that I sense the field of ‘data science’ seems in danger of moving so quickly that it trips over itself. ‘Data’ has to include an awareness of provenance, bias, and the statistics of sampling. ‘Science’ has to include rigorous testing, peer review, and reproducibility. And the analysis of data should build on the history of statistics, especially with respect to the handling of uncertainty. This isn’t to say ‘slow down’ or ‘respect your elders’, it’s more about not repeating the mistakes of the past with the scale and speed of today. I highly recommend joining the Royal Statistical Society, and reading every issue of Significance.
4. How do you respond when you hear the phrase ‘big data’?
4. I know some people freak out when they hear those words, but I don’t mind ‘big data’. It seems to communicate in shorthand something we all understand. Like any jargon, different people often mean different things, so one has to clarify. But it’s clearly taken off so it probably doesn’t matter what we think about it at this point.
I work in applied geophysics. Lots of people in the field roll their eyes at ‘big data’, grumbling that “we’ve been doing that for years”. But they have missed the point entirely. Big data isn’t just lots of data, it implies something about all the other components of the analytics cycles too — storage, retrieval, and analysis.
5. What is the most exciting thing about your field?
5. Geophysics is a fantastically hard big data problem. All interesting problems are inverse problems, and learning the geological history of the earth from some acoustic or electromagnetic data recorded at the surface is a very hard inverse problem. The earth is about as complex a system as you can get, so its equation (so to speak) has infinite variables, so any question you can ask is woefully underdetermined. As a bonus, geological hypotheses are incredibly hard to test on human timescales. And we never get to find out the actual answer by, say, visiting the Cretaceous.
Here’s a vertical and a roughly horizontal slice through a dataset offshore Nova Scotia, where I live… https://peadarcoyle.files.wordpress.com/2015/07/78c3f-hisforhorizon1.png
In an attempt to get reasonable answers, we collect huge amounts of data using ships and drills and dynamite and other exciting things. All the data came from the same earth, but of course every channel has its own response, its own noise, and its own bias. Reducing 200 billion data samples to an image of the earth is a hard optimization problem but, 60% of the time, it works every time.
What gets me excited every day is that we’ve only scratched the surface of what we can learn about the earth from seismic data.
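
As a toy illustration of what ‘woefully underdetermined’ means for an inverse problem like this (not real geophysics, just a made-up linear example), here are far fewer measurements than unknowns, with regularisation used to pick one preferred model:

    import numpy as np

    rng = np.random.default_rng(5)

    # Toy inverse problem: 30 measurements, 200 unknown model parameters.
    n_obs, n_unknowns = 30, 200
    G = rng.normal(size=(n_obs, n_unknowns))      # made-up forward operator
    true_model = np.zeros(n_unknowns)
    true_model[::20] = 1.0                        # a sparse "geology"
    data = G @ true_model + rng.normal(scale=0.05, size=n_obs)

    # Infinitely many models fit the data equally well; damped least squares
    # (ridge regularisation) picks the small-norm one.
    lam = 0.1
    recovered = np.linalg.solve(G.T @ G + lam * np.eye(n_unknowns), G.T @ data)

    print("data misfit :", round(float(np.linalg.norm(G @ recovered - data)), 3))
    print("model error :", round(float(np.linalg.norm(recovered - true_model)), 3))
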
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
6. I usually have a specific purpose to work towards, and often a time constraint as well. So you do what you can to deliver the required result (for me, this might be a set of maps, or a ranked list of locations) in the time available.
Scoping projects is hard. The concept of a ‘project’ is somewhat at odds with how scientists work — with a lot of unknowns. It’s better to be more agile — do something easy quickly, then review and iterate. This isn’t always compatible with working to time and resource constraints, especially if you’re working with people with a strong ‘waterfall’ project mindset.
You can’t beat meeting the client (there’s always a client — someone who needs your work) face to face. The more you can iterate on the plan, the better. And the more space you can leave for the stuff you don’t know yet, the better. There’s no point committing to something, then finding out that the available data can’t support it. Have contingencies for everything. Report back continuously, not after a problem has started eating things for breakfast. Be open and transparent about everything. All common sense, but not easy to stick to once you get stuck in.

On the Cultural divide between Data Scientists and Managers


My friend Chris recently quoted me from Twitter on his Blog.

I don’t have much to add to this – other than that I feel the article is fair and correct, and it discusses my own challenges.

Ian Ozsvald has a good keynote on this, from PyCon Sweden, where he talks about the need to add value every day.

I believe he has a good talk coming up at PyData London about the challenges of shipping it, which I’ll also be speaking at! I’ll be giving a talk on PyMC and PyMC3.

I recommend my Interviews with a Data Scientist page for a discussion of this.