Interview with a Data Scientist: Phillip Higgins


Phillip Higgins is a data science consultant based in New Zealand. His experience spans financial services and a stint at SAS, among other roles, some of them in Germany.

What project have you worked on that you wish you could go back and do better?

Hindsight is a wonderful thing; we can always find things we could have done better in projects. On the other hand, analytic and modelling projects are often fraught with uncertainty – uncertainty that, despite the best planning, cannot be foreseen. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop deep knowledge of a particular area and, at the same time, to broaden their knowledge, maintaining this focus on both specialised and general subjects throughout their careers. Secondly, it’s important to gain as much practice as possible – data science is precisely that because it deals with real-world problems. I think PhD students should cultivate industry contacts and network widely – staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?
Undoubtedly, I wish I had known the importance of communication skills in the whole analytics life-cycle. It’s particularly important to be able to communicate findings to a wide audience, so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists new possibilities in terms of the work they are able to perform and the significance of their work. I don’t think it’s a coincidence that the importance of and demand for data scientists has risen sharply right at the time that Big Data has become mainstream – for Big Data to yield insights, “Big Analytics” must be performed. They go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think it’s important never to lose sight of the business objectives that are the rationale for most data-scientific projects. Although it is essential that businesses allow data science to disprove hypotheses, at the end of the day most of the evidence will be proving hypotheses (or disproving the null hypothesis). The path to formulating those hypotheses lies mostly in exploratory data analysis (combined with domain knowledge). It is important to communicate this uncertainty about framing from the outset, so that there are no surprises.
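To make the hypothesis-testing point concrete, here is a small illustrative sketch – not from the interview, with invented counts – of testing a null hypothesis of ‘no difference’ between two variants with a two-proportion chi-squared test:

```python
# Toy example of trying to disprove a null hypothesis: did a new checkout
# flow change the conversion rate? All counts below are made up.
from scipy import stats

converted = [200, 255]   # conversions: control vs. variant (hypothetical)
visitors = [5000, 5100]  # visitors shown each variant (hypothetical)

# 2x2 contingency table of converted vs. not-converted per variant.
table = [[converted[0], visitors[0] - converted[0]],
         [converted[1], visitors[1] - converted[1]]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print("p-value = %.4f" % p)  # a small p-value is evidence against the null
```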

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders, and that’s certainly an enjoyable aspect of the job. I have dealt with a wide range of stakeholders, from C-level executives through to mid-level managers and analysts, and each group requires a different approach. A stakeholder analysis matrix is a good place to start – analysing stakeholders by importance and influence. Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.


Interview with a Data Scientist: Trey Causey

Trey Causey is a blogger with experience as a professional data scientist in sports analytics and e-commerce. He’s got some fantastic views about the state of the industry, and I was privileged to hear them.
1. What project have you worked on that you wish you could go back to, and do better?
The easy and honest answer would be to say all of them. More concretely, I’d love to have had more time to work on my current project, the NYT 4th Down Bot, before going live. The mission of the bot is to show fans that there is an analytical way to go about deciding what to do on 4th down (in American football), and that the conventional wisdom is often too conservative. Doing this means you have to really get the “obvious” calls correct as close to 100% of the time as possible, but we all know how easy it is to wander down the path to overfitting in these circumstances…
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?
Students should take as many methods classes as possible. They’re far more generalizable than substantive classes in your discipline. Additionally, you’ll probably meet students from other disciplines, and that’s how constructive intellectual cross-fertilization happens. Also, learn a little bit about software engineering (as distinct from learning to code). You’ll never have as much time as you do right now for things like learning new skills, languages, and methods.
For young professionals, seek out someone more senior than yourself, either at your job or elsewhere, and try to learn from their experience. A word of warning, though: it’s hard work and a big obligation to mentor someone, so don’t feel too bad if you have a hard time finding someone willing to do this at first. Make it worth their while and don’t treat it as your “right” that they spend their valuable time on you. I wish this didn’t even have to be said.
3. What do you wish you knew earlier about being a data scientist?
It’s cliché to say it now, but: how much of my time would be spent getting data, cleaning data, fixing bugs, trying to get pieces of code to run across multiple environments, etc. The “nuts and bolts” aspect takes up so much of your time, but it’s what you’re probably least prepared for coming out of school.
4. How do you respond when you hear the phrase ‘big data’?
Indifference.
5. What is the most exciting thing about your field?
Probably that it’s just beginning to even be ‘a field.’ I suspect in five years or so, the generalist ‘data scientist’ may not exist as we see more differentiation into ‘data engineer’ or ‘experimentalist’ and so on. I’m excited about the prospect of data scientists moving out of tech and into more traditional companies. We’ve only really scratched the surface of what’s possible or, amazingly, not located in San Francisco.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
A difficult question along the lines of “how long is a piece of string?” I think the key is to communicate early and often, define success metrics as much as possible at the *beginning* of a project, not at the end of a project. I’ve found that “spending too long” / navel-gazing is a trope that many like to level at data scientists, especially former academics, but as often as not, it’s a result of goalpost-moving and requirement-changing from management. It’s important to manage up, aggressively setting expectations, especially if you’re the only data scientist at your company.
7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?
Honestly, I don’t believe I’ve met any executives who were dubious about the value of data or data science. The challenge is often either a) to temper unrealistic expectations about what is possible in a given time frame (we data scientists mostly have ourselves to blame for this) or b) to convince them to stay the course when the data reveal something unpleasant or unwelcome.
8. What is the most exciting thing you’ve been working on lately? Tell us a bit about it.
I’m about to start a new position as the first data scientist at ChefSteps, which I’m very excited about, but I can’t tell you about what I’ve been working on there as I haven’t started yet. Otherwise, the 4th Down Bot has been a really fun project to work on. The NYT Graphics team is the best in the business and is full of extremely smart and innovative people. It’s been amazing to see the thought and time that they put into projects.
9. What is the biggest challenge of leading a data science team?
I’ve written a lot about unrealistic expectations that all data scientists be “unicorns” and be experts in every possible field, so for me the hardest part of building a team is finding the right people with complementary skills that can work together amicably and constructively. That’s not special to data science, though.

Interview with a Data Scientist: Thomas Wiecki


I interviewed Thomas Wiecki recently – Thomas is Data Science Lead at Quantopian Inc., a crowd-sourced hedge fund and algotrading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year – which I found so fascinating that I decided to learn some PyMC3 🙂

1. What project have you worked on that you wish you could go back to, and do better?
While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set, and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon (https://code.google.com/p/pynopticon/wiki/Introduction), even though it never gained any traction. I spent a lot of time developing a streamed data-piping mechanism that was pretty general and flexible, in anticipation of large data sets. In retrospect, though, it was overkill, and I should have spent less time coming up with the best solution and more time improving usability! Resources are limited, and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Spend time learning the basics. This will make more advanced concepts much easier to understand, as they’re merely extensions of core principles and integrate much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive into deeply. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So, if possible, I would advise people to study these things while still in school, as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and carry a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s Doing Bayesian Data Analysis; for fundamentals, “The Elements of Statistical Learning” is a classic).
3. What do you wish you knew earlier about being a data scientist?
How important non-technical skills are. Communication is key, but so are understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that you communicate to an expert audience. This certainly will not be the case in industry, where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.
As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.
4. How do you respond when you hear the phrase ‘big data’?
As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.
Then there’s the more technical interpretation, where it means that data increases in size and some data sets do not fit into RAM anymore. I’m still undecided on whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like Hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of the time, as a data scientist, I think you can get by by sub-sampling the data (Andreas Müller has a great talk on how to do ML on large data sets: https://www.youtube.com/watch?v=l43VIw5xhTg).
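To make the sub-sampling idea concrete, here is a minimal sketch – my illustration, not Thomas’s code; the file name ‘clicks.csv’ and the ‘clicked’ column are hypothetical, and the features are assumed to be numeric – of streaming a too-big-for-RAM CSV in chunks and fitting a model on a random sample:

```python
# Stream a large CSV in chunks, keep a 1% random sample of each chunk,
# and fit a simple model on the sample. File and columns are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

chunks = pd.read_csv("clicks.csv", chunksize=100000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=0) for chunk in chunks)

X, y = sample.drop(columns="clicked"), sample["clicked"]
model = LogisticRegression().fit(X, y)
```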
Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. With “big data”, however, the advanced inference algorithms fail to scale, so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt: https://www.youtube.com/watch?v=pHsuIaPbNbY
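For readers who haven’t seen Probabilistic Programming, here is a minimal PyMC3 sketch of the kind of medium-data analysis meant here – my illustration with made-up data, not code from the interview:

```python
# Bayesian inference for a coin's bias with PyMC3: a Beta prior, a Binomial
# likelihood, and MCMC sampling. The observed counts are invented.
import pymc3 as pm

heads, flips = 63, 100  # hypothetical observations

with pm.Model():
    p = pm.Beta("p", alpha=1, beta=1)                 # flat prior on the bias
    pm.Binomial("obs", n=flips, p=p, observed=heads)  # likelihood of the data
    trace = pm.sample(2000)                           # MCMC inference

print(pm.summary(trace))                              # posterior summary for p
```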
5. What is the most exciting thing about your field?
The fast pace the field is moving at. It seems like every week there is another cool tool announced. Personally, I’m very excited about the blaze ecosystem, including dask, which takes a very elegant approach to distributed analytics by relying on existing functionality in well-established packages like pandas, instead of trying to reinvent the wheel. Data visualization is also coming along quite nicely, where the current frontier seems to be interactive web-based plots and dashboards, as worked on by bokeh, plotly and pyxley.
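As a taste of that pandas-flavoured approach, here is a tiny dask sketch – my example; the ‘sales-*.csv’ files and column names are hypothetical:

```python
# The same groupby you would write in pandas, evaluated lazily in parallel
# across many partitions/files. File glob and columns are hypothetical.
import dask.dataframe as dd

df = dd.read_csv("sales-*.csv")                  # lazy, partitioned dataframe
revenue = df.groupby("region")["amount"].sum()   # builds a task graph only
print(revenue.compute())                         # triggers parallel execution
```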
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
I try to keep the loop between data analysis and communication with consumers very tight. This also extends to any software for performing certain analyses, which I try to place into the hands of others even if it’s not perfect yet. That way there is little chance of veering too far off track, and there is a clearer sense of how usable something is. I suppose it’s borrowing from the agile approach and applying it to data science.

Interview with a Data Scientist: Erin Shellman

I recently caught up with Erin for an interview. Her interview is full of nice pieces of hard-earned advice, and her final answer on Data Governance is gold!
Erin writes some great posts on her blog, which I recommend. Erin is a programmer + statistician working as a research scientist at Amazon Web Services. Before that she was a Data Scientist in the Nordstrom Data Lab, where she primarily built product recommendations for Nordstrom.com. She mostly codes in Scala, Python and R, but dabbles in JavaScript to put data on the internet. Erin loves to teach and speak, and does both often through talks, as co-organizer of PyLadies-Seattle, and as an instructor at the University of Washington’s Professional and Continuing Education program.
 
1. What project have you worked on that you wish you could go back to, and do better?
Often the goal of data science projects is to automate processes with data – I worked on a lot of projects at Nordstrom with that goal. I think we were pretty naive in those pursuits, often approaching the problems with low empathy and EQ (Emotional Quotient). We built tools expecting that the teams we were trying to automate would immediately see the value and jump to use them, but we didn’t spend a lot of time listening and trying to understand why some might be hesitant to adopt our tools. Eventually, I started training people and specifically asking them to send bug reports or feature requests. The trainings opened up a dialog about our plans and made the other teams more invested, because they could see when their bugs were fixed and their features implemented. I learned that doing the data work is only half (or less) of the challenge; the other half is advocating for your work in such a way that others are similarly compelled.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
If you’re in school right now, use this time to master a programming language (you have more time than you ever will again despite what you may believe). For data science, I’d recommend Python, R or Scala (and if you had to choose one, Python). You absolutely need to be able to produce high-quality code before you walk in the door because chances are you’ll be asked to code early in the interview process.
I also think you shouldn’t spend too much time “training” and learning in your free time; it’s nearly impossible to retain knowledge that way. Instead, spend all your time shoring up the essentials and work on getting a job immediately. You’ll learn so much more on the job than you could ever hope to on your own, plus you’ll be paid. Don’t wait for postings for junior data scientists (I don’t know that I’ve ever even seen one); contact employers you’re interested in working with directly and ask them to create that role for you. You should look for places where you know there’s a solid data team already, so you have plenty of people to learn from. Academics tend to have a sort of learned helplessness because they’re so often not in control of their work or careers. This is not the case in industry: if you want something, don’t wait for it to come to you (it won’t). Be an active participant in your future.
3. What do you wish you knew earlier about being a data scientist?
I wish I had spent more time in grad school learning computer science. Often DS (Data Science) jobs end up being almost the same as CS (Computer Science) jobs, and in my case I had to pick up a lot of CS skills on the job.
4. How do you respond when you hear the phrase ‘big data’?
Usually by rolling my eyes so far into the back of my head that they get stuck. I think the return on investment of Any Data is still higher than that of Big Data. Most shops who’re convinced that they need big data technology don’t make use of the data they have already, and adding more data to the pile won’t help the cause.
5. What is the most exciting thing about your field?
The most exciting thing is that I get to learn for a living. Every time I switch jobs or work on something new I have to learn a ton, different technologies and languages, different domains, and different businesses. I especially love that data science is often so close to the business. I love learning about what makes a business successful and providing knowledge to help businesses make better decisions.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
When I’m approaching a new problem I focus really hard on the inputs and outputs, particularly the output. What exactly are you trying to produce, or trying to answer? This is often a question I pose to business stakeholders to encourage them to think critically about what they really want to know, how it will be applied, and how to formally articulate it. Basically what I encourage them to do is state a formal hypothesis and the observations required to test that hypothesis. Once we’ve all agreed on the output, what are the inputs? I try to make this as specific as possible, so no “customer data”-level descriptions. Tell me exactly what the inputs are, e.g. annual customer spend, age, and zip code. The more you can reason through the solution in terms of inputs and outputs before you set out to solve the problem, the less likely it will be that you’re halfway to answering a question that was ill-posed (I promise, this is 90% of requests), or that you don’t have the data to support (this is probably another 5% of requests). It’s also a good way to prevent “stakeholder punting”, which is a phrase I made up just now to describe when stakeholders make half-baked requests and then leave them for you to sort out. Data science and research is highly collaborative, and the data scientist shouldn’t be the only one invested in the work.
Once the inputs and outputs are defined, I like to draw flowcharts of the path to completion, and it’s usually easier to start from the bottom. Here’s an example I created for the students in my data mining course. They were working on prediction of a continuous outcome with various regression methods. First we decided on a criterion for model selection, which in this case was the model with the lowest root mean squared error. You can see that the input is a data file, and the output is whichever model had the best predictive accuracy as measured by the lowest RMSE (Root Mean Square Error). For me, diagramming your work like this makes your goal completely concrete.
[Flowchart: input data file → candidate regression models → select the model with the lowest RMSE]
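A minimal code sketch of that flow – my illustration rather than the course material; the file and column names are hypothetical – fitting several regression methods and keeping the one with the lowest RMSE on held-out data:

```python
# Fit several candidate regressors and select the one with the lowest RMSE
# on a held-out split. 'course_data.csv' and 'target' are hypothetical names.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("course_data.csv")
X, y = df.drop(columns="target"), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "forest": RandomForestRegressor(random_state=0),
}
rmse = {name: np.sqrt(mean_squared_error(y_te, m.fit(X_tr, y_tr).predict(X_te)))
        for name, m in models.items()}
best = min(rmse, key=rmse.get)   # the lowest RMSE wins
print(best, rmse[best])
```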
The other really great thing about framing problems this way is that it makes it very easy to estimate effort and communicate to others what is required to complete the project. For whatever reason, people often assume that while software engineers need 2 weeks to add a minor feature, data scientists need about 6 hours to do complete analyses and make beautiful visualizations. Communicating the amount of work required to complete projects to the requesters is crucial in data science, because most people just don’t know. It’s not something software engineers typically have to do, but providing guidance on the components of a data science project to your stakeholders will reduce your stress in the long run.
7. What does data governance or data quality mean to you as a data scientist?
Data governance is the collection of processes and protocols to which an organization conforms to ensure data accuracy and integrity. Most of the time I’m a data consumer, so I depend on a mature data infrastructure team to create the pipelines I use to collect and analyze data. When I was working on recommendations at Nordstrom, I was a consumer and provider. I provided data in the sense that the output of my recommendation algorithms was data consumed by the web team. Data governance in that context meant writing lots of unit tests to make sure the results of my computations produced correctly formatted entries. It also meant applying business rules, for example, removing entries for products out of stock, or applying brand restrictions.
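A small hypothetical sketch of what such tests might look like – my illustration, not Erin’s actual code:

```python
# pytest-style checks on recommender output: entries must match the expected
# schema, and business rules (drop out-of-stock products) must be applied.
def apply_business_rules(recs, in_stock):
    """Business rule: remove recommendations for out-of-stock products."""
    return [r for r in recs if r["product_id"] in in_stock]

def test_entries_are_correctly_formatted():
    recs = [{"product_id": "sku-1", "score": 0.9}]  # stand-in for real output
    for r in recs:
        assert set(r) == {"product_id", "score"}    # expected schema
        assert 0.0 <= r["score"] <= 1.0             # scores stay in [0, 1]

def test_out_of_stock_products_are_removed():
    recs = [{"product_id": "sku-1", "score": 0.9},
            {"product_id": "sku-2", "score": 0.8}]
    assert apply_business_rules(recs, in_stock={"sku-1"}) == [
        {"product_id": "sku-1", "score": 0.9}]
```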

Interview with a Data Scientist: Peadar Coyle


Peadar Coyle is a Data Analytics professional based in Luxembourg. His intellectual background is in Mathematics and Physics, and he currently works for Vodafone in one of their Supply Chain teams.

He is passionate about data science and the lead author of this project. He also contributes to Open Source projects and speaks at EuroSciPy, PyData and PyCon.

His expertise is largely in the statistical side of Data Science.

Peadar was asked by several of his interviewees to share his own interview, so he humbly does so here.

  1. What project have you worked on that you wish you could go back to, and do better?

I agree that it is better to look forward rather than look backward. And my skills have frankly improved since I first started doing what we could call professional data analysis (which was probably just before starting my Masters a few years ago).

One project that springs to mind (and I’m not naming names) is one where there was a huge breakdown in communication and misaligned incentives. There needed to be more communication on that project, and it overran its initially allotted time. I also didn’t spend enough time up front communicating the risks and opportunities to the stakeholders.

The data was a lot messier than expected, and management had committed to delivering results in 2 weeks. This was impossible; the data cleaning and exploration phase took too long. Now I would focus on quicker wins. I also rushed to the ‘modelling’ phase without really understanding the data. I think terms like ‘understanding the data’ sound a bit academic to some stakeholders, but you need to clearly and articulately explain how important the data generation process is, and the uncertainty in that data.

Some of this comes from experience – now I focus on adding value as quickly as possible and keeping things simple. There I fell for the siren call of ‘do more analysis’ rather than thinking about how the analysis is conveyed.

  2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD but I have recently been giving advice to people in that situation.

My advice is to build a portfolio of work if possible, or at least to work through an online course on Machine Learning or something cool like that.

The PyData videos are also a good place to start. I’d recommend taking outreach or communication-skills courses if you can; many universities around the world offer them, and they’ll help you understand the needs of others.

I think frankly that the most important skill for a Data Scientist is the ‘tactical application of empathy’, and that is something that working in a team really helps you develop. One thing my Masters left me short on – as is common in Pure Mathematics – was experience of working in a team.

  3. What do you wish you knew earlier about being a data scientist?

The focus on communication skills, and the need to add value every day. The fact that a budget or a project can be terminated at any moment.

Adding value every day means showing results and sharing them, talking to people about stuff. Share visualizations, and share results – a lot of data science is about relationships and empathy. In fact I think that the tactical application of empathy is the greatest skill of our times.

You need to get out there and speak to the domain specialist, and understand what they understand. I believe that the best algorithms incorporate human as well as machine intelligence.

  4. How do you respond when you hear the phrase ‘big data’?

I too like the distinction between small, medium and big data. I don’t worry so much about the terminology, and I focus on understanding exactly what my stakeholder wants from it.

I think, though, that it is often a distraction. I did one proof of concept as a consultant that was an operational disaster. We didn’t have the resources to support a devops culture, nor did we have the capabilities to support a Hadoop cluster. Even worse, the problem could really have been solved more intelligently in RAM. But I got excited by the new tools without understanding what they were really for.

I think this is a challenge; part of my maturing as an engineer/data scientist is appreciating the limits of tools and avoiding the hype. Most companies don’t need a cluster, and the mean size of a cluster will remain one for a long time. Don’t believe the salesmen, and ask the experts in your community about what is needed.

In short: I do feel it is strongly misleading but it is certainly here to stay.

  5. How did you end up being a data analyst? What is the most exciting thing about your field?

My academic and professional career has taken a bit of a weird path. I started at Bristol in a Physics and Philosophy program. It was a really exciting time, and I learned a lot (some of it non-academic). I went into that program because I wanted to learn everything. At various points – especially in 2009-2010 – the terminology of ‘data science’ began to pick up, and when I went to grad school in 2010 I was ‘aware’ of the discipline. I took a lot of financial maths classes at Luxembourg, just to keep that option open, yet I still in my heart wanted to be an academic.

Eventually, after some soul searching, I realized that academic opportunities were going to be too difficult to get, and that I could earn more in industry. So I did a few industrial internships, including one at import.io, and towards the end of my Masters a 6-month internship at a ‘small’ e-commerce company called Amazon.com.

I learned a lot at Amazon.com, and it was there that I realized I needed to work a lot harder on my software engineering skills. I’ve been working on them in my working life, through contributing to open source software and through my various speaking engagements. I strongly recommend that any wannabe data geeks come to these events and share their own knowledge 🙂

The most exciting thing about my field relates to my first point about physics and philosophy – we truly are drowning in data, and with the computational resources we now have, we really do have the ability to answer or simulate certain questions in a business context. The web is a microscope, and your ERP system tells you more about your business than you can actually imagine – I’m very excited to help companies exploit their data.

  6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I like the OSEMIC framework (which I developed myself) and the CoNVO framework (which comes from Thinking with Data by Max Shron – I recommend the following video for an intro, and the book itself).

Let me explain – at the beginning of an ‘engagement’ I look for the Context, Need, Vision and Outcome of the project. Outcome means the delivery; asking these questions in conversation with stakeholders is a really good way to get to solving the ‘business problem’.

A lot of this after a few years in the business still feels like an art rather than a science.

I like explaining to people the Data Science process – obtain data, scrub data, explore, model, interpret and communicate.

I think a lot of people get these kinds of notions and a lot of my conversations recently at work have been about data quality – and data quality really needs domain knowledge. It is amazing how easy it is to misinterpret a number – especially around things like unit conversion etc.

  7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

A lot of the stuff above applies here. One challenge is that some places aren’t ready for a data scientist, nor do they know how to use one. I would avoid such places and look for work elsewhere.

Some of this is a lack of vision, and one reason I do a lot of talks is to do ‘educated selling’ about the gospel of data-informed decision making and how the new tools such as the PyData stack and R are helping us extract more and more value out of data.

I’ve also found that visualizations help a lot; humans react to stories and pictures more than to numbers.

My advice to new starters is to over-communicate, and to learn some soft skills. The frameworks I mentioned help a bit in structuring and explaining a project to stakeholders. I recommend also reading this interview series; I learned a lot from it too 🙂

Interview with a Data Scientist: Ian Ozsvald


Ian Ozsvald is a Data Scientist based in London. He’s a friend and an inspiration to all us data geeks. He’s a co-organizer of PyData in London and speaks a lot on the data science circuit. He’s also very tall 🙂

I include a bio at the bottom.

1. What project have you worked on that you wish you could go back to, and do better?
My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project, and the client provided representative data according to the specification that I’d laid out. I built a set of classifiers that performed as well as a human, and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn’t solve the task on this new, very messy data, and I couldn’t imagine how the machine would solve it. The client explained that they wanted the first task to succeed, so they gave me the best data they could find; since we’d solved that problem, now I could work on the harder stuff. I tried my best to explain the requirements of the derisking project, but fear I didn’t give a deep enough explanation of why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the needs of a derisking phase.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years’ experience in each industrial domain you’ll work in. None of this, however, is realistic. Instead, focus on some areas that interest you and that pay well enough, and deepen your skills so that you’re valuable. Next, go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high-quality free tools. For me, I speak, teach and keynote at conferences like PyDatas, PyCons, EuroSciPys and EuroPythons around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

3. What do you wish you knew earlier about being a data scientist?
I wish I knew how much I’d regret not paying attention in classes on statistics and linear algebra! I also wish I’d appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually; they don’t work well from lists of numbers.
4. How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine and probably you can represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
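A toy illustration of the first trick – my example, not taken from the book: a mostly-zero counts matrix stored sparsely instead of densely.

```python
# A mostly-zero user x product matrix: dense storage costs ~400 MB, while a
# sparse representation keeps only the ~1,000 non-zero entries.
import numpy as np
from scipy import sparse

dense = np.zeros((10000, 5000))
rows = np.random.randint(0, 10000, size=1000)
cols = np.random.randint(0, 5000, size=1000)
dense[rows, cols] = 1.0

sp = sparse.csr_matrix(dense)
print(dense.nbytes)                                           # 400,000,000 bytes
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # tens of kilobytes
```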

5. What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity; I strongly suspect that we can make this task machine-powered using some supervised approaches, so that less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining, I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough but first you have to derisk it and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value, this helps everyone stay confident when you hit the inevitable problems.

7. You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you’re doing. Make bits of it easy to understand, and explain why it is valuable, and people will buy into it. Don’t hide behind your models; instead, speak to domain experts, learn about their expertise, and use your models to back up and automate their judgement – you’ll want them on your side.
8. You have a cool startup – can you comment on how important it is, as a CEO, to make a company such as that data-driven or data-informed?

My consultancy (ModelInsight.io) helps companies to exploit their data so we’re entirely data-driven! If a company has figured out that it has a lot of data and it could steal a march on its competitors by exploiting this data, that’s where we step in. A part of the reason I speak internationally is to help companies think about the value in their data based on the projects we’ve worked on previously.

Bio: 

My name is Ian Ozsvald. I’m an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

I take on work in my Artificial Intelligence consultancy (Mor Consulting Ltd.) and I also author The Artificial Intelligence Cookbook – learn how to add clever algorithms to your software to make it smarter! One of my mobile products is SocialTies (built with RadicalRobot).

I co-founded ShowMeDo.com in 2005; it is all about tutorial screencasts that teach you programming – see About ShowMeDo for more info. This was my second company, and I’m rather proud to say that it is financially self-sufficient, growing, and full of very useful user-generated (and us-generated) content. 100,000 users and 1TB of data served per month say that we built something very useful indeed. In 5 years ShowMeDo has educated over 3 million people about open source tools.

I’m also co-founder of the £5 Apps Meetup, OpenCoffee Sussex and the BrightonDigital mail list (RIP).

Previously I’ve worked as Senior Programmer at Algorithmix (now Corpora) and the MASA Group; these jobs came via my MSc in Artificial Intelligence at Sussex University. See my LinkedIn profile.

Interview with a Data Scientist: Jon Sedar

As part of my hugely successful Interviews with a Data Scientist feature, I interviewed Jon recently. Jon runs his own niche consultancy called Applied AI, which specialises in the Insurance industry. He is involved in the Data Science meetup world in Dublin and London.
I recommend his insights and his blog.
1. What project have you worked on that you wish you could go back to, and do better?

I won’t name names, but throughout my career I’ve encountered projects – and indeed full-time jobs – where major issues have popped up not due to technologies or analysis, but due to ineffective communication, either institutional or interpersonal. Just to pick an example, one particular job was an analyst’s nightmare due to overbearing senior management and too-rapid engineering – the task was to produce KPIs of the company’s health, but the entire software and hardware stack changed so frequently that getting even the most basic information out was extremely hard work. That could have been fixed by stronger communication and pushback on my part, but my opinions weren’t accepted and it wasn’t to be. Another large project (of which I was only a very minor part) was scuppered due to mishandled client expectations and caused no end of overwork for the consulting team. Every project needs better communication, always.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
I’ll deal with these separately, since there are (or should be) different reasons why people are in each group.

To PhD candidates here I simply hope that they truly love their subject and are careful to gain commercially-useful skills along the way. I’ve friends who have completed PhDs, some who’ve quit midway, and some like me who considered it but instead returned to industry after an MSc. You might not plan to go into industry, but gaining the following skills is vital for academia too:

  • reproducible research (version control, data management, robust / testable / actually maintainable code)
  • lightweight programming (learn Python, it’s easy, able to do most things, always available, the packages are very well maintained and the community is very strong)
  • statistics (Bayesian, frequentist, whatever – make sure you have a really solid grasp of the fundamentals)
  • finally ensure you have proven capability in high-quality communication – and a dead-tree LaTeX publication doesn’t count. Get yourself blogging, answering questions on Stack Overflow, presenting at meetups and conferences, working with others, consulting in industry etc. As you improve upon this you’ll really distinguish yourself from the herd.

Also some flamebait: whilst I love the idea of improving humanity’s body of knowledge in the hard sciences, I’m not convinced that a PhD in the soft sciences is worthwhile nowadays, at least not straight out of school. If you want to research the humanities just take your degree and go work for a giant search engine / social network / online retailer; you’ll get real-world issues and massive study sizes from day one.

To the younger analytics professionals: regardless of the company or industry in which you find yourself, build up your skills as per the PhD advice above, polish your external profile (blogs, talks, research papers etc.) and don’t ever be afraid to jump ship and try a few things out. Try to have 3 months’ pay in your savings account, maintain your friendships local and international, and set up a basic vehicle through which you can do independent contracting / consulting work.

Over the years I’ve tried a lot of different jobs in a few different locations. I felt happiest once I’d set up my own company and knew that I would always have a method to market my skills independent of anyone else. Data science skills are likely to be important for a good few years yet, so if you’re well-connected, well-respected and mobile, you can try a lot of things, find what you love, and will never be out of work for long.

3. What do you wish you knew earlier about being a data scientist?
Lots to unpack in that question! If I can call myself a scientist at all, then it’s an empiricist rather than a theoretician. As such, I consider data to be the record of things that happen(ed) and science to be the formalisation & generalisation of our understanding of those things. ‘Data scientist’ is thus a useful shorthand term for someone who specialises in learning from data, communicating insights and taking/recommending reasoned actions accordingly.

With that in mind, I’d advise my younger self to never forget that it’s that final step that matters most – allowing decision makers to take reasoned actions according to your well-communicated insights. That decision maker may be your client, your boss or even simply yourself, but without an effective application ‘data science’ is actually research & development – and chances are you’re not being paid to do R&D.
4. How do you respond when you hear the phrase ‘big data’?

I think we’re far enough along the hype cycle now that nearly all data science practitioners recognise both the possibilities and the constraints of performing large-scale analyses. Proper problem-definition and product-market fit are the most important to get right, and hopefully even your typical non-technical business leader is no longer bedazzled by the term and instead wants to see actionable insights that don’t require a major engineering project.

That said, I’m still happy to see experts in the field continue to preach that whilst gathering reams of ‘big’ data (which I take here to be primarily commercially-related data including interface interactions, system log files, audio, images, video feeds, positional info, live market movements etc.) can lead to something immensely powerful, it can easily become a giant waste of everyone’s time and resources.

Truly understanding the behaviour of a system/process, and properly cleaning, reducing and sub-sampling datasets are practices long-understood by the statistics community. A reasoned hypothesis tested with ‘small-medium’ data on a modest desktop machine beats blind number crunching any day.
5. What is the most exciting thing about your field?

Well, the tools for applying the analysis techniques, and the techniques themselves are certainly moving at a hell of a pace, but science & technology always does. I really enjoy having the opportunity to research and apply novel techniques to client problems.
More widely, I’m excited to see the principles of gathering, maintaining and learning from data permeate all aspects of businesses and organizations. There are well-developed data science platforms popping up every day, new software packages to use, heavily over-subscribed meetup groups and conferences everywhere, and it’s great to see the formalisation and commoditization of certain technical aspects. Just as it’s unlikely that anyone would try today to run an enterprise without a website, a telephone or even an accountant, I expect that a data science capability will be at the core of most businesses in future.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I assume you mean an analytical problem rather than a data management problem or something else.
I think it’s quite simple really, and just common sense, to ensure that you define the analytical problem well, along with the inputs and outputs of your work. What question are we trying to answer? How should the answer be presented, and how will it be used? What analysis and what data will let us provide insights based on that question? What data do we have, and what analysis is possible / acceptable within our organisational and technical constraints? Then prototype, develop, communicate and iterate until baked.

7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?

As above, I think that in future the practice of gathering, maintaining and learning from data will be core to nearly all commercial and social enterprises. Bringing academic research to bear on real-world problems is just too useful, and those who rely on gut instinct or trivial analyses will be out-competed.
That said, I think we’re already seeing a definite split between data science (statistics, experimentation, prediction), data processing (large-scale systems development), and data engineering (acquiring, maintaining and making available high-quality data sources), and no doubt in future there will be more spin-out skills that take on a life of their own. The veritable zoo of job titles spawned from web development is a good example: UI designers, UX designers, javascript engineers, mobile app engineers, hosting and replication engineers etc etc.
Finally I’d just like to thank you for putting this series of interviews / blogposts together, it’s a really interesting resource, particularly as the data science industry is maturing.

Jon Sedar in his own words:

I’m currently calling myself a consulting data scientist, trained in physics and machine learning, with 10 years professional background in data analysis and management consulting. I co-manage a niche data science consultancy called Applied AI, operating primarily in the insurance sector throughout UK, Ireland and Europe. I’m also an organiser and volunteer within data-for-good social movements, and occasional speaker at tech and industry events.

More generally, I love science, technology, electronic music, visual arts, excessive coffee, and will never know enough maths. Hundreds of sci-fi stories have me convinced I was born years too early.