Interview with a Data Scientist: Greg Linden

Standard
I caught up with Greg Linden via email recently
Greg was one of the first people to work on data science in Industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I’ll quote his bio from Linkedin:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”
img_6330

Greg Linden: Source Personal Website

1. What project have you worked on do you wish you could go back to, and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997.The term wasn’t even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It’s hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It’s a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.
Advertisements

A map of the PyData Stack

Standard

One question you have when you use Python is what do I do with my data. How do I process it and analyze it. The aim of this flow chart is to simply provide a simple to use ‘map’ of the PyData stack.

At PyData Amsterdam I’ll present this and explain it in more detail but I hope this helps.

landscape_infographic_colour.png

Thanks to Thomas Wiecki, Matt Rocklin, Stephan Hoyer and Rob Story for their feedback and discussion over the last year about this kind of problem. There’ll be a few iterations based on their feedback.

CC-0 (Creative Commons-0) 2016 Peadar Coyle

 

(I’ll share the source file eventually).

Interview with a Data Scientist: Ivana Balazevic

Standard

Ivana Balazevic is a Data Scientist at a Berkeley based startup Wise.io, where she is working in a small team of data scientists on solving problems in customer service for different clients. She did her bachelor’s degree in Computer Science at the Faculty of Electrical Engineering and Computing in Zagreb and she recently finished her master’s degree in Computer Science with the focus on Machine Learning at the Technical University Berlin.

 

1. What do you think about ‘big data’?

I try not to think about it that much, although nowadays that’s quite hard to avoid. 🙂 It’s definitely an overused term, a buzzword.

I think that adding more and more data can certainly be helpful up to a point, but the outcome of majority of the problems that people are trying to solve depends primarily on the feature engineering process, i.e. on extracting the necessary information from the data and deciding which features to create. However, I’m certain there are problems out there which require large amounts of data, but they are definitely not so common for the whole world to obsess about.

 

2. What is the hardest thing for you to learn about data science?

I would say the hardest things are those which can’t be learned at school, but which you gain through experience. Coming out of school and working mostly on toy datasets, you are rarely prepared for the messiness of the real-world data. It takes time to learn how to deal with it, how to clean it up, select the important pieces of information, and transform this information into good features. Although that can be quite challenging, it is a core process of the whole data science creativity and one of the things that make data science so interesting.

 

3. What advice do you have for graduate students in the sciences who wish to become Data Scientists?

I don’t know if I’m qualified enough to give such advice, being a recent graduate myself, but I’ll try to write down things that I learned from my own experience.

Invest time in your math and statistics courses, because you’re going to need it. Take a side project, which might give you a chance to learn some new programming concepts and introduce you to interesting datasets. Do your homeworks and don’t be afraid to ask questions whenever you don’t understand something in the lecture, since the best time to learn the basics is now and it’s much harder to fill those holes in knowledge than to learn everything the right way from the beginning.

 

4. What project would you back to do and change? How would you change it?

Most of them! I often catch myself looking back at a project I did a couple of years ago and wishing I knew then what I know now. The most recent project is my master’s thesis, I wish I tried out some things I didn’t have time for, but I hope I’ll manage to catch some time to work on it further in the next couple of months.

 

5. How do you go about scoping a data science project?

Usually when I’m faced with a new dataset, I get very excited about it and can’t wait to dig into it, which gets in the way of all the planning that should have been done beforehand. I hope I’ll manage to become more patient about it with time and learn to do it the “right” way.

One of the things that I find a bit limiting about the industry is that you often have to decide whether something is worth the effort of trying it out, since there are always certain deadlines you need to hold on to. Therefore, it is very important to have a clear final goal right from the beginning. However, one needs to be flexible and take into account that things at the end user’s side might change along the way and be prepared to adapt to the user’s needs accordingly.

 

6. What do you wish you knew earlier about being a data scientist?

That you don’t spend all of your time doing the fun stuff! A lot of the work done by the data scientists is invested into getting the data, making it into the right format, cleaning it up, battling different encoding issues, writing tests for the code you wrote, etc. When you sum everything up, you spend only a part of your time doing the actual “data science magic”.

 

7. What is the most exciting thing you’ve been working on lately?

We are a small team of data scientists at Wise who are working on many interesting projects. I am mostly involved with the natural language processing tasks, since that is the field I’m planning to do my PhD in starting this fall. My most recent project is on expanding the customer service support to multilingual datasets, which can be quite challenging considering the highly skewed language distribution (80% English, 20% all other languages) in the majority of datasets we are dealing with.

 

8. How do you manage learning the ‘soft’ skills and the ‘hard’ skills? Any tips?

Learning the hard skills requires a lot of time, patience, and persistence, and I highly doubt there is a golden formula for it. You just have to read a lot of books and papers, talk to people that are smarter and/or have more experience than you and be patient, because it will all pay off.

Soft skills, on the other hand, somehow come naturally to me. I’m quite an open person and I’ve never had problems talking to people. However, if you do have problems with it, I suggest you to take a deep breath, try to relax, focus and tell yourself that the people you are dealing with are just humans like you, with their good and bad days, their strengths and imperfections. I believe that picturing things this way takes a lot of pressure off your chest and gives you the opportunity to think much more clearly.

What I’ve been working on – late 2015 and early 2016

Standard

I find it useful for morale just to write up what I’ve been working on and what I’ve learned over the last few months.

PyMC3: Bayesian Logistic Regression: Bayesian Logistic Regression and Model Selection – I wrote an example of how to use Deviance Information Criterion for model selection in a Bayesian Logistic Regression. This example includes quite a few plots and visualisations in Seaborn.

Rugby Analytics: A Hierarchical Model of the Six Nations 2015 in PyMC3. This is based on the work I showcased at my talks, I finally got it into the PyMC3 Examples directory.

Comparison of Fibonacci functions – This is a classic interview question but I was interested in putting together an example comparing different data structures in Python. In particular this was a good exercise to make sure I understood lazy evaluation.

Hamiltonian Monte Carlo – I wrote up some notes on the Hamiltonian Monte-Carlo algorithm. This is used a lot in PyMC3 but I hadn’t gone through the theory before. The piece isn’t original but I thought it was worth putting on my blog.

Deep Learning – I wrote a short post based on a days work on getting Deep Learning to work on AWS. My advice is don’t re-invent the wheel and some of the Nvidia drivers are incredibly difficult to install. I was able to finally get GPU speedup and reproduce some examples from Tensorflow.

The Setup – I interviewed myself with my own version of the ‘Setup’ a noted website. This is just me talking about what tools I use both software and hardware. I found it useful to think about how my tools affect my thought processes and creativity so I recommend you do it too 🙂

Hacking InsideAirBnB – I was using AirBnB over the last few months, so I thought it would be good to look for examples of data sources. This isn’t a very complete Machine Learning project but I put it here anyway. I might fix it up and add some more feature extraction, visualisation and PCA/SVD type tools to this.

Image Similarity Database – I haven’t had the chance to work with image data much professionally. So when I came across this from my friend Thomas Hunger I forced myself to reproduce it. I used Zalando image data in this example.

Three Things I wish I learned earlier about Machine Learning – I first got interested in Machine Learning in 2009 when I was interning in Shanghai. I think the only notable work I did back then was using Matlab to do some simple clustering algorithms for customer segmentation. I don’t claim several years professional data-science or Machine Learning experience but I’m not a complete neophyte, and this article is just about what I’ve learned. I republished it on Medium  too, so pick whichever version you prefer.

Dataconomy  – I interviewed Kevin Hillstrom a consultant in Analytics, he discussed the need for accuracy and business acumen, which certainly applies to Data Analytics.

What does Big Data have to do with the Food Industry – I wrote a non-technical  article on the opportunities for Data Science in the Food industry, this was the first time my commentary was featured on IrishTechNews.

There’ll be more stuff from me soon.

 

 

Image Similarity Database…

Standard

Image similarity questions are very common in e-commerce and fashion. This is particular the case with the question of similar colours. I based the following on the excellent work by my friend Thomas Hunger. My implementation has only a few alterations compared to his, but I felt it was worth putting online even if I’m not claiming any originality.

There are many improvements that would be made in a real industrial case, but I found this a good education exercise especially since I don’t have a lot of experience with image analysis and such similarity problems.

I hope you enjoy this too.

Interview with a Data Scientist: Trey Causey

Standard
Trey Causey is a blogger with experience as a professional data scientist in sports analytics and e-commerce. He’s got some fantastic views about the state of the industry, and I was privileged to read this.
1. What project have you worked on do you wish you could go back to, and do better?
The easy and honest answer would be to say all of them. More concretely, I’d love
to have had more time to work on my current project, the NYT 4th Down Bot before
going live. The mission of the bot is to show fans that there is an analytical
way to go about deciding what to do on 4th down (in American football), and that
the conventional wisdom is often too conservative. Doing this means you have to
really get the “obvious” calls correct as close to 100% of the time as possible,
but we all know how easy it is to wander down the path to overfitting in these
circumstances…
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?
Students should take as many methods classes as possible. They’re far more generalizable
than substantive classes in your discipline. Additionally, you’ll probably meet
students from other disciplines and that’s how constructive intellectual cross-fertilization
happens. Additionally, learn a little bit about software engineering (as distinct
from learning to code). You’ll never have as much time as you do right now for things
like learning new skills, languages, and methods.
For young professionals, seek out someone more senior than yourself, either at your
job or elsewhere, and try to learn from their experience. A word of warning, though,
it’s hard work and a big obligation to mentor someone, so don’t feel too bad if
you have hard time finding someone willing to do this at first. Make it worth
their while and don’t treat it as your “right” that they spend their valuable
time on you. I wish this didn’t even have to be said.
3. What do you wish you knew earlier about being a data scientist?
 
It’s cliche to say it now, but how much of my time would be spent getting data,
cleaning data, fixing bugs, trying to get pieces of code to run across multiple
environments, etc. The “nuts and bolts” aspect takes up so much of your time but
it’s what you’re probably least prepared for coming out of school.
4. How do you respond when you hear the phrase ‘big data’?
Indifference.
5. What is the most exciting thing about your field?
Probably that it’s just beginning to even be ‘a field.’ I suspect in five years
or so, the generalist ‘data scientist’ may not exist as we see more differentiation
into ‘data engineer’ or ‘experimentalist’ and so on. I’m excited about the
prospect of data scientists moving out of tech and into more traditional
companies. We’ve only really scratched the surface of what’s possible or,
amazingly, not located in San Francisco.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
A difficult question along the lines of “how long is a piece of string?” I think
the key is to communicate early and often, define success metrics as much as
possible at the *beginning* of a project, not at the end of a project. I’ve found
that “spending too long” / navel-gazing is a trope that many like to level at data
scientists, especially former academics, but as often as not, it’s a result of
goalpost-moving and requirement-changing from management. It’s important to manage
up, aggressively setting expectations, especially if you’re the only data scientist
at your company.
7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?
Honestly, I don’t believe I’ve met any executives who were dubious about the
value of data or data science. The challenge is often either a) to temper
unrealistic expectations about what is possible in a given time frame (we data
scientists mostly have ourselves to blame for this) or b) to convince them to
stay the course when the data reveal something unpleasant or unwelcome.
8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.
I’m about to start a new position as the first data scientist at ChefSteps, which
I’m very excited about, but I can’t tell you about what I’ve been working on there
as I haven’t started yet. Otherwise, the 4th Down Bot has been a really fun
project to work on. The NYT Graphics team is the best in the business and is
full of extremely smart and innovative people. It’s been amazing to see the
thought and time that they put into projects.
9. What is the biggest challenge of leading a data science team?
I’ve written a lot about unrealistic expectations that all data scientists
be “unicorns” and be experts in every possible field, so for me the hardest
part of building a team is finding the right people with complementary skills
that can work together amicably and constructively. That’s not special to
data science, though.

Interview with a Data Scientist: Nathalie Hockham

Standard
1038670
(Linkedin picture)
I was very happy to interview Natalie about her data science stuff – as she gave a really cool Machine Learning focused talk at PyData in London this year, which was full of insights into the challenges of doing Machine Learning with Imbalanced data sets.
Natalie leads the data team at GoCardless, a London startup specialising in online direct debit. She cut her teeth as a PhD student working on biomedical control systems before moving into finance, and eventually fintech. She is particularly interested in signal processing and machine learning and is presently swotting up on data engineering concepts, some knowledge of which is a must in the field.

What project have you worked on do you wish you could go back to, and do better?

Before I joined a startup, I was working as an analyst on the trading floor of one of the oil majors. I spent a lot of time building out models to predict futures timespreads based on our understanding of oil stocks around the world, amongst other things. The output was a simple binary indication of whether the timespreads were reasonably priced, so that we could speculate accordingly. I learned a lot about time series regression during this time but worked exclusively with Excel and eViews. Given how much I’ve learned about open source languages, code optimisation, and process automation since working at GoCardless, I’d love to go back in time and persuade the old me to embrace these sooner.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Don’t underestimate the software engineers out there! These guys and girls have been coding away in their spare time for years and it’s with their help that your models are going to make it into production. Get familiar with OOP as quickly as you can and make it your mission to learn from the backend and platform engineers so that you can work more independently.

What do you wish you knew earlier about being a data scientist?

It’s not all machine learning. I meet with some really smart candidates every week who are trying to make their entrance into the world of data science and machine learning is never far from the front of their minds. The truth is machine learning is only a small part of what we do. When we do undertake projects that involve machine learning, we do so because they are beneficial to the company, not just because we have a personal interest in them. There is so much other work that needs to be done including statistical inference, data visualization, and API integrations. And all this fundamentally requires spending vast amounts of time cleaning data.


How do you respond when you hear the phrase ‘big data’?

I haven’t had much experience with ‘big data’ yet but it seems to have superseded ‘machine learning’ on the hype scale. It definitely sounds like an exciting field – we’re just some way off going down this route at GoCardless.

What is the most exciting thing about your field?
Working in data is a great way to learn about all aspects of a business, and the lack of engineering resource that characterizes most startups means that you are constantly developing your own skill set. Given how quickly the field is progressing, I can’t see myself reaching saturation in terms of what I can learn for a long time yet. That makes me really happy.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Our 3 co-founders all started out as management consultants and the importance of accurately defining a problem from the outset has been drilled into us. Prioritisation is key – we mainly undertake projects that will generate measurable benefits right now. Before we start a project, we check that the problem actually exists (you’d be surprised how many times we’ve avoided starting down the wrong path because someone has given us incorrect information). We then speak to the relevant stakeholders and try to get as much context as possible, agreeing a (usually quantitative) target to work towards. It’s usually easy enough to communicate to people what their expectations should be. Then the scoping starts within the data team and the build begins. It’s important to recognise that things may change over the course of a project so keeping everyone informed is essential. Our system isn’t perfect yet but we’re improving all the time.

How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
Luckily, our management team is very embracing of data in general. Our data team naturally seeks out opportunities to meet with other data professionals to validate the work we’re doing. We try hard to make our work as transparent as possible to the rest of the company by giving talks and making our data widely available, so that helps to instill trust. Minor clashes are inevitable every now and then, which can put projects on hold, but we often come back to them later when there is a more compelling reason to continue.

What is the most exciting thing you’ve been working on lately and tell us a bit about GoCardless.
We’ve recently overhauled our fraud detection system, which meant working very closely with the backend engineers for a prolonged period of time – that was a lot of fun.
GoCardless is an online direct debit provider, founded in 2011. Since then, we’ve grown to 60+ employees, with a data team of 3. Our data is by no means ‘big’ but it can be complex and derives from a variety of sources. We’re currently looking to expand our team with the addition of a data engineer, who will help to bridge the gap between data and platform.

What is the biggest challenge of leading a data science team?

The biggest challenge has been making sure that everyone is working on something they find interesting most of the time. To avoid losing great people, they need to be developing all the time. Sometimes this means bringing forward projects to provide interest and raise morale. Moreover, there are so many developments in the field that its hard to keep track, but attending meetups and interacting with other professionals means that we are always seeking out opportunities to put into practice the new things that we have learned.