Interview with a Data Scientist – Ian Wong of OpenDoor

Standard
I interviewed the interesting and fascinating Ian Wong – he’s the technical co-founder of OpenDoor, which I personally think is amazing as a concept!
(Ian Wong – Source: Linkedin)
1. What project have you worked on do you wish you could go back to, and do better?
Pretty much any project I’ve worked on in the past smiling face with open mouth Two projects stick out though.
a. Uniform API for Data Fetching
When we fit a model, call it y = f(X),  (X, y) are often taken for granted to be well-formed. How do you design a service that generates consistent (X, y)? Turns out this is not straightforward, and is specific to the domain and the data-capture systems.
The ideal solution would satisfy both batch & real-time needs; makes it easy to ship new features to production; and enables rapid prototyping. While I scrapped together the initial version at Opendoor, our amazing team of data engineers have really taken it to the next level. We hope to report our findings soon.
b. Interactive Visualizations of ML Algorithms
A few years ago, I tried putting together a visualizer for random forests using d3. I had the tremendous fortune of working with Mike Bostock for a bit, and was inspired by his ability to make abstract concepts tangible through interactive visualizations. At the time, I was working with these big sets of random forests, and wanted a get a better feel of the model outputs. So I rendered hundreds of decisions trees on screen, where when you hovered over one node, all other nodes belonging to the same features across different trees would be highlighted. It was pretty neat! But the prototype suffered from performance issues plus my own technical incompetence.
More broadly, I’m really excited about better interactive tools for ML algorithms because they’ll help us deeply understand them as tools. During my undergrad studies in electrical engineering, we used to play with circuits a lot. In the labs, you could make a change to the input or to the circuitry, and see the corresponding change in output on the oscilloscope instantaneously. When we run a simple regression, wouldn’t it be great to get immediate feedback on the fitted line if we were to drag, add, delete a data point?
I’m hopeful we’ll see more innovation in the read-eval-print loop for data science.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
My view may be slightly biased as I’m a PhD drop-out winking face In grad school, I became increasingly frustrated at the divergence between what’s interesting and what’s impactful. But that’s a whole separate conversation.
For folks looking to enter industry, nothing replaces hands-on practice. I would strongly encourage students to look for internships, participate in Kaggle competitions / Google Summer of Code, seek open source projects to contribute to. If you’re in school, take a wide variety of classes, especially computer science and project-based courses.
The higher order bit here is that the industry faces a different, evolving set of challenges than academia. The focus is typically on solving a business problem.
Here are additional pointers depending on the reader’s interest.
Business Analytics & Decision Science
  • The grammar of graphics and tidydata lay a great foundation for reasoning about data. I personally learned more by working through Hadley’s ggplot2 book than from many of my stats classes at Stanford. 
  • Develop business acumen and communication skills. Sometimes academics prefer to stay in the rarified air of theory and mathematics. Success in the analytics profession requires the ability to (a) meet hard business challenges head on, (b) break them down into smaller, quantifiable sub-problems, (c) rapid analysis, (d) present findings in a way that the audience can engage with, (e) take feedback and iterate.
Machine Learning & Engineering
  • Code, code, code. As I mentioned in Doing Data Science: ML is founded in math, expressed in code, and assembled into software. Being able to build robust software systems is becoming more important, as tools and algorithms are increasingly available.
  • While a strong grasp of theory will help narrow design choices, nothing beats rapidly exploring hypotheses. This demands coding proficiency, which from experience is a differentiating trait of highly productive data scientists.
Also, don’t let your field tie you down! Beware of sunk cost fallacy. Though PhDs may have invested years studying a certain field, the techniques investigated through a graduate program may not be transferable to a new domain. The most important quality of the PhD is persistence in doing research. Remember, it’s re-search search and search again. That’s what defines a great problem solver.
3. What do you wish you knew earlier about being a data scientist?
There are so many! Here’s a few.
How to build great predictive services
While we spend a lot of energy in grad school studying techniques, advanced techniques often yield only incremental lift over a simple solution (and in many cases comes with complexity that becomes a heavy tax). I think the big focus on modeling techniques contributes to the phenomenon of solutions chasing problems, rather than solutions being designed from the needs of the problem. Here’s a rule of thumb that I’ve come to adopt: You know that algorithm that all the papers make fun of in their intro? Implement that and forget the rest of the paper.”
Perhaps influenced by schooling, we as data scientists often dream about having these flashes of brilliance that identifies a proof! QED! In practice, what delivers results is an error-focused, iterative process of continuous model improvement (see my talk here). It’s the unglamorous engineering & detective work of starting with the biggest outliers of the model, and reasoning from first principles to eliminate them. Model debugger describes the role better than data scientist. It’s about the toil.
Forming a perspective based on incomplete data
Intellectual honesty, scientific doubt and a healthy dose of paranoia are generally great things to have. But beware of analysis-paralysis and failure to put a stake in the ground. Decisions need to be made in a timely fashion. In many cases we’re operating with 80% information (if lucky!), and your teammates are counting on you for a recommendation.
Earlier in my career, I would be reluctant in forming and articulating a strong perspective. Partly due to skepticism inculcated through school, and partly because it didn’t seem like it was my job as a data scientist to do so (more on titles being a constraint later). Making an actual policy recommendation seems so messy relative to the clean code and beautiful plots staring at me on the monitor. But I’ve since learned that this is an abdication of responsibility. Our job is to help the company make data-driven decisions, which means thinking through the implications of an analysis, consulting stakeholders, and coming up with a point of view.
Communication as craft
The value of an analysis is measured by whether it influences decisions. Even the most brilliant analysis becomes ineffective if not delivered to the audience in an accessible manner. This Jeff Atwood blog post explains the concept well.
Other things!
There are many other things I wish I knew earlier. How do I pick up software engineering skills? What does a great data scientist look like? How do I progress to become better? How do I foster effective debate and engagement of my work? I’ll omit them for now since this is getting long…
4. How do you respond when you hear the phrase big data’? What about AI’?
On Big Data
There’s big data, and there’s Big Data. If you’re referring to the latter, I think it’s a bit passé at this point (with some exceptions).
Turns out that more data beats better algorithms most of the time. As an industry we have worked really hard to make count(x) group by y scale to terabytes of data. But as alluded to earlier, the tools and infrastructure are increasingly commoditized. We’re ready to move onto higher parts of the application stack vs. focusing on the base layer. (e.g., Opendoor! See also the vertical AI piece by Bradford Cross.)
On AI
Turns out machines are tireless and can count much more reliably than human beings. This has implications as we enter the age of abundant data. This can get philosophical quick! But there are both benefits and hazards we’ll need to navigate.
5. What is the most exciting thing about your field?
As technology and education improves and become more accessible, there’ll be an increased supply of data science and machine learning talent. These individuals will become the next generation builders and leaders. Algorithmic sophistication is going to seep into all parts of our daily lives. The products they create are going to be smarter, easier to use and more personal (Opendoor being an example smiling face with open mouth).
6. How do you go about framing a data problem in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
As Alan Kay once said: A change in perspective is worth 80 IQ points. Framing a problem well is probably the most important part of the solution.
Within the context of building predictive services, defining the objective function with a clear metric that’s ideally back-testable is half the battle. It provides a foundation for the rest of the work, which entails applying the simplest approach, iterating until convergence to some threshold that’s set by business needs. The art comes in how to define the ML problem in a way that aligns with business outcome (ideally tracing all the way through to the top- or bottom-line).
In terms of when is good enough, that depends on the need. Running a company is kind of like trying to solve an NP-hard resource optimization problem. We have to be rigorous about ROI for each initiative that we spend energy on.
In terms of managing expectation, it’s hard. It folds into longer term project and team planning. What is the business problem we’re trying to solve? What does success mean? Where do we need to be today, a quarter from now, a year from now? Who are the stakeholders? How should we provide updates and receive feedback?
7. You’ve spoken about people not needing to be constrained by titles. Could you expand a bit on that? What sort of skills should someone with ML skills be learning in your opinion? What have you learned working at Opendoor?
On Being Boxed in by Titles and Venn Diagrams
Titles should enable, not constrain. We are all problem solvers first. A title acknowledges that an individual is skilled in a certain area. But one shouldn’t let that define their boundaries. When misused, titles could be a escape hatch to avoid doing the things that matter. For instance, a misinformed data scientist may think of productionizing their insight” as unnecessary implementation detail, while a misinformed software engineer may think of defining data quality SLA for predictive systems as esoteric. In practice, there’s a metric to move, a question to be answered. Titles endow neither immunity nor magical problem-solving powers. What matters is clarifying the job to be done.
In a PhD program, there’s a tendency to put blinders on and focus on one problem, specified by one professor in one department. In industry, solutions tend to be multi-disciplinary. A lot of what we do as data scientists is to take human intuition and generalize them, seeing which withstand the backtest or an experiment. And to do this well, we need to be open to new ideas and continuously develop new skills. As allude to earlier, some of the key ones are (a) business intelligence and (b) software engineering (including frontend!).
But I would be remiss not to mention the following
  • Scrappy + Pragmatic + Business Acumen > Technical Expertise
The best data scientists are relentlessly resourceful and impact- / solution-oriented. Mindset shifts from I need to gain skill x” to I am going to solve problem y”; from not my job” to run towards where the impact is.”
Opendoor
It’s been an incredible journey thus far at Opendoor. We are on a mission to empower everyone with the freedom to move by building a seamless, end-to-end customer experience that makes buying and selling a home stress-free and instant. The experience of growing our team, scaling up as a leader and servicing thousands of customers has been really rewarding. It’s the perfect blend of crazy-hard technical challenges and creating positive impact in people’s lives.
We’re only getting started! If any of this seems exciting, check out http://opendoor.com/jobs, or email me at ian@opendoor.com.
These are great questions and I had a fun time (partially) answering them!
Bio
Ian is working on modernizing the residential real-estate industry. He is the technical co-founder of Opendoor, where he leads their engineering and data science team. He previously built fraud detection systems at Square as their first data scientist, and products at Prismatic. Ian received his BS in EE (Electrical Engineering), MS in EE and MS in Statistics from Stanford before dropping out of the PhD.

Interview with a Data Scientist: Mick Cooney

Standard

I’m delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in Finance and more recently in Insurance, he co-ran the Dublin R meetup which was very successful and helped foster a data science community in Dublin. More recently he’s been working over in London at an Actuarial Consultancy – building out a data science practice.

q1. What project have you worked on do you wish you could go back to,
and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the
forecasts but it does so by hand creating La-TeX files compiled into
PDF. One of the first things I would do is switch all that over to use
either ‘knitr’ or ‘rmarkdown’. I would also use more ‘reproducible
research’ concepts.

That said, I had worked on the modeling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modeling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem I expect I will be working
on it again in the near future. Hopefully I can apply those lessons
then.

***
q2. What advice do you have to younger analytics professionals and in
particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to
Meetups, reading about new techniques, watching videos on YouTube and
looking to strengthen areas where you are weak. This is why a natural
interest and curiosity is so invaluable – it makes these necessary
tasks much less of a burden as they are things you would want to do
anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes and reread and
re-watch the books and lectures I find useful. I want these things to
be second nature and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

I have two main things I wish I learned early on in my career, and
both are connected philosophically. First, I wish I had learned about
probabilistic thinking, risk management, economics and statistics –
you can never learn enough about these fundamental topics. Secondly, I
wish I learned it is okay to start working with a bad model that you
know is wrong but simple.

To that first point, I spend a long time fighting my natural desire
for a clean, elegant and correct answer to a problem. I would work on
a problem, get to a point that I was confident pointed us in the right
direction, but then realise that ‘proving’ this was right involved a
huge amount of time and effort, assuming it was possible.

I attributed my natural reluctance to pursue this ‘answer’ as
laziness, and felt guilty. I felt I was being unprofessional and
sloppy. But working on forecasting models for trading taught me that
this was not the case. Models are so imperfect, with so many
compromises it is often more optimal to think about other things first
– what are the limitations of the model in practice, what is it
saying, how are you going to use it. Answer those questions first,
THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size,
in-memory, on-disk and finally the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriate sampling of your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.

Once decided on a solution, putting the model into production and
scaling it for your business is a major issue, but is a problem more
belonging to the realm of network and software engineering. That said,
it is important to keep people with a solid understanding of the
concepts stay involved, just in case some ‘optimisations’ ruin the
output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in ‘The Fog of War’ mentioned that you should never
answer the question asked but instead answer the question you wanted
to be asked, so with your forebearance I will first answer a liberal
interpretation of that question: what work gets me excited?

The short answer to that question is all sorts of things do, but they
are often small things related to work I am doing. In the last few
months, I was excited to try out dataexpks (a data exploration package
I am co-creating) on a brand new data set to see what it showed me and
how well my code worked. I love think of ways to use Monte Carlo
simulation to test the output of various regression models, and over
Christmas I was fascinated by a short project trying out methods of
investigating differences between a subpopulation within a larger
population.

I am fascinated by new ways to learn the fundamentals – there are a
few excellent ones out there and I read them all the time. I can never
learn enough as in my experience reality tends to present us with
basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended, I think the
advances in reinforcement learning techniques probably have the
biggest potential – some of the Atari gameplaying from Deep Mind was
eye-opening. Sadly, if history is any guide, much of it will prove to
be hype, but I imagine some very interesting results to come from the
work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what I
do or how to articulate it. I have had the good fortune to help a lot
of people with their projects and problems, exposing me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs, articles and subscribe to mailing
lists. While rarely having the time to read all this, often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the problem:
what is being asked? Do we have any data? What does is it look like?
Are there other data available we can use to enrich or use as a
substitute?

Going through that process will suggest approaches to use, and at that
point I draw upon previous experience, however tangential to the
problem..

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making on that note, I think there is a lot
of implicit knowledge in this field, and I have been told a number of
times from people looking for help that people feel overwhelmed by the
sheer amount of knowledge people feel they need to know.

I do not think this is true, but I understand its origin: there is so
many different aspects to working with data it is tough to know where
to start. I always start very simple, but as I mentioned early, it
took a lot of time, thought and effort to get to that point, and it is
not easy to explain these ideas in theory – you have to work on a
number of different datasets to get a feel for how to do this.

As a result, I believe an approach such as mentoring or
apprenticeships are an effective approach to teach people – more
experienced analysts can guide junior members around the various
pitfalls and traps that are easy to fall into. It allows us to
illustrate that fancy and sophisticated techniques and algorithms are
not needed to do interesting work – some of the most interesting work
I have seen involved little more than summary statistics along with
basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest
book I read that talks about this is “Data Analysis Using Regression
and Multilevel/Hierarchical Models” by Gelman and Hill, stressing the
importance of starting from simple models. I would love to know if
there are more.

That said, I could only appreciate the point because I was already
experienced, a younger version of myself would have missed the
point. It would not have occurred to me that the right way to do
something is to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.

Interview with a Data Scientist: Greg Linden

Standard
I caught up with Greg Linden via email recently
Greg was one of the first people to work on data science in Industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I’ll quote his bio from Linkedin:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”
img_6330

Greg Linden: Source Personal Website

1. What project have you worked on do you wish you could go back to, and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997.The term wasn’t even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It’s hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It’s a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.

Interview with a Data Scientist – Jessica Graves

Standard

Jessica Graves is a Data Scientist who currently works on fashion problems in New York City. She’s worked with Hilary Mason at Fast Forward Labs and keeps in regular contact with the London startup scene. After many months of asking her for an interview she finally gave in – and she shares her unique perspective on the datafication of Fashion. She comes from a background in visual and performing arts, as well as fashion design. In her spare time you’ll find her reading a stack of papers or studying dance.

Cover image: unsplash.com CCO

Jessica Graves_02-1

  1. What project have you worked on do you wish you could go back to, and do better?

I worked with Dr. Laurens Mets on an iteration of the technology behind Electrochaea, a device where microbes convert waste electricity to clean natural gas. My job was to translate models from electrochemistry journals into code, to help simulate, measure and optimize the parameters of the device. We needed to facilitate electron transport and keep the microbes happy. Read papers, write code, and design alternative energy technology with math + data?! I would hand my past self How to Design Programs as a guide and learn to re-implement from scratch in an open source language. 

  1. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Listen! If you are a data scientist, listen carefully to the business problems of your industry, and see the problems for what they are, rather than putting the technical beauty of and personal interest in the solution first and foremost. You may find it’s more important to you to work with a certain type of problem than it is to work at a certain type of company, or vice versa. Watch very carefully when your team expresses frustration in general – articulate problems that no one knows they should be asking you to solve. At the same time, it can be tempting to work on a solution that has no problem. If you’re most interested in a specific machine learning technique, can you justify its use over another, or will high technical debt be a serious liability? Will a project be leveragable (legally, financially, technically, operationally)? Can you quantify the risk of not doing a project? 

  1. What do you wish you knew earlier about being a data scientist?

I wish I realized that data science is classical realist painting.

Classical realists train to accurately represent a 3D observation as a 2D image. In the strictest cases, you might not be allowed to use color for 1-3 years, working only with a stick of graphite, graduating to charcoal and pencils, eventually monotone paintings. Only after mastering the basics of form, line, value, shade, tone, are you allowed a more impactful weapon, color. With oil painting in particular, it matters immensely in what order at what layer you add which colors, which chemicals compose each color, of which quality pigment, at what thickness, with what ratio of which medium, with which shape of brush, at what angle, after what period of drying. Your primary objective is to continuously correct your mistakes of translating what you observe and suspending your preconception of what an object should look like.

There are many parallels with data science. At no point as a classical realist painter should you say, ‘well it’s a face, so I’m going to draw the same lines as last time’ just like as a data scientist, you should look carefully at the data before applying algorithm x, even if that’s what every blog post Google surfaces to the top of your results says to do in that situation. You have to be really true to what you observe and not what you know – sometimes a hand looks more like a potato than a hand, and obsessing over anatomical details because you know it’s a hand is a mistake. Does it produce desirable results in the domain of problems that you’re in? Are you assuming Gaussian distributions on skewed data? Did you go directly to deep learning when logistic regression would have sufficed? I wish I knew how often data science course offerings are paint by numbers. You won’t get very far once the lines are removed, the data is too big to extract on your laptop, and an out-of-memory error pops up running what you thought was a pretty standard algorithm on the subset you used instead. Let alone that you have to create or harvest the data set in the first place – or sweet talk someone into letting you have access to it.  

In addition, Nulla dies sine linea – it’s true for drawing, ballet, writing. It’s true for data science. No day without a line. It’s very difficult to achieve sophistication without crossing off days and days of working through code or theoretical examples (I think this is why Recurse Center is so special for programmers). Sets of bland but well-executed tiny piece of software. Unspectacular, careful work in high volumes raises the quality of all subsequent complex works. Bigger, slower projects benefit from myriads of partially explored pathways you already know not to take.

Also side notes to my past self: Linux. RAM. Thunderbolt ports. 

  1. How do you respond when you hear the phrase ‘big data’?

Big data? Like in the cloud? Or are we in the fog now? Honestly the first thing I see in my mind is PETABYTES. I think of petabytes of selfies raining from the sky and flowing into a data lake. Stagnant. Data-efficient AI is all the rage — less data, more primitives, smarter agents. In the meantime, optimizing hardware and code to work with large datasets is pretty fun. Fetishizing the size of the data works well …as long as you don’t care about robustness to diverse inputs. Can your algorithm do well with really niche patterns? What can you do with the bare minimum amount of data? 

  1. What is the most exciting thing about your field?

Fashion is visual. It’s inescapable. Every culture has garb or adornment, however minimal. A few trillion dollars of apparel, textiles, and accessories across the globe. The problems of the industry are very diverse and largely unsolved. A biologist might come to fashion to grow better silk. An AI researcher might turn to deep learning to sift through the massive semi-structured set of apparel images available online. So many problems that may have a tech solution are unsolved. Garment manufacturing is one of the most neglected areas of open source software development. LVMH and Richemont don’t fight over who provided the most sophisticated open-source tools to researchers the way that Amazon and Google do. You can start a deep learning company on a couple grand and use state-of-the-art software tools for cheap or free. You cannot start an apparel manufacturing vertical using state-of-the-art tools without serious investment, because the climate is still extremely unfavorable to support a true ecosystem of small-scale independent designers. The smartest software tools for the most innovative hardware are excessively expensive, closed-source, and barely marketed — or simply not talked about in publically accessible ways. Sewing has resisted automation for decades, although is finally now at a place now were the joining of fabrics into a seam is robot-automatable with computer vision used on a thread-by-thread basis to determine the location of the next stitch. 

High end, low end, or somewhere in between, the apparel side of fashion’s output is a physical object that has to be brought to life from scratch, or delivered seamlessly, to a human, who will put the object on their body. Many people participate in apparel by default, but the fashion crowd is largely self-selected and passionate, so it’s exciting (and difficult) to build for such an engaged group that don’t fit standard applications of standard machine learning algorithms.

  1. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

Artists learn this eventually: volume of works produced trumps perfectionism. Even to match something in classical realism, you start with ridiculous abstractions. Cubes and cylinders to approximate heads and arms. Break it down into the smallest possible unit. Listen to Polya, “If you can’t solve a problem, then there is an easier problem you can solve: find it.”

As for when to finish? Nothing is never good enough. The thing that is implemented is better than the abstract, possibly better thing, for now, and will probably outlive its original intentions. But make sure that solution correlates thoroughly with the problem, as described in the words of the stakeholder. Otherwise, for a consumer-facing product or feature, your users will usually give you clues as to what’s working. 

  1. You spent sometime as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Be open. Fashion has a lot of space for innovation if you understand and quantify your impact on problems that are actually occurring and costing money or time, and show that you can solve them fast enough. “We built this new thing” has absolutely nothing to do with “We built this useful thing” and certainly not “We built this backwards-compatible thing”. You might be tempted to recommend a “new thing” and then complain that fashion isn’t sophisticated enough or “data” enough for it. As an industry that in some cases has largely ignored data for gut feelings with a serious payoff, I think the attitude should be more of pure respect than of condescension, and of transitioning rather than scrapping. That or build your own fashion thing instead of updating existing ones.  

  1. You have worked in fashion. Can you talk about the biggest opportunities for data in the fashion industry. Are there cultural challenges with datafication in such a ‘creative industry’.

Fashion needs ‘datafication’ that clearly benefits fashion. If you apply off-the-shelf collaborative filtering to fashion items with a fixed seasonal shelf life to users that never really interact with, you’re going to get poor results. Algorithms that work badly in other domains might work really well in fashion with a few tweaks. NIPS had an ecommerce workshop last year, and KDD has a fashion-specific workshop this year, which is exciting to see, although I’ll point out that researchers have been trying to solve textile manufacturing problems with neural networks since the 90s.

A fashion creative might very well LOVE artificial intelligence, machine learning, and data science if you tailor your language into what makes their lives easier. Louis Vuitton uses an algorithm to arrange handbag pattern pieces advantageously on a piece of leather (not all surfaces of the leather are appropriate for all pattern pieces of the handbag) and marks the lines with lasers before artisans hand-cut the pieces. The artisans didn’t seem particularly upset about this. 

The two main problems I still see right now are the doorman problem and fit. Use data and software to make it simple for designers of all scales to adjust garments to fit their real markets instead of their imagined muses. And, use as little input as possible to help online shoppers know which existing items will fit. Once they buy, make sure they get their packages on time, securely, discreetly. 

Interview with a Data Scientist: Phillip Higgins

Standard

Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, amongst other experience including some in Germany.

What project have you worked on do you wish you could go back and do better?

Hindsight is a wonderful thing, we can always find things we could have done better in projects.  On the other hand, analytic and modelling projects are often frought with uncertainty- uncertainty that despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop both deep knowledge of a particular area and at the same time, to broaden their knowledge and to maintain this focus of learning on both specialised and general subjects throughout their careers.  Secondly, its important to gain as much practice as possible – data science is precisely that because it deals with real-world problems.  I think PhD students should cultivate industry contacts and network widely- staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?
Undoubtedly I wish I knew the importance of communication skills in the whole analytics life-cycle.  Its particularly important to be able to communicate findings to a wide audience and so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists with new possibilities in terms of the work they are able to perform and the significance of their work.  I don’t think it’s a coincidence that the importance and demand of data scientists has risen sharply right at the time that Big Data has become mainstream- for Big Data to yield insights, “Big Analytics” need to be performed – they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think its important to never lose sight of the business objectives that are the rationale for most data-scientific projects.  Although it is essential that businesses allow for data science to disprove hypotheses, at the end of the day, most evidence will be proving hypotheses (or disproving the null hypothesis).  The path to formulating those hypotheses lies obviously mostly in exploratory data analysis (combined with domain knowledge).  It is important to communicate this uncertainty as to framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders and that’s certainly an enjoyable aspect of the job.  I have dealt with a wide range of stakeholders, from C-level executives through to mid- level managers and analysts and each group requires a different approach.  A stakeholder analysis matrix is a good place to start- analysing stakeholders by importance and influence.  Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

Interview with a Data Scientist: Ivana Balazevic

Standard

Ivana Balazevic is a Data Scientist at a Berkeley based startup Wise.io, where she is working in a small team of data scientists on solving problems in customer service for different clients. She did her bachelor’s degree in Computer Science at the Faculty of Electrical Engineering and Computing in Zagreb and she recently finished her master’s degree in Computer Science with the focus on Machine Learning at the Technical University Berlin.

 

1. What do you think about ‘big data’?

I try not to think about it that much, although nowadays that’s quite hard to avoid. 🙂 It’s definitely an overused term, a buzzword.

I think that adding more and more data can certainly be helpful up to a point, but the outcome of majority of the problems that people are trying to solve depends primarily on the feature engineering process, i.e. on extracting the necessary information from the data and deciding which features to create. However, I’m certain there are problems out there which require large amounts of data, but they are definitely not so common for the whole world to obsess about.

 

2. What is the hardest thing for you to learn about data science?

I would say the hardest things are those which can’t be learned at school, but which you gain through experience. Coming out of school and working mostly on toy datasets, you are rarely prepared for the messiness of the real-world data. It takes time to learn how to deal with it, how to clean it up, select the important pieces of information, and transform this information into good features. Although that can be quite challenging, it is a core process of the whole data science creativity and one of the things that make data science so interesting.

 

3. What advice do you have for graduate students in the sciences who wish to become Data Scientists?

I don’t know if I’m qualified enough to give such advice, being a recent graduate myself, but I’ll try to write down things that I learned from my own experience.

Invest time in your math and statistics courses, because you’re going to need it. Take a side project, which might give you a chance to learn some new programming concepts and introduce you to interesting datasets. Do your homeworks and don’t be afraid to ask questions whenever you don’t understand something in the lecture, since the best time to learn the basics is now and it’s much harder to fill those holes in knowledge than to learn everything the right way from the beginning.

 

4. What project would you back to do and change? How would you change it?

Most of them! I often catch myself looking back at a project I did a couple of years ago and wishing I knew then what I know now. The most recent project is my master’s thesis, I wish I tried out some things I didn’t have time for, but I hope I’ll manage to catch some time to work on it further in the next couple of months.

 

5. How do you go about scoping a data science project?

Usually when I’m faced with a new dataset, I get very excited about it and can’t wait to dig into it, which gets in the way of all the planning that should have been done beforehand. I hope I’ll manage to become more patient about it with time and learn to do it the “right” way.

One of the things that I find a bit limiting about the industry is that you often have to decide whether something is worth the effort of trying it out, since there are always certain deadlines you need to hold on to. Therefore, it is very important to have a clear final goal right from the beginning. However, one needs to be flexible and take into account that things at the end user’s side might change along the way and be prepared to adapt to the user’s needs accordingly.

 

6. What do you wish you knew earlier about being a data scientist?

That you don’t spend all of your time doing the fun stuff! A lot of the work done by the data scientists is invested into getting the data, making it into the right format, cleaning it up, battling different encoding issues, writing tests for the code you wrote, etc. When you sum everything up, you spend only a part of your time doing the actual “data science magic”.

 

7. What is the most exciting thing you’ve been working on lately?

We are a small team of data scientists at Wise who are working on many interesting projects. I am mostly involved with the natural language processing tasks, since that is the field I’m planning to do my PhD in starting this fall. My most recent project is on expanding the customer service support to multilingual datasets, which can be quite challenging considering the highly skewed language distribution (80% English, 20% all other languages) in the majority of datasets we are dealing with.

 

8. How do you manage learning the ‘soft’ skills and the ‘hard’ skills? Any tips?

Learning the hard skills requires a lot of time, patience, and persistence, and I highly doubt there is a golden formula for it. You just have to read a lot of books and papers, talk to people that are smarter and/or have more experience than you and be patient, because it will all pay off.

Soft skills, on the other hand, somehow come naturally to me. I’m quite an open person and I’ve never had problems talking to people. However, if you do have problems with it, I suggest you to take a deep breath, try to relax, focus and tell yourself that the people you are dealing with are just humans like you, with their good and bad days, their strengths and imperfections. I believe that picturing things this way takes a lot of pressure off your chest and gives you the opportunity to think much more clearly.

Interview with a Data Scientist: Brad Klingenberg

Standard

Bio

Brad Klingenberg is the Director of Styling Algorithms at Stitch Fix in San Francisco. His team uses data and algorithms to improve the selection of merchandise sent to clients. Prior to joining Stitch Fix Brad worked with data and predictive analytics at financial and technology companies. He studied applied mathematics at the University of Colorado at Boulder and earned his PhD in Statistics at Stanford University in 2012.


 

1. What project have you worked on do you wish you could go back to, and do better?

 

Nearly everything! A common theme would be not taking the framing of a problem for granted. Even seemingly basic questions like how to measure success can have subtleties. As a concrete example, I work at Stitch Fix, an online personal styling service for women. One of the problems that we study is predicting the probability that a client will love an item that we select and send to her. I have definitely tricked myself in the past by trying to optimize a measure of prediction error like AUC.

This is trickier than it seems because there are some sources of variance that are not useful for making recommendations. For example, if I can predict the marginal probability that a given client will love any item then that model may give me a great AUC when making predictions over many clients, because some clients may be more likely to love things than others and the model will capture this. But if the model has no other information it will be useless for making recommendations because it doesn’t even depend on the item. Despite its AUC, such a model is therefore useless for ranking items for a given client. It is important to think carefully about what you are really measuring.


 

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?

 

Focus on learning the basic tools of applied statistics. It can be tempting to assume that more complicated means better, but you will be well-served by investing time in learning workhorse tools like basic inference, model selection and linear models with their modern extensions. It is very important to be practical. Start with simple things.

Learn enough computer science and software engineering to be able to get things done. Some tools and best practices from engineering, like careful version control, go a long ways. Try to write clean, reusable code. Popular tools in R and Python are great for starting to work with data. Learn about convex optimization so you can fit your own models when you need to – it’s extremely useful to be able to cast statistical estimates as the solution to optimization problems.

Finally, try to get experience framing problems. Talk with colleagues about problems they are solving. What tools did they choose? Why? How should did they measure success? Being comfortable with ambiguity and successfully framing problems is a great way to differentiate yourself. You will get better with experience – try to seek out opportunities.


 

3. What do you wish you knew earlier about being a data scientist?

 

I have always had trouble identifying as a data scientist – almost everything I do with data can be considered applied statistics or (very) basic software engineering. When starting my career I was worried that there must be something more to it – surely, there had to be some magic that I was missing. There’s not. There is no magic. A great majority of what an effective data scientist does comes back to the basic elements of looking at data, framing problems, and designing experiments. Very often the most important part is framing problems and choosing a reasonable model so that you can estimate its parameters or make inferences about them.


 

4. How do you respond when you hear the phrase ‘big data’?

 

I tend to lose interest. It’s a very over-used phrase. Perhaps more importantly I find it to be a poor proxy for problems that are interesting. It can be true that big data brings engineering challenges, but data science is generally made more interesting by having data with high information content rather than by sheer scale. Having lots of data does not necessarily mean that there are interesting questions to answer or that those answers will be important to your business or application. That said, there are some applications like computer vision where it can be important to have a very large amount of data.


 

5. What is the most exciting thing about your field?

 

While “big data” is overhyped, a positive side effect has been an increased awareness of the benefits of learning from data, especially in tech companies. The range of opportunities for data scientists today is very exciting. The abundance of opportunities makes it easier to be picky and to find the problems you are most excited to work on. An important aspect of this is to look in places you might not expect. I work at Stitch Fix, an online personal styling service for women. I never imagined working in women’s apparel, but due to the many interesting problems I get to work on it has been the most exciting work of my career.


 

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

 

As I mentioned previously, it can be helpful to start framing a problem by thinking about how you would measure success. This will often help you figure out what to focus on. You will also seldom go wrong by starting simple. Even if you eventually find that another approach is more effective a simple model can be a hugely helpful benchmark. This will also help you understand how well you can reasonably expect your ultimate approach to perform. In industry, it is not uncommon to find problems where (1) it is just not worth the effort to do more than something simple, or (2) no plausible method will do well enough to be considered successful. Of course, measuring these trade-offs depends on the context of your problem, but a quick pass with a simple model can often help you make an assessment.


 

7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?

 

It is usually better if you are not the first to evangelize the use of data. That said, data scientists will be most successful if they put themselves in situations where they have value to offer a business. Not all problems that are statistically interesting are important to a business. If you can deliver insights, products or predictions that have the potential to help the business then people will usually listen. Of course this is most effective when the data scientist clearly articulates the problem they are solving and what its impact will be.

The perceived importance of data science is also a critical aspect of choosing where to work – you should ask yourself if the company values what you will be working on and whether data science can really make it better. If this is the case then things will be much easier.


 

8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.

 

I lead the styling algorithms team at Stitch Fix. Among the problems we work on is making recommendations to our stylists, human experts who curate our recommendations for our clients. Making recommendations with humans in the loop is fascinating problem because it introduces an extra layer of feedback – the selections made by our stylists. Combining this feedback with direct feedback from our clients to make better recommendations is an interesting and challenging problem.


 

9. What is the biggest challenge of leading a data science team?

 

Hiring and growing a team are constant challenges, not least because there is not much consensus around what data science even is. In my experience a successful data science team needs people with a variety of skills. Hiring people with a command of applied statistics fundamentals is a key element, but having enough engineering experience and domain knowledge can also be important. At Stitch Fix we are fortunate to partner with a very strong data platform team, and this enables us to handle the engineering work that comes with taking on ever more ambitious problems.