Interview with a Data Scientist: Greg Linden

Standard
I caught up with Greg Linden via email recently
Greg was one of the first people to work on data science in Industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I’ll quote his bio from Linkedin:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”
img_6330

Greg Linden: Source Personal Website

1. What project have you worked on do you wish you could go back to, and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997.The term wasn’t even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It’s hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It’s a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.

Interview with a Data Scientist – Jessica Graves

Standard

Jessica Graves is a Data Scientist who currently works on fashion problems in New York City. She’s worked with Hilary Mason at Fast Forward Labs and keeps in regular contact with the London startup scene. After many months of asking her for an interview she finally gave in – and she shares her unique perspective on the datafication of Fashion. She comes from a background in visual and performing arts, as well as fashion design. In her spare time you’ll find her reading a stack of papers or studying dance.

Cover image: unsplash.com CCO

Jessica Graves_02-1

  1. What project have you worked on do you wish you could go back to, and do better?

I worked with Dr. Laurens Mets on an iteration of the technology behind Electrochaea, a device where microbes convert waste electricity to clean natural gas. My job was to translate models from electrochemistry journals into code, to help simulate, measure and optimize the parameters of the device. We needed to facilitate electron transport and keep the microbes happy. Read papers, write code, and design alternative energy technology with math + data?! I would hand my past self How to Design Programs as a guide and learn to re-implement from scratch in an open source language. 

  1. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Listen! If you are a data scientist, listen carefully to the business problems of your industry, and see the problems for what they are, rather than putting the technical beauty of and personal interest in the solution first and foremost. You may find it’s more important to you to work with a certain type of problem than it is to work at a certain type of company, or vice versa. Watch very carefully when your team expresses frustration in general – articulate problems that no one knows they should be asking you to solve. At the same time, it can be tempting to work on a solution that has no problem. If you’re most interested in a specific machine learning technique, can you justify its use over another, or will high technical debt be a serious liability? Will a project be leveragable (legally, financially, technically, operationally)? Can you quantify the risk of not doing a project? 

  1. What do you wish you knew earlier about being a data scientist?

I wish I realized that data science is classical realist painting.

Classical realists train to accurately represent a 3D observation as a 2D image. In the strictest cases, you might not be allowed to use color for 1-3 years, working only with a stick of graphite, graduating to charcoal and pencils, eventually monotone paintings. Only after mastering the basics of form, line, value, shade, tone, are you allowed a more impactful weapon, color. With oil painting in particular, it matters immensely in what order at what layer you add which colors, which chemicals compose each color, of which quality pigment, at what thickness, with what ratio of which medium, with which shape of brush, at what angle, after what period of drying. Your primary objective is to continuously correct your mistakes of translating what you observe and suspending your preconception of what an object should look like.

There are many parallels with data science. At no point as a classical realist painter should you say, ‘well it’s a face, so I’m going to draw the same lines as last time’ just like as a data scientist, you should look carefully at the data before applying algorithm x, even if that’s what every blog post Google surfaces to the top of your results says to do in that situation. You have to be really true to what you observe and not what you know – sometimes a hand looks more like a potato than a hand, and obsessing over anatomical details because you know it’s a hand is a mistake. Does it produce desirable results in the domain of problems that you’re in? Are you assuming Gaussian distributions on skewed data? Did you go directly to deep learning when logistic regression would have sufficed? I wish I knew how often data science course offerings are paint by numbers. You won’t get very far once the lines are removed, the data is too big to extract on your laptop, and an out-of-memory error pops up running what you thought was a pretty standard algorithm on the subset you used instead. Let alone that you have to create or harvest the data set in the first place – or sweet talk someone into letting you have access to it.  

In addition, Nulla dies sine linea – it’s true for drawing, ballet, writing. It’s true for data science. No day without a line. It’s very difficult to achieve sophistication without crossing off days and days of working through code or theoretical examples (I think this is why Recurse Center is so special for programmers). Sets of bland but well-executed tiny piece of software. Unspectacular, careful work in high volumes raises the quality of all subsequent complex works. Bigger, slower projects benefit from myriads of partially explored pathways you already know not to take.

Also side notes to my past self: Linux. RAM. Thunderbolt ports. 

  1. How do you respond when you hear the phrase ‘big data’?

Big data? Like in the cloud? Or are we in the fog now? Honestly the first thing I see in my mind is PETABYTES. I think of petabytes of selfies raining from the sky and flowing into a data lake. Stagnant. Data-efficient AI is all the rage — less data, more primitives, smarter agents. In the meantime, optimizing hardware and code to work with large datasets is pretty fun. Fetishizing the size of the data works well …as long as you don’t care about robustness to diverse inputs. Can your algorithm do well with really niche patterns? What can you do with the bare minimum amount of data? 

  1. What is the most exciting thing about your field?

Fashion is visual. It’s inescapable. Every culture has garb or adornment, however minimal. A few trillion dollars of apparel, textiles, and accessories across the globe. The problems of the industry are very diverse and largely unsolved. A biologist might come to fashion to grow better silk. An AI researcher might turn to deep learning to sift through the massive semi-structured set of apparel images available online. So many problems that may have a tech solution are unsolved. Garment manufacturing is one of the most neglected areas of open source software development. LVMH and Richemont don’t fight over who provided the most sophisticated open-source tools to researchers the way that Amazon and Google do. You can start a deep learning company on a couple grand and use state-of-the-art software tools for cheap or free. You cannot start an apparel manufacturing vertical using state-of-the-art tools without serious investment, because the climate is still extremely unfavorable to support a true ecosystem of small-scale independent designers. The smartest software tools for the most innovative hardware are excessively expensive, closed-source, and barely marketed — or simply not talked about in publically accessible ways. Sewing has resisted automation for decades, although is finally now at a place now were the joining of fabrics into a seam is robot-automatable with computer vision used on a thread-by-thread basis to determine the location of the next stitch. 

High end, low end, or somewhere in between, the apparel side of fashion’s output is a physical object that has to be brought to life from scratch, or delivered seamlessly, to a human, who will put the object on their body. Many people participate in apparel by default, but the fashion crowd is largely self-selected and passionate, so it’s exciting (and difficult) to build for such an engaged group that don’t fit standard applications of standard machine learning algorithms.

  1. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

Artists learn this eventually: volume of works produced trumps perfectionism. Even to match something in classical realism, you start with ridiculous abstractions. Cubes and cylinders to approximate heads and arms. Break it down into the smallest possible unit. Listen to Polya, “If you can’t solve a problem, then there is an easier problem you can solve: find it.”

As for when to finish? Nothing is never good enough. The thing that is implemented is better than the abstract, possibly better thing, for now, and will probably outlive its original intentions. But make sure that solution correlates thoroughly with the problem, as described in the words of the stakeholder. Otherwise, for a consumer-facing product or feature, your users will usually give you clues as to what’s working. 

  1. You spent sometime as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Be open. Fashion has a lot of space for innovation if you understand and quantify your impact on problems that are actually occurring and costing money or time, and show that you can solve them fast enough. “We built this new thing” has absolutely nothing to do with “We built this useful thing” and certainly not “We built this backwards-compatible thing”. You might be tempted to recommend a “new thing” and then complain that fashion isn’t sophisticated enough or “data” enough for it. As an industry that in some cases has largely ignored data for gut feelings with a serious payoff, I think the attitude should be more of pure respect than of condescension, and of transitioning rather than scrapping. That or build your own fashion thing instead of updating existing ones.  

  1. You have worked in fashion. Can you talk about the biggest opportunities for data in the fashion industry. Are there cultural challenges with datafication in such a ‘creative industry’.

Fashion needs ‘datafication’ that clearly benefits fashion. If you apply off-the-shelf collaborative filtering to fashion items with a fixed seasonal shelf life to users that never really interact with, you’re going to get poor results. Algorithms that work badly in other domains might work really well in fashion with a few tweaks. NIPS had an ecommerce workshop last year, and KDD has a fashion-specific workshop this year, which is exciting to see, although I’ll point out that researchers have been trying to solve textile manufacturing problems with neural networks since the 90s.

A fashion creative might very well LOVE artificial intelligence, machine learning, and data science if you tailor your language into what makes their lives easier. Louis Vuitton uses an algorithm to arrange handbag pattern pieces advantageously on a piece of leather (not all surfaces of the leather are appropriate for all pattern pieces of the handbag) and marks the lines with lasers before artisans hand-cut the pieces. The artisans didn’t seem particularly upset about this. 

The two main problems I still see right now are the doorman problem and fit. Use data and software to make it simple for designers of all scales to adjust garments to fit their real markets instead of their imagined muses. And, use as little input as possible to help online shoppers know which existing items will fit. Once they buy, make sure they get their packages on time, securely, discreetly. 

Interview with a Data Scientist: Phillip Higgins

Standard

Phillip Higgins is a data science consultant based in New Zealand. His experience includes financial services and working for SAS, amongst other experience including some in Germany.

What project have you worked on do you wish you could go back and do better?

Hindsight is a wonderful thing, we can always find things we could have done better in projects.  On the other hand, analytic and modelling projects are often frought with uncertainty- uncertainty that despite the best planning, is not available to foresight. Most modelling projects that I have worked on could have been improved with the benefit of better foresight!

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Firstly, I would advise younger analytics professionals to develop both deep knowledge of a particular area and at the same time, to broaden their knowledge and to maintain this focus of learning on both specialised and general subjects throughout their careers.  Secondly, its important to gain as much practice as possible – data science is precisely that because it deals with real-world problems.  I think PhD students should cultivate industry contacts and network widely- staying abreast of business and technology trends is essential.

What do you wish you knew earlier about being a data scientist?
Undoubtedly I wish I knew the importance of communication skills in the whole analytics life-cycle.  Its particularly important to be able to communicate findings to a wide audience and so refined presentation skills are a must.

How do you respond when you hear the phrase ‘Big Data’?

I think Big Data offers data scientists with new possibilities in terms of the work they are able to perform and the significance of their work.  I don’t think it’s a coincidence that the importance and demand of data scientists has risen sharply right at the time that Big Data has become mainstream- for Big Data to yield insights, “Big Analytics” need to be performed – they go hand in hand.

What is the most exciting thing about your field?

For me personally it’s the interesting people I meet along the way.  I’m continually astounded by the talented people I meet.

How do you go about framing a data problem – in particular, how do you manage expectations etc.  How do you know what is good enough?

I think its important to never lose sight of the business objectives that are the rationale for most data-scientific projects.  Although it is essential that businesses allow for data science to disprove hypotheses, at the end of the day, most evidence will be proving hypotheses (or disproving the null hypothesis).  The path to formulating those hypotheses lies obviously mostly in exploratory data analysis (combined with domain knowledge).  It is important to communicate this uncertainty as to framing from the outset, so that there are no surprises.

You spent some time as a consultant in data analytics.  How did you manage cultural challenges, dealing with stakeholders and executives?  What advice do you have for new starters about this?

In consulting you get to mix with a wide variety of stakeholders and that’s certainly an enjoyable aspect of the job.  I have dealt with a wide range of stakeholders, from C-level executives through to mid- level managers and analysts and each group requires a different approach.  A stakeholder analysis matrix is a good place to start- analysing stakeholders by importance and influence.  Certainly, adjusting your pitch and being aware of the politics behind and around any project is very important.

 

Interview with a Data Scientist: Ivana Balazevic

Standard

Ivana Balazevic is a Data Scientist at a Berkeley based startup Wise.io, where she is working in a small team of data scientists on solving problems in customer service for different clients. She did her bachelor’s degree in Computer Science at the Faculty of Electrical Engineering and Computing in Zagreb and she recently finished her master’s degree in Computer Science with the focus on Machine Learning at the Technical University Berlin.

 

1. What do you think about ‘big data’?

I try not to think about it that much, although nowadays that’s quite hard to avoid. 🙂 It’s definitely an overused term, a buzzword.

I think that adding more and more data can certainly be helpful up to a point, but the outcome of majority of the problems that people are trying to solve depends primarily on the feature engineering process, i.e. on extracting the necessary information from the data and deciding which features to create. However, I’m certain there are problems out there which require large amounts of data, but they are definitely not so common for the whole world to obsess about.

 

2. What is the hardest thing for you to learn about data science?

I would say the hardest things are those which can’t be learned at school, but which you gain through experience. Coming out of school and working mostly on toy datasets, you are rarely prepared for the messiness of the real-world data. It takes time to learn how to deal with it, how to clean it up, select the important pieces of information, and transform this information into good features. Although that can be quite challenging, it is a core process of the whole data science creativity and one of the things that make data science so interesting.

 

3. What advice do you have for graduate students in the sciences who wish to become Data Scientists?

I don’t know if I’m qualified enough to give such advice, being a recent graduate myself, but I’ll try to write down things that I learned from my own experience.

Invest time in your math and statistics courses, because you’re going to need it. Take a side project, which might give you a chance to learn some new programming concepts and introduce you to interesting datasets. Do your homeworks and don’t be afraid to ask questions whenever you don’t understand something in the lecture, since the best time to learn the basics is now and it’s much harder to fill those holes in knowledge than to learn everything the right way from the beginning.

 

4. What project would you back to do and change? How would you change it?

Most of them! I often catch myself looking back at a project I did a couple of years ago and wishing I knew then what I know now. The most recent project is my master’s thesis, I wish I tried out some things I didn’t have time for, but I hope I’ll manage to catch some time to work on it further in the next couple of months.

 

5. How do you go about scoping a data science project?

Usually when I’m faced with a new dataset, I get very excited about it and can’t wait to dig into it, which gets in the way of all the planning that should have been done beforehand. I hope I’ll manage to become more patient about it with time and learn to do it the “right” way.

One of the things that I find a bit limiting about the industry is that you often have to decide whether something is worth the effort of trying it out, since there are always certain deadlines you need to hold on to. Therefore, it is very important to have a clear final goal right from the beginning. However, one needs to be flexible and take into account that things at the end user’s side might change along the way and be prepared to adapt to the user’s needs accordingly.

 

6. What do you wish you knew earlier about being a data scientist?

That you don’t spend all of your time doing the fun stuff! A lot of the work done by the data scientists is invested into getting the data, making it into the right format, cleaning it up, battling different encoding issues, writing tests for the code you wrote, etc. When you sum everything up, you spend only a part of your time doing the actual “data science magic”.

 

7. What is the most exciting thing you’ve been working on lately?

We are a small team of data scientists at Wise who are working on many interesting projects. I am mostly involved with the natural language processing tasks, since that is the field I’m planning to do my PhD in starting this fall. My most recent project is on expanding the customer service support to multilingual datasets, which can be quite challenging considering the highly skewed language distribution (80% English, 20% all other languages) in the majority of datasets we are dealing with.

 

8. How do you manage learning the ‘soft’ skills and the ‘hard’ skills? Any tips?

Learning the hard skills requires a lot of time, patience, and persistence, and I highly doubt there is a golden formula for it. You just have to read a lot of books and papers, talk to people that are smarter and/or have more experience than you and be patient, because it will all pay off.

Soft skills, on the other hand, somehow come naturally to me. I’m quite an open person and I’ve never had problems talking to people. However, if you do have problems with it, I suggest you to take a deep breath, try to relax, focus and tell yourself that the people you are dealing with are just humans like you, with their good and bad days, their strengths and imperfections. I believe that picturing things this way takes a lot of pressure off your chest and gives you the opportunity to think much more clearly.

Interview with a Data Scientist: Brad Klingenberg

Standard

Bio

Brad Klingenberg is the Director of Styling Algorithms at Stitch Fix in San Francisco. His team uses data and algorithms to improve the selection of merchandise sent to clients. Prior to joining Stitch Fix Brad worked with data and predictive analytics at financial and technology companies. He studied applied mathematics at the University of Colorado at Boulder and earned his PhD in Statistics at Stanford University in 2012.


 

1. What project have you worked on do you wish you could go back to, and do better?

 

Nearly everything! A common theme would be not taking the framing of a problem for granted. Even seemingly basic questions like how to measure success can have subtleties. As a concrete example, I work at Stitch Fix, an online personal styling service for women. One of the problems that we study is predicting the probability that a client will love an item that we select and send to her. I have definitely tricked myself in the past by trying to optimize a measure of prediction error like AUC.

This is trickier than it seems because there are some sources of variance that are not useful for making recommendations. For example, if I can predict the marginal probability that a given client will love any item then that model may give me a great AUC when making predictions over many clients, because some clients may be more likely to love things than others and the model will capture this. But if the model has no other information it will be useless for making recommendations because it doesn’t even depend on the item. Despite its AUC, such a model is therefore useless for ranking items for a given client. It is important to think carefully about what you are really measuring.


 

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?

 

Focus on learning the basic tools of applied statistics. It can be tempting to assume that more complicated means better, but you will be well-served by investing time in learning workhorse tools like basic inference, model selection and linear models with their modern extensions. It is very important to be practical. Start with simple things.

Learn enough computer science and software engineering to be able to get things done. Some tools and best practices from engineering, like careful version control, go a long ways. Try to write clean, reusable code. Popular tools in R and Python are great for starting to work with data. Learn about convex optimization so you can fit your own models when you need to – it’s extremely useful to be able to cast statistical estimates as the solution to optimization problems.

Finally, try to get experience framing problems. Talk with colleagues about problems they are solving. What tools did they choose? Why? How should did they measure success? Being comfortable with ambiguity and successfully framing problems is a great way to differentiate yourself. You will get better with experience – try to seek out opportunities.


 

3. What do you wish you knew earlier about being a data scientist?

 

I have always had trouble identifying as a data scientist – almost everything I do with data can be considered applied statistics or (very) basic software engineering. When starting my career I was worried that there must be something more to it – surely, there had to be some magic that I was missing. There’s not. There is no magic. A great majority of what an effective data scientist does comes back to the basic elements of looking at data, framing problems, and designing experiments. Very often the most important part is framing problems and choosing a reasonable model so that you can estimate its parameters or make inferences about them.


 

4. How do you respond when you hear the phrase ‘big data’?

 

I tend to lose interest. It’s a very over-used phrase. Perhaps more importantly I find it to be a poor proxy for problems that are interesting. It can be true that big data brings engineering challenges, but data science is generally made more interesting by having data with high information content rather than by sheer scale. Having lots of data does not necessarily mean that there are interesting questions to answer or that those answers will be important to your business or application. That said, there are some applications like computer vision where it can be important to have a very large amount of data.


 

5. What is the most exciting thing about your field?

 

While “big data” is overhyped, a positive side effect has been an increased awareness of the benefits of learning from data, especially in tech companies. The range of opportunities for data scientists today is very exciting. The abundance of opportunities makes it easier to be picky and to find the problems you are most excited to work on. An important aspect of this is to look in places you might not expect. I work at Stitch Fix, an online personal styling service for women. I never imagined working in women’s apparel, but due to the many interesting problems I get to work on it has been the most exciting work of my career.


 

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

 

As I mentioned previously, it can be helpful to start framing a problem by thinking about how you would measure success. This will often help you figure out what to focus on. You will also seldom go wrong by starting simple. Even if you eventually find that another approach is more effective a simple model can be a hugely helpful benchmark. This will also help you understand how well you can reasonably expect your ultimate approach to perform. In industry, it is not uncommon to find problems where (1) it is just not worth the effort to do more than something simple, or (2) no plausible method will do well enough to be considered successful. Of course, measuring these trade-offs depends on the context of your problem, but a quick pass with a simple model can often help you make an assessment.


 

7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?

 

It is usually better if you are not the first to evangelize the use of data. That said, data scientists will be most successful if they put themselves in situations where they have value to offer a business. Not all problems that are statistically interesting are important to a business. If you can deliver insights, products or predictions that have the potential to help the business then people will usually listen. Of course this is most effective when the data scientist clearly articulates the problem they are solving and what its impact will be.

The perceived importance of data science is also a critical aspect of choosing where to work – you should ask yourself if the company values what you will be working on and whether data science can really make it better. If this is the case then things will be much easier.


 

8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.

 

I lead the styling algorithms team at Stitch Fix. Among the problems we work on is making recommendations to our stylists, human experts who curate our recommendations for our clients. Making recommendations with humans in the loop is fascinating problem because it introduces an extra layer of feedback – the selections made by our stylists. Combining this feedback with direct feedback from our clients to make better recommendations is an interesting and challenging problem.


 

9. What is the biggest challenge of leading a data science team?

 

Hiring and growing a team are constant challenges, not least because there is not much consensus around what data science even is. In my experience a successful data science team needs people with a variety of skills. Hiring people with a command of applied statistics fundamentals is a key element, but having enough engineering experience and domain knowledge can also be important. At Stitch Fix we are fortunate to partner with a very strong data platform team, and this enables us to handle the engineering work that comes with taking on ever more ambitious problems.

Interview with a Data Scientist: Alice Zheng

Standard
I recently caught up with Alice Zheng a Director of Data Science at Dato – Alice is an expert on building scalable Machine Learning models and currently works for www.dato.com who are a company providing tooling to help you build scalable machine learning models easily. She is also a keen advocate of encouraging women in Machine Learning and Computer Science. Alice has a PhD from UC Berkeley and spent some of her post docs at Microsoft Research in Redmond. She is currently based in Washington State in the US.

1. What project have you worked on do you wish you could go back to, and do better?
Too many! The top of the list is probably my PhD thesis. I collaborated with folks in software engineering research and we proposed a new way of using statistics to debug software. They instrumented programs to spit out logs for each run that provide statistics on the state of various program variables. I came up with an algorithm to cluster the failed runs and the variables. The algorithm identifies variables that are most correlated with each subset of failures. Those variables, in turn, can take the programmer very close to the location of the bug in the code.
It was a really fun project. But I’m not happy with the way that I solved the problem. For one thing, the algorithm that I came up with had no theoretical guarantees. I did not appreciate theory when I was younger. But nowadays, I’m starting to feel bad about the lack of rigor in my own work. It’s too easy in machine learning to come up with something that seems to work, maybe even have an intuitive explanation for why it makes sense, and yet not be able to write down a mathematical formula for what the algorithm is actually doing.
Another thing that I wish I had learned earlier is to respect the data more. In machine learning research, the emphasis is on new algorithms and models. But solving real data science problems require having the right data, developing the right features, and finally using the right model. Most of the time, new algorithms and methods are not needed. But a combination of data, features, and model is the key. I wish I’d realized this earlier and spent less time focusing on just one aspect of the whole pipeline.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Be curious. Go deep. And study the arts.
Being curious gives you breadth. Knowing about other fields pulls you out of a narrow mindset focused on just one area of study. Your work will be more inspired, because you are drawing upon diverse sources of information.
Going deep into a subject gives you depth and expertise, so that you can make the right choices when trying to solve a problem, and so that you might more adequately assess the pros and cons of each approach.
Why study the arts? Well, if I had my druthers, art, music, literature, mathematics, statistics, and computer science would be required courses for K12. They offer completely different ways of understanding the world. They are complementary of each other. Knowing more than one way to see the world makes us more whole as human beings. Science _is_ an art form. Analytics is about problem solving, and it requires a lot of creativity and inspiration. It’s art in a different form.

3. What do you wish you knew earlier about being a data scientist?
Hmm, probably just what I said above–respect the data. Look at it in all different ways. Understand what it means. Data is the first class citizen. Algorithms and models are just helpers. Also, tools are important. Finding and learning to use good tools will save a lot of time down the line.

4. How do you respond when you hear the phrase ‘big data’?
Cringe? Although these days I’ve become de-sensitized. 🙂
I think a common misconception about “big data” is that, while the total amount of data maybe big, the amount of _useful_ data is very small in comparison. People might have a lot of data that has nothing to do with the questions they want to answer. After the initial stages of data cleaning and pruning, the data often becomes much much smaller. Not big at all.

5. What is the most exciting thing about your field?
So much data is being collected these days. Machine learning is being used to analyze them and draw actionable insights. It is being used to not just understand static patterns but to predict things that have not yet happened. Predicting what items someone is likely to buy or which customers are likely to churn, detecting financial fraud, finding anomalous patterns, finding relevant documents or images on the web. These applications are changing the way people do business, find information, entertain and socialize, and so much of it is powered by machine learning. So it has great practical use.
For me, an extra exciting part of it is to witness applied mathematics at work. Data presents different aspects of reality, and my job as a machine learning practitioner is to piece them together, using math. It is often treacherous and difficult. The saying goes “Lies, lies, and statistics.” It’s completely true; I often arrive at false conclusions and have to start over again. But it is so cool when I’m able to peel away the noise and get a glimpse of the underlying “truth.” When I’m getting nowhere, it’s frustrating. But when I get somewhere, it’s absolutely beautiful and gratifying.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 
Oh! I know the answer to this question: before embarking on a project, always think about “what will success look like? How would I be able to measure it?” This is a great lesson that I learned from mentors at Microsoft Research. It’s saved me from many a dead end. It’s easy to get excited about a new endeavor and all the cool things you’ll get to try out along the way. But if you don’t set a metric and a goal beforehand, you’ll never know when to stop, and eventually the project will peter out. If your goal IS to learn a new tool or try out a new method, then it’s fine to just explore. But with more serious work, it’s crucial to think about evaluation metrics up front.

7. You spent sometime at other firms before Dato. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
I think this is a continuous learning experience. Every organization is different, and it’s incredible how much of a leader’s personality gets imprinted upon the whole organization.  I’m fascinated by the art and science behind creating successful organizations. Having been through a couple of very different companies makes me more aware of the differences between them. It’s very much like traveling to a different country: you realize that many of the things you took for granted do not actually need to be so. It makes me appreciate diversity. I also learn more about myself, about what works and what doesn’t work for me.
How to manage cultural challenges? I think the answer to that is not so different between work and life. No matter what the circumstance, we always have the freedom and the responsibility to choose who we want to be. How I work is a reflection of who I am. Being in a new environment can be challenging, but it can also be good. Challenge gets us out of our old patterns and demands that we grow into a new way of being. For me, it’s helpful to keep coming back to the knowledge of who I am, and who I want to be. When faced with a conflict, it’s important to both speak up and to listen. Speaking up (respectfully) affirms what is true for us. Listening is all about trying to see the other person’s perspective. It sounds easy but can be very difficult, especially in high stress situations where both sides hold to their own perspective. But as long as there’s communication, and with enough patience and skill, it’s possible to understand the other side. Once that happens, things are much easier to resolve.

8. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
I point to all the successful examples of data science today. With successful companies like Amazon, Google, Netflix, Uber, AirBnB, etc. leading the way, it’s not difficult to convince people that data science is useful. A lot of people are curious and need to learn more before they make the jump. Others may have already bought into it but just don’t have the resources to invest in it yet. The market is not short no demand. It is short on supply: data scientists, good tools, and knowledge. It’s a great time to be part of this ecosystem!

Interview with a Data Scientist: Maria Rosario Mestre

Standard

I recently caught with with Maria Rosario Mestre – she shared her personal views on Data Science – like all these interviewee subjects – these do not reflect her employers views.

Linkedin profile picture

Biography – Maria: 

I completed a PhD in signal processing at Cambridge developing models of user behaviour using brain data. After the PhD I joined Skimlinks as a data scientist, where I model online user behaviour and work on much larger datasets. My main role is implementing large-scale machine learning models processing terabytes of data.


What project have you worked on do you wish you could go back to, and do better?

I think that pretty much applies to any project you do as a data scientist. When you’re developing algorithms that become a service used by someone either internally or externally, I think it is best to use an iterative approach where you wait for some feedback from the client before doing any further improvements. I am a true believer of “lean data science”.


What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I guess it depends what the advice is for . If is it for PhD students thinking about a career as a data scientist in industry, then I would strongly recommend them to get some experience working on real-world data at some point during the PhD. It is quite common in academia to work mainly on synthetic data. In addition to that, I would say it is important to keep a curious and open mind about the research carried out by other people, since it is very easy to only stay focused on your specific research project. For analytics professionals, I would say that learning how to code is quite useful, especially in a scripting language like Python. Knowing some classical statistics is also very helpful, if you want to learn how to apply a scientific approach to any type of data analysis.


What do you wish you knew earlier about being a data scientist?

There is not much I can think about, but maybe I wish I had spent more time using version control platforms, such as github. During my PhD I had a very rudimentary version control method: copying my whole project into a different folder with today’s date. It was definitely not the best way of managing my project. In my current role we work on a shared codebase and we need to keep track of changes, so I had to start using github. I wish I had taken more time to learn how to use it properly before diving into it, as it would have saved me a lot of time.


How do you respond when you hear the phrase ‘big data’?

I say that’s boring, now it’s all about “massive data”! Now seriously, I have experienced big data at Skimlinks, where we run daily jobs on terabytes of data using Spark. I think “big data” is a real thing, but people sometimes believe they have it when they don’t, or if they have it, then they think they need to do something about it, but don’t know what. I don’t think that you should approach “big data” as a solution in search of a problem. You should always think of the problem first that you’re trying to solve, see if your data scale qualifies as “big data”, and then finally start using big data tools once you have defined all these parameters. It is a waste of time and resources to start using these tools just because they are fashionable and you’re scared of missing out.


What is the most exciting thing about your field?

I find solving real problems exciting, and if these problems are hard, then it’s double as exciting. As a data scientist, you have to solve hard problems all the time, mainly because real data is never like in the textbooks! It’s always biased, with missing columns or wrong values. Then, I also find it exciting to solve problems with large-scale data. It is very easy to use out-of-the-box Python libraries to run a machine learning algorithm, but what happens when you have to adapt that algorithm to run on 500 gigabytes? That’s when you need to start thinking creatively using the tools you already know to solve a new problem. You might even be the first person to solve such a problem!

In more general terms, I think that machine learning will have a huge impact on our daily lives. We have already started seeing the effects now that we are always connected and use increasingly intelligent apps, but I think this is only the beginning.


How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough? 

This is a great question, and one that I keep asking myself. As I said earlier, I believe in lean data science. What this means is that I believe you need to start with a very clear objective you are trying to solve and use an iterative approach over it, always gathering feedback from the end user. If possible, the end goal should be stated in clear objective metrics, like increasing the accuracy of a classifier by 10%, or make better recommendations in 20% of the cases. You know it’s good enough when the end user is happy. I also believe that sometimes when you look at a problem from a lot of different angles and don’t seem to make a lot of progress, it is good to document all the attempts, leave it on the side, and get back to it later with a fresh pair of eyes.


How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?

As a data scientist, your role is not only to develop algorithms, but also to be an evangelist in your own company on the use of data science, and generally the scientific method. If you want to convince business people that data science is important, then the best you can do is talk business. You need to think of data science projects in terms of the value they can add to your business, either because they can increase conversion rates, or keep some customers happy, or make someone’s job in the company much easier… You can start by running small experiments and gather some results to show to the executives in your company. However, data science is not the solution to any problem, and sometimes a simple rule-based model could do the job just as well. It is important not to oversell what you can do, and be realistic about what you can offer.


What is the most exciting thing you’ve been working on lately and tell us a bit about?

Skimlinks is about to launch a new product in the coming weeks, and the data science team has been heavily involved in its making. I cannot say much about it unfortunately, but these are exciting times for the company. From a technical point of view, the last thing that I have done which was exciting was classifying 1.2 billion data points using Spark. I broke a personal record in terms of the size of the data involved.


What is the biggest challenge of building a data science team?

I would have to ask my manager, since I have never built a team myself. I have been involved in the hiring process though, and I think it is sometimes difficult to find the right combination of skills across the team. You want some people who have experience working with data, others than may be stronger in engineering. It is also important to manage people’s expectations about the role, since data scientists spend a lot of time doing data processing and setting up data pipelines before they can apply machine learning algorithms. It’s all part of the job!