I recently caught up with Alice Zheng a Director of Data Science at Dato – Alice is an expert on building scalable Machine Learning models and currently works for www.dato.com who are a company providing tooling to help you build scalable machine learning models easily. She is also a keen advocate of encouraging women in Machine Learning and Computer Science. Alice has a PhD from UC Berkeley and spent some of her post docs at Microsoft Research in Redmond. She is currently based in Washington State in the US.
1. What project have you worked on do you wish you could go back to, and do better?
Too many! The top of the list is probably my PhD thesis. I collaborated with folks in software engineering research and we proposed a new way of using statistics to debug software. They instrumented programs to spit out logs for each run that provide statistics on the state of various program variables. I came up with an algorithm to cluster the failed runs and the variables. The algorithm identifies variables that are most correlated with each subset of failures. Those variables, in turn, can take the programmer very close to the location of the bug in the code.
It was a really fun project. But I’m not happy with the way that I solved the problem. For one thing, the algorithm that I came up with had no theoretical guarantees. I did not appreciate theory when I was younger. But nowadays, I’m starting to feel bad about the lack of rigor in my own work. It’s too easy in machine learning to come up with something that seems to work, maybe even have an intuitive explanation for why it makes sense, and yet not be able to write down a mathematical formula for what the algorithm is actually doing.
Another thing that I wish I had learned earlier is to respect the data more. In machine learning research, the emphasis is on new algorithms and models. But solving real data science problems require having the right data, developing the right features, and finally using the right model. Most of the time, new algorithms and methods are not needed. But a combination of data, features, and model is the key. I wish I’d realized this earlier and spent less time focusing on just one aspect of the whole pipeline.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Be curious. Go deep. And study the arts.
Being curious gives you breadth. Knowing about other fields pulls you out of a narrow mindset focused on just one area of study. Your work will be more inspired, because you are drawing upon diverse sources of information.
Going deep into a subject gives you depth and expertise, so that you can make the right choices when trying to solve a problem, and so that you might more adequately assess the pros and cons of each approach.
Why study the arts? Well, if I had my druthers, art, music, literature, mathematics, statistics, and computer science would be required courses for K12. They offer completely different ways of understanding the world. They are complementary of each other. Knowing more than one way to see the world makes us more whole as human beings. Science _is_ an art form. Analytics is about problem solving, and it requires a lot of creativity and inspiration. It’s art in a different form.
3. What do you wish you knew earlier about being a data scientist?
Hmm, probably just what I said above–respect the data. Look at it in all different ways. Understand what it means. Data is the first class citizen. Algorithms and models are just helpers. Also, tools are important. Finding and learning to use good tools will save a lot of time down the line.
4. How do you respond when you hear the phrase ‘big data’?
Cringe? Although these days I’ve become de-sensitized. 🙂
I think a common misconception about “big data” is that, while the total amount of data maybe big, the amount of _useful_ data is very small in comparison. People might have a lot of data that has nothing to do with the questions they want to answer. After the initial stages of data cleaning and pruning, the data often becomes much much smaller. Not big at all.
5. What is the most exciting thing about your field?
So much data is being collected these days. Machine learning is being used to analyze them and draw actionable insights. It is being used to not just understand static patterns but to predict things that have not yet happened. Predicting what items someone is likely to buy or which customers are likely to churn, detecting financial fraud, finding anomalous patterns, finding relevant documents or images on the web. These applications are changing the way people do business, find information, entertain and socialize, and so much of it is powered by machine learning. So it has great practical use.
For me, an extra exciting part of it is to witness applied mathematics at work. Data presents different aspects of reality, and my job as a machine learning practitioner is to piece them together, using math. It is often treacherous and difficult. The saying goes “Lies, lies, and statistics.” It’s completely true; I often arrive at false conclusions and have to start over again. But it is so cool when I’m able to peel away the noise and get a glimpse of the underlying “truth.” When I’m getting nowhere, it’s frustrating. But when I get somewhere, it’s absolutely beautiful and gratifying.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Oh! I know the answer to this question: before embarking on a project, always think about “what will success look like? How would I be able to measure it?” This is a great lesson that I learned from mentors at Microsoft Research. It’s saved me from many a dead end. It’s easy to get excited about a new endeavor and all the cool things you’ll get to try out along the way. But if you don’t set a metric and a goal beforehand, you’ll never know when to stop, and eventually the project will peter out. If your goal IS to learn a new tool or try out a new method, then it’s fine to just explore. But with more serious work, it’s crucial to think about evaluation metrics up front.
7. You spent sometime at other firms before Dato. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
I think this is a continuous learning experience. Every organization is different, and it’s incredible how much of a leader’s personality gets imprinted upon the whole organization. I’m fascinated by the art and science behind creating successful organizations. Having been through a couple of very different companies makes me more aware of the differences between them. It’s very much like traveling to a different country: you realize that many of the things you took for granted do not actually need to be so. It makes me appreciate diversity. I also learn more about myself, about what works and what doesn’t work for me.
How to manage cultural challenges? I think the answer to that is not so different between work and life. No matter what the circumstance, we always have the freedom and the responsibility to choose who we want to be. How I work is a reflection of who I am. Being in a new environment can be challenging, but it can also be good. Challenge gets us out of our old patterns and demands that we grow into a new way of being. For me, it’s helpful to keep coming back to the knowledge of who I am, and who I want to be. When faced with a conflict, it’s important to both speak up and to listen. Speaking up (respectfully) affirms what is true for us. Listening is all about trying to see the other person’s perspective. It sounds easy but can be very difficult, especially in high stress situations where both sides hold to their own perspective. But as long as there’s communication, and with enough patience and skill, it’s possible to understand the other side. Once that happens, things are much easier to resolve.
8. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job?
I point to all the successful examples of data science today. With successful companies like Amazon, Google, Netflix, Uber, AirBnB, etc. leading the way, it’s not difficult to convince people that data science is useful. A lot of people are curious and need to learn more before they make the jump. Others may have already bought into it but just don’t have the resources to invest in it yet. The market is not short no demand. It is short on supply: data scientists, good tools, and knowledge. It’s a great time to be part of this ecosystem!