Building Full-Stack Vertical Data Products


I’ve been in the Data Science space for a number of years now. I first got interested in AI/Machine Learning in 2009, and I have a background typical of many people in my field – I come from Physics and Mathematics.

One trend I’ve run into at both corporates and startups is that there are many challenges to deploying Data Science in a bureaucratic organisation – to delivering Enterprise Intelligence. Running into this problem is what got me interested in building data products.

One of the first people I saw building AI startups was Bradford Cross – and he’s been writing lately about his predictions for 2017 in the Machine Learning startup space.

I agree with his précis that we’ll begin to see successful vertically-oriented AI startups solving full-stack industry problems that require subject-matter expertise, unique data, and a product that uses AI to deliver its core value proposition.

At Elevate Direct we’re working on the problem of sourcing and hiring contractors – one of the fundamental problems companies have: hiring the best contractor talent out there.

So what are some of the reasons it can be hard to deploy Data Science internally at a corporate organisation? I think a number of these patterns are related to anti-patterns we see in software delivery more generally.

  1. Not being capable of building consumer-facing software – Large (non-tech) organisations sometimes struggle to build and deliver software internally. I’ve seen a number of organisations fail at this – their build process can take 6 months.
  2. Organisational anti-patterns – I’ve seen organisational structures that rapidly inhibit the ability to deploy product. Some of these anti-patterns are driven by concerns about the risk of deploying software, and they often end up with diffuse ownership – where an R and D team can blame the operations team and vice versa.
  3. Building Data Products is risky – Building data products is hard and risky – I think you really need to approach them in a lean-startup kind of way: deploy often; if it works it works, if not cut it. The middle management of large corporates is sometimes risk-averse and so finds these kinds of projects scary. It also needs a lot of expertise – subject-matter expertise, software expertise, machine learning expertise.
  4. Not allowing talented technical practitioners to use Open Source or pick their own tools – I once worked at a FTSE 100 company where it took me about 6 weeks to get Open Source tools such as R and Python installed. It severely restricted my productivity – in that time at a startup my team would probably have deployed about 1000 changes to a customer-facing app in production. This relates to point 3 above: don’t restrict the ability of your talented and well-trained people to deliver value. It makes no sense from a business point of view. Data Science produces value only when it produces products or insights for the business or the customers.
  5. Not having a Data Strategy – Data Science is most valuable when it aligns with the business strategy. Too often I’ve seen companies hire data scientists before they have actual problems for them to work on. I’ve written about this before.
  6. Long-term outsourcing deals – This is an insidious one, and it comes from a period when “IT didn’t matter”, before big tech companies proved the value of, for example, e-commerce in the consumer space. It’s impossible to predict what the key tech will be for the next 10 years, so don’t lock yourself to a vendor for that period of time. Luckily this trend is reversing – we’re seeing the rise of agile, MVPs, cloud computing, design thinking, and getting closer to the customer. A great article on this re-shoring is here.

I think fundamentally a lot of these anti-patterns come from not knowing how to handle risk correctly. I like the idea in that RedMonk article that big outsourcing is a bit like CDOs in finance: bundling the risk into one big lump doesn’t make the risk go away.

I learn this day after day working on building data products and tools at Elevate. Being honest about the risks and working hard to drive down that risk in an agile way is the best we can do.

Finally, I think we’re just getting started building Data Products and deploying data science. It’ll be interesting to see what other anti-patterns emerge as we grow up as an industry. This is also one of the reasons I’ve joined a startup, and why I’m very excited to work on an end-to-end Data Product that solves a real business problem.


Interview with a Data Scientist: Juan Pablo Isaza Aristizábal

I recently gave a keynote at www.pycon.co, the first PyCon conference in Colombia. I spoke on Data Science models in production – lessons learned and the cultural aspects.
I interviewed a Colombian Data Scientist – Juan Pablo Isaza Aristizábal
1. What project have you worked on do you wish you could go back to, and do better?
Back in 2015 I was working for Tappsi, a popular app for calling taxis. They have a huge problem fulfilling cab demand at peak hours, because Bogotá has a horrible traffic congestion problem that is only getting worse. So we were using algorithms to try to fulfil as much demand as we could. One of the projects was to predict demand over a 10-minute future window for each neighbourhood, so that drivers could head to the neighbourhoods with the highest odds. I did the real-time data ingestion and machine learning, and we published an MVP of the feature, but the algorithm was too slow. In the end the project was never completed because the developers were doing other things, and I ended up optimising other algorithms that improved our metrics with half the hassle and complexity. Afterwards I realised that lack of experience had led me to write the program in the wrong language and to make wrong assumptions that resulted in low performance.
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
I think academia is a great place to learn abstract and complex subjects, while industry is a great place to learn practical and social skills. It’s easier to succeed in a working environment if you find a balance between being academic enough and not forgetting practical aspects and communication skills with non-tech people. At university there are brilliant people working on cutting-edge topics, although they might not know how to deal with more concrete aspects. In industry you see fast, practical developers who know the latest tools but fall short trying to optimise a SQL query because they don’t know how an indexed query works behind the scenes in a database.
So I would advise you to study as much as you can, but always try to think about the possible applications of what you are learning. Also, communication skills with non-tech people are extremely important; I have seen a couple of people in meetings with salespeople trying to explain statistical tests and p-values with no success.
3. What do you wish you knew earlier about being a data scientist?
I wish I had had an earlier chance of working at startups. Being in Colombia made it difficult to work in the tech industry; before that I had a couple of jobs not related to data science or software development.
4. How do you respond when you hear the phrase ‘big data‘?
I find the term a little misleading and simplistic. It’s popular because it’s easy for the general public to understand, while algorithms are not. Technically the term might just be a synonym for a handful of tools such as Amazon Redshift and Hadoop. Big data has enabled new and exciting applications, but it’s by no means the only or biggest factor contributing to current advances in the field; new algorithms such as deep neural networks and reinforcement learning, and a strong open source community, have enabled a lot of improvement over the last few years.
5. What is the most exciting thing about your field?
For me it is the excitement of science, which I have embraced since I was a little kid, combined with the development speed and practicality of engineering. Being able to take an idea and transform it into a working prototype in a few days is an amazing feeling; machine learning applications especially are really exciting to work on.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Usually I take an iterative approach, starting with the most obvious relations and the easiest data to handle, then trying to get to the answer in a series of incremental steps as I refine each input and expand the data set. It’s similar to how you would build an MVP: first it is simple, then it gets better with each version, until finally the customer or user says: that’s good enough!
7. Can you talk a bit about the state of the tech industry and data science in Colombia? What would you change? What gives you hope?
Data science is arriving as a byproduct of software development; there isn’t much of it yet, but we are improving in giant steps. In the last few years startups such as Tappsi, Domicilios, Mercadoni, Rappi and Bunny have become more commonplace than in the past. What gives me hope is the peace deal with the FARC guerrilla group; this crucial event will bring many more foreigners here, as well as investment that can power new ideas.
I would shift the local focus of many startups to a more global one. This is difficult because the market is small and the economy is far behind other nations, making our problems and solutions different from those of more advanced economies. Still, there are Colombian startups with a global focus, such as Bunny or VOIQ.
Thanks and best regards!

Cookies


Based on http://larahogan.me/donuts/

Taken from her blog

Years ago, I found that whenever something awesome happened in my career – maybe I got published, or promoted, or launched a project – I wouldn’t take the time to celebrate the achievement. I’m an achiever by nature, the kind who feels like every day starts at zero. Not deliberately marking these moments left me feeling like I wasn’t actually accomplishing anything. “Oh cool, that A List Apart article went up,” I would think, then move on with my day.

Once I realized that this was happening, I decided to be deliberate about marking achievements by eating one donut. Well, sometimes more than one, if it’s a really big deal. The act of donut-eating has actually helped me feel like I’m accomplishing my career goals. As I started to share this idea with more people, I found that it resonated with others, especially young career-driven women who are routinely achieving goals and furthering their career but don’t take the time to note their own success.

I decided to start celebrating in a public way so that more people may be inspired to find their own ways of marking their career achievements. These are those donuts.

I’ve decided to use chocolate chip cookies instead.


Cookie 13/2/17 to celebrate giving the keynote at PyCon Colombia

AI in the Enterprise (the problem)


I was recently chatting to a friend who works as a Data Science consultant in the London area, and a topic dear to my heart came up: how to successfully do ‘AI’ (or Data Science) in the enterprise. Now I work for an Enterprise SaaS company in the recruitment space, so I’ve got a certain amount of professional interest in doing this successfully.

My aim in this essay is to outline what the problem is, and provide some solutions.

Firstly it’s worth reflecting on the changes we’ve seen in consumer apps – Spotify, Google, Amazon, etc. All of these apps deliver personalised experiences that are enhanced by machine learning techniques, drawing on the labelled data that consumers provide.

I’ll quote what Daniel Tunkelang (formerly of LinkedIn) said about the challenges of doing this in the enterprise:

First, most enterprise data still lives in silos, whereas the intelligence comes from joining across data sets. Second, the enterprise suffers from weak signals — there’s little in the way of the labels or behavioral data that consumer application developers take for granted. Third, there’s an incentive problem: everyone promotes data reuse and knowledge sharing, but most organizations don’t reward it

I’ve personally seen this when working with enterprises and as a consultant. The data is often very noisy, and while there are techniques to overcome that, such as ‘distant supervision‘, it does make things harder than, say, building ad-tech models in the consumer space or customer churn models, where the problem is more explicitly solvable by supervised techniques.
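To make the ‘distant supervision’ idea a little more concrete, here is a toy sketch (the seed skills and documents below are entirely made up for illustration): you take a small set of entities you already trust – a tiny knowledge base of known skills, say – and use it to auto-label unlabelled text, accepting that the resulting labels will be noisy.

# A toy sketch of distant supervision: use a small trusted "knowledge base"
# to generate (noisy) labels for unlabelled text. Everything here is hypothetical.
known_skills = {"python", "sql", "machine learning"}

documents = [
    "Contractor with 5 years of Python and SQL experience",
    "Project manager, strong stakeholder communication",
]

def distant_label(text, seeds):
    # Label a document 1 if it mentions any seed skill, else 0.
    # Deliberately crude -- the point is that the labels come for free but are noisy.
    lowered = text.lower()
    return int(any(skill in lowered for skill in seeds))

for doc in documents:
    print(distant_label(doc, known_skills), "-", doc)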

In my experience, and the experience of others, enterprises are much more likely to try to buy in off-the-shelf solutions, but (to be sweepingly general) they still don’t have the expertise to understand/validate/train the models. There are often individuals in small teams here and there who’ve self-taught or done some formal education, but they’re not supported. (My friend Martin Goodson highlights this here.) There needs to be a cultural shift. At a startup you might have a CTO who’s willing to trust a bunch of relatively young data science chaps to try to figure out an ML-based solution that does something useful for the company without breaking anything. It’s also worth highlighting that there’s a difference in risk aversion between enterprises (with established practices etc.) and the more exploratory, R and D mindset of a startup.

The somewhat more experienced of us these days tend to have a reasonable idea of what can be done, what’s feasible, and furthermore how to convince the CEO that it’s doing something useful for the company’s valuation.

Startups are far more willing to give things a go – there’s an existential threat. And let’s not forget that Venture Capitalists and the assorted machinery often expect Artificial Intelligence, which encourages this.

Increasingly, I speculate, established companies now outsource their R and D to startups – hence recent acquisitions like the one by GE Digital.

So I see, roughly speaking, two solutions to this problem – two ways to de-risk data science projects in the enterprise.

1) Build it as an internal consultancy with two goals: identifying problems which can be solved with data solutions, and exposing other departments to new cultural thinking and approaches. I know of one large retailer who implemented this by doing 13-week agile projects: they’d do a consultation, then choose one team to build a solution for.

2) Start putting staff through training schemes similar to what is offered by General Assembly (there are others), but do it whole teams at a time, so that the culture of code review and programmatic analysis comes back and gets implemented at work. Similarly, give the team managers additional training in agile project management etc.

The first can have varied success – you need the right problems, and the right internal customers – and the second I’ve never seen implemented.

I’d love to hear some of the solutions you have seen. I’d be glad to chat about this.

Acknowledgements: I’d like to thank the following people for their conversations: John Sandall, Martin Goodson, Eddie Bell, Ian Ozsvald and Mick Delaney – and apologies to anyone else I’ve forgotten.


2016: In Review


I’m mostly writing this for me, but maybe it will be interesting to you too! Here are some things that happened in 2016 (just to me personally, mostly about programming). This is based on the excellent post by Julia Evans.

Open Source

I continued being involved in PyMC3. This has taught me a lot about programming and the challenges of shipping software. The code reviews by Thomas Wiecki and the others have been amazing.

I helped pick the new logo and worked on PyMC3 becoming a fiscally sponsored project of NumFOCUS. For those of you who don’t know, NumFOCUS is an organisation that supports open source projects, the conferences associated with them, and diversity in open source. It largely focuses on the Python ecosystem but has branched out to other projects.

Learning about this has taught me a lot about the governance aspects of OSS – and our responsibilities to encourage more people into this ecosystem. I consider that an important part of my duties as a member of the Open Source world.

Talks

  • I spoke at PyData London about statistics with Python.
  • I spoke at the Toulouse Data Science Meetup about the PyData ecosystem.
  • I keynoted at PyData Amsterdam, speaking about the current PyData ecosystem and what various tools like Dask, NumPy and Numba are all for.
  • I gave a talk at the Bayesian Mixer in London on the state of PyMC3, and spoke a bit about the new tools in Variational Inference, which has been a research topic of mine for the past year. I wish I had time to finally write some slides on that.

Doing the PyData keynote was kind of exciting/scary (me??? keynote???) and I think it turned out well and I’m happy I did it. I love the PyData community and I’m happy with the talk I gave.

It’s been fun to experience some of the other places that are doing Data Science and forming communities. At each of these events I’ve met a lot of cool people. It’s great to see our industry grow up!

In 2017 I’ll be keynoting at PyCon Colombia in February. I’m excited to give this talk. I also want to go to a conference like NIPS/KDD/ICLR/ICML to stay a bit closer to the improvements coming out of the Machine Learning world in academia and industry.

cool: Writing for Hakka Labs

  • I was honoured to be featured on Hakka Labs, who run the excellent Data Eng Conference and publish some awesome content on their blog. I wrote about Three Things I Learned About Machine Learning – an ongoing journey in which I keep realising how little I know.

cool: Blog

Some of my favourite posts this year have been:

  • A map of the PyData Stack – This was an idea I’d floated with Thomas Wiecki before. I finally got around to doing it for my keynote; the aim was to give people a ‘map of the PyData stack’ and explain what the different tools are for.
  • I interviewed one of my heroes – Greg Linden, who helped devise the first collaborative filtering algorithm in production at Amazon.
  • I did some other interviews – I liked this one with Masaaki Horikoshi, one of the most prolific contributors to the PyData ecosystem.

I’ll continue to do some interviews over the next year, and hopefully add them to a revised book.

cool: moving to London

I moved to London in late March. I’ve found it very exciting to be close to the Machine Learning and Data Science communities there. It was a hectic few months adjusting to new job(s); however, I’m glad I made the move.

I think everyone should spend some time in a major city when they’re young.

I hope to blog a bit more about work in the next few months.

cool: Teaching Data Science

My friend John Sandall  mentioned a Teaching Assistant gig at General Assembly.

I helped about 20 students, from various backgrounds, learn more about Data Science. Sharing my own experiences reminded me that 1) I knew stuff and 2) teaching is hard.

I recommend teaching to all Data Scientists and Engineers if they get the time. It’s a great experience, and I learned a lot about what is easy and what is hard in Machine Learning.

conclusions?

some things that worked:

  • asking a lot of questions about how computers work (not a surprise)
  • working on a team of people who know more stuff than me, and listening to what they have to say
  • asking for advice from people who are more experienced than me.
  • at work, figuring out what’s important to do and then doing the work to get it done, especially if that work is boring / tedious
  • working on one thing at a time (or at least not too many things)
  • getting a bit better at software “process” things like design documents and project plans
  • learning how to mentor junior data scientists – this is something I’m continuing to do
  • learning more about leading teams in ML – which is hard. I probably won’t be doing too much people stuff over the next few months.

Interview with a Data Scientist: Greg Linden

I caught up with Greg Linden via email recently.
Greg was one of the first people to work on data science in Industry – he invented the item-to-item collaborative filtering algorithm at Amazon.com in the late 90s.
I’ll quote his bio from LinkedIn:
“Much of my past work was in artificial intelligence, personalization, recommendations, search, and advertising. Over the years, I have worked at Amazon, Google, and Microsoft, founded and run my own startups, and advised several other startups, some of which were acquired. I invented the now widely used item-to-item collaborative filtering algorithm, contributed to many patents and academic publications, and have been quoted often in books and in the press. I have an MS in Computer Science from University of Washington and an MBA from Stanford.”

Greg Linden (source: personal website)

1. What project have you worked on do you wish you could go back to, and do better?
All of them! There’s always more to do, more improvements to make, another thing to try. Every time you build anything, you learn what you could do to make it better next time.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
Learn to code. Computers are a tool, and coding is the way to get the most out of that tool. If you can code, you can do things in your field that others cannot. Coding is a major force multiplier. It makes you more powerful.

3. What do you wish you knew earlier about being a data scientist?
I was doing what is now called data science at Amazon.com in 1997. The term wasn’t even coined until 2008 (by Jeff Hammerbacher and DJ Patil). It’s hard to be much earlier. As for what I wish, I mostly wish I had the powerful tools we have now back then; today is a wonderland of data, tools, and computation. It’s a great time to be a data scientist.

4. How do you respond when you hear the phrase ‘big data’?
I usually think of Peter Norvig talking about the unreasonable effectiveness of data and Michele Banko and Eric Brill finding that more data beat better algorithms in their 2001 paper. Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent.

5. What is the most exciting thing about your field?
I very much enjoy looking at huge amounts of data that no one has looked at yet. Being one of only a few to explore a previously unmined new source of information is very fun. Low hanging fruit galore! It’s also fraught with peril, as you’re the first to find all the problems in the data as well.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
Data problems should be iterative. Start simple. Solve a small problem. Explore the data. Then solve a harder problem. Then a harder one. Each time you take a step, you’ll get ideas on where to go next, and you also get something out at each step. Too many people start trying to solve the entire problem at the beginning, flailing for a long time, usually to discover that it was the wrong problem to solve when they finally struggle to completion. Start with easier problems, learn where to go, and you might be surprised by all the goodies you find along the way.

What happens when you import modules in Python



I’ve been using Python for a number of years now – but, like most things, I didn’t really understand what happens when you import a module until I investigated it.

Firstly, let’s introduce what a module is: it is one of Python’s main abstraction layers, and probably the most natural one.

Abstraction layers allow a programmer to separate code into parts that hold related data and functionality.

In Python you use ‘import’ statements to use modules.

Importing modules

The `import modu` statement will look for the definition of modu in a file called `modu.py` in the same directory as the caller, if a file with that name exists.
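As a purely hypothetical example, suppose a file called modu.py sits next to the script that imports it (the function and constant below are just illustrative):

# modu.py -- a hypothetical module living next to the caller
def greet(name):
    return "Hello, " + name

PI = 3.14159

Then, in the calling script in the same directory:

# caller.py
import modu                   # finds modu.py next to this file
print(modu.greet("world"))    # the function is reached through modu's namespace
print(modu.PI)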

If it is not found, the Python interpreter will search for modu.py in `Python’s search path`.

The Python search path can be inspected really easily:

>>> import sys
>>> sys.path

Here is mine for a conda env.

['', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pymc3-3.0rc1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpydoc-0.6.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/nbsphinx-0.2.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Sphinx-1.5a1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/recommonmark-0.4.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/CommonMark-0.5.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/tqdm-4.8.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/joblib-0.10.3.dev0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Theano-0.8.2-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpy-1.11.2rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/imagesize-0.7.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/alabaster-0.7.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Babel-2.3.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/snowballstemmer-1.2.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python35.zip', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/plat-darwin', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/lib-dynload', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg']
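And if you need the interpreter to find modules somewhere non-standard, the search path can be extended at runtime (the directory below is just a placeholder, not a real path):

import sys
sys.path.append('/path/to/extra/modules')   # placeholder directory containing extra modules
import modu                                  # now found even though it isn't next to the caller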

What is a namespace?

We say that the module’s variables, functions, and classes will be available to the caller through the module’s `namespace`, a central concept in programming that is particularly helpful and powerful in Python. Namespaces provide a scope containing named attributes that are visible to each other but not directly accessible outside of the namespace.
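A quick way to see a namespace in action, using the standard library’s math module:

import math

print(math.pi)             # attributes are reached through the module's namespace
print(dir(math)[:5])       # dir() lists the names defined in that namespace

from math import sqrt      # pull a single name into the caller's own namespace
print(sqrt(2))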

So there you have it: an explanation of what happens when you import a module, and of what a namespace is.

This is based on The Hitchhiker’s Guide to Python, which is well worth a read 🙂