How do we deliver Data Science in the Enterprise


I’ve worked on Data Science projects and delivered Machine Learning models, both as production code and as more research-oriented work, at a few companies now. Some of these companies were at the Seed or Series A stage, and some are established companies listed on stock exchanges. The aim of this article is simply to share what I’ve learned; I don’t claim to know everything. My intended audience is managers and technical specialists who have just started working in the corporate world, perhaps after some years in academia or at a startup. I want to articulate some of the problems, propose some solutions, and highlight the importance of culture in enabling data science.

As a practitioner, I’ve been reflecting over the years on why some of this ‘big data’ stuff is hard to do. The take I present here is similar to other commentary on the internet, so it won’t be unusual.

My views are inspired by Matt Turck’s article at http://mattturck.com/2016/02/01/big-data-landscape/ in which he says:

Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes. You need to capture data, store data, clean data, query data, analyse data, visualise data. Some of this will be done by products, and some of it will be done by humans. Everything needs to be integrated seamlessly. Ultimately, for all of this to work, the entire company, starting from senior management, needs to commit to building a data-driven culture, where Big Data is not “a” thing, but “the” thing.

When I talk about our nascent profession with friends working at other companies, we often end up discussing ‘change management’. Change is very hard, particularly for established companies that are not digital natives: companies that don’t produce e-commerce websites, social networks or search engines. These companies often have legacy infrastructure and don’t necessarily have technical product managers or technical cultures. Traditional Business Intelligence systems also work quite well for them: reporting is done correctly, and it’s hard to make a case for machine learning in risk-averse environments like that.


One weird tip to improve the success of Data Science projects


I was recently speaking to some data science friends on Slack, and we were discussing projects and war stories. Something that came across was that ‘data science’ projects aren’t always successful.


Somewhere around this discussion a lightbulb went off in my head about some of the problems we have with embarking on data science projects. There’s a certain amount of cargo-cult data science, and so collectively we as a community of business people, technologists and executives don’t think deeply enough about the risks and opportunities of projects.

So I had my lightbulb moment and now I share it with everyone.

The one weird trick is to write down risks before embarking on a project.

Here are some questions you should ask before you start a project, preferably after gathering all the data.

  • What happens if we don’t do this project? What is the worst-case scenario?
  • What legal, ethical or reputational risks are involved if we successfully deliver results with this project?
  • What engineering risks are there in the project? Is it possible this could turn into a two-year engineering project as opposed to a quick win?
  • What data risks are there? What kinds of data do we have, and what are we not sure we have? What risks are there in terms of privacy, law and ethics?

I’ve found that gathering stakeholders together helps a lot with this. You hear different perspectives, and that can help you figure out what the key risks in your project are. I’ve found for instance in the past that ‘lack of data’ killed certain projects. It’s good to clarify that before you spend 3 months on a project.

Try this out and let me know how it works for you! Share your stories with me at myfullname[at]google[dot]com.

 

 

Avoiding being a ‘trophy’ data scientist


Recently I’ve been speaking to a number of data scientists about the challenges of adding value to companies. This isn’t an argument that data science doesn’t have positive ROI, but that organisations need to understand that data science is a ‘team sport’ and need a certain maturity to take advantage of these skills.


Trophies are nice to look at but they don’t drive business decisions.

The biggest anti-pattern I’ve experienced personally as an individual contributor has been a lack of ‘leadership’ for data science. I’ve seen organisations without the budgetary support, the right champions or clear alignment of data science with their organisational goals.

The following is an opinionated, non-exhaustive list of some of the anti-patterns I’ve seen.

  • I’ve written before about data strategy. I still think this is one of the things that’s most lacking in organisations. A useful distinction is that data collection needs to happen before data analysis, and that it needs to happen in accordance with the strategy of the company.

Solution: Organisations should map their data science projects to the key business concerns of the organisation. This will help shape how resources are allocated.

  • There needs to be an understanding of what kind of leadership you need for a data science team. This needs to be someone with hands-on experience of doing data science. This is not someone familiar with ‘analytics’ or ‘reporting systems’ and ‘delivery’. It is someone familiar with things like ‘probabilistic programming’, ‘neural networks’ and ‘A/B tests’. So don’t put an ‘analytics leader’ in charge of a team of data scientists.

Solution: Executives – feel free to reach out to me to discuss data strategy, I’ll gladly point you in the right direction.

  • You need Business intelligence not data science – there’s nothing wrong with reporting, or building analytics systems, but it’s not data science. Be honest about what your organisation needs.

Solution: Ask clarifying questions when interviewing about why the organisation needs data science versus other things.


Interview with a Data Scientist – Ian Wong of OpenDoor

I interviewed the interesting and fascinating Ian Wong – he’s the technical co-founder of OpenDoor, which I personally think is amazing as a concept!
1. What project have you worked on do you wish you could go back to, and do better?
Pretty much any project I’ve worked on in the past 😃 Two projects stick out though.
a. Uniform API for Data Fetching
When we fit a model, call it y = f(X), the pair (X, y) is often taken for granted to be well-formed. How do you design a service that generates consistent (X, y)? Turns out this is not straightforward, and is specific to the domain and the data-capture systems.
The ideal solution would satisfy both batch & real-time needs, make it easy to ship new features to production, and enable rapid prototyping. While I scraped together the initial version at Opendoor, our amazing team of data engineers has really taken it to the next level. We hope to report our findings soon.
b. Interactive Visualizations of ML Algorithms
A few years ago, I tried putting together a visualizer for random forests using d3. I had the tremendous fortune of working with Mike Bostock for a bit, and was inspired by his ability to make abstract concepts tangible through interactive visualizations. At the time, I was working with these big sets of random forests, and wanted to get a better feel for the model outputs. So I rendered hundreds of decision trees on screen, where when you hovered over one node, all other nodes belonging to the same feature across different trees would be highlighted. It was pretty neat! But the prototype suffered from performance issues plus my own technical incompetence.
More broadly, I’m really excited about better interactive tools for ML algorithms because they’ll help us deeply understand them as tools. During my undergrad studies in electrical engineering, we used to play with circuits a lot. In the labs, you could make a change to the input or to the circuitry, and see the corresponding change in output on the oscilloscope instantaneously. When we run a simple regression, wouldn’t it be great to get immediate feedback on the fitted line if we were to drag, add, delete a data point?
I’m hopeful we’ll see more innovation in the read-eval-print loop for data science.
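As a rough illustration of that feedback loop (a sketch that is not from the interview; `fit_line` and the sample points are hypothetical), the snippet below refits a simple least-squares line after one extra point is added, which is the computation an interactive tool would repeat on every drag, add or delete:

```python
from statistics import mean


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up points lying exactly on y = x.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
before = fit_line(points)

# "Add" a single outlier and refit to see how far the line moves.
after = fit_line(points + [(4.0, 0.0)])

print("before (slope, intercept):", before)
print("after  (slope, intercept):", after)
```

Wiring `fit_line` to a plot that redraws on each edit would give the oscilloscope-style feedback described above; the refit itself is cheap enough to run on every interaction.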
2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?
My view may be slightly biased as I’m a PhD drop-out 😉 In grad school, I became increasingly frustrated at the divergence between what’s interesting and what’s impactful. But that’s a whole separate conversation.
For folks looking to enter industry, nothing replaces hands-on practice. I would strongly encourage students to look for internships, participate in Kaggle competitions / Google Summer of Code, seek open source projects to contribute to. If you’re in school, take a wide variety of classes, especially computer science and project-based courses.
The higher order bit here is that the industry faces a different, evolving set of challenges than academia. The focus is typically on solving a business problem.
Here are additional pointers depending on the reader’s interest.
Business Analytics & Decision Science
  • The grammar of graphics and tidydata lay a great foundation for reasoning about data. I personally learned more by working through Hadley’s ggplot2 book than from many of my stats classes at Stanford. 
  • Develop business acumen and communication skills. Sometimes academics prefer to stay in the rarefied air of theory and mathematics. Success in the analytics profession requires the ability to (a) meet hard business challenges head on, (b) break them down into smaller, quantifiable sub-problems, (c) analyze them rapidly, (d) present findings in a way that the audience can engage with, and (e) take feedback and iterate.
Machine Learning & Engineering
  • Code, code, code. As I mentioned in Doing Data Science: ML is founded in math, expressed in code, and assembled into software. Being able to build robust software systems is becoming more important, as tools and algorithms are increasingly available.
  • While a strong grasp of theory will help narrow design choices, nothing beats rapidly exploring hypotheses. This demands coding proficiency, which from experience is a differentiating trait of highly productive data scientists.
Also, don’t let your field tie you down! Beware of the sunk cost fallacy. Though PhDs may have invested years studying a certain field, the techniques investigated through a graduate program may not be transferable to a new domain. The most important quality of the PhD is persistence in doing research. Remember, it’s re-search: search and search again. That’s what defines a great problem solver.
3. What do you wish you knew earlier about being a data scientist?
There are so many! Here are a few.
How to build great predictive services
While we spend a lot of energy in grad school studying techniques, advanced techniques often yield only incremental lift over a simple solution (and in many cases come with complexity that becomes a heavy tax). I think the big focus on modeling techniques contributes to the phenomenon of solutions chasing problems, rather than solutions being designed from the needs of the problem. Here’s a rule of thumb that I’ve come to adopt: “You know that algorithm that all the papers make fun of in their intro? Implement that and forget the rest of the paper.”
Perhaps influenced by schooling, we as data scientists often dream about having these flashes of brilliance that identify a proof! QED! In practice, what delivers results is an error-focused, iterative process of continuous model improvement (see my talk here). It’s the unglamorous engineering & detective work of starting with the biggest outliers of the model, and reasoning from first principles to eliminate them. Model debugger describes the role better than data scientist. It’s about the toil.
Forming a perspective based on incomplete data
Intellectual honesty, scientific doubt and a healthy dose of paranoia are generally great things to have. But beware of analysis-paralysis and failure to put a stake in the ground. Decisions need to be made in a timely fashion. In many cases we’re operating with 80% information (if lucky!), and your teammates are counting on you for a recommendation.
Earlier in my career, I would be reluctant to form and articulate a strong perspective. This was partly due to skepticism inculcated through school, and partly because it didn’t seem like it was my job as a data scientist to do so (more on titles being a constraint later). Making an actual policy recommendation seems so messy relative to the clean code and beautiful plots staring at me on the monitor. But I’ve since learned that this is an abdication of responsibility. Our job is to help the company make data-driven decisions, which means thinking through the implications of an analysis, consulting stakeholders, and coming up with a point of view.
Communication as craft
The value of an analysis is measured by whether it influences decisions. Even the most brilliant analysis becomes ineffective if not delivered to the audience in an accessible manner. This Jeff Atwood blog post explains the concept well.
Other things!
There are many other things I wish I knew earlier. How do I pick up software engineering skills? What does a great data scientist look like? How do I progress to become better? How do I foster effective debate and engagement of my work? I’ll omit them for now since this is getting long…
4. How do you respond when you hear the phrase ‘big data’? What about ‘AI’?
On Big Data
There’s big data, and there’s Big Data. If you’re referring to the latter, I think it’s a bit passé at this point (with some exceptions).
Turns out that more data beats better algorithms most of the time. As an industry we have worked really hard to make count(x) group by y scale to terabytes of data. But as alluded to earlier, the tools and infrastructure are increasingly commoditized. We’re ready to move onto higher parts of the application stack vs. focusing on the base layer. (e.g., Opendoor! See also the vertical AI piece by Bradford Cross.)
On AI
Turns out machines are tireless and can count much more reliably than human beings. This has implications as we enter the age of abundant data. This can get philosophical quick! But there are both benefits and hazards we’ll need to navigate.
5. What is the most exciting thing about your field?
As technology and education improve and become more accessible, there’ll be an increased supply of data science and machine learning talent. These individuals will become the next generation of builders and leaders. Algorithmic sophistication is going to seep into all parts of our daily lives. The products they create are going to be smarter, easier to use and more personal (Opendoor being an example 😃).
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?
As Alan Kay once said: “A change in perspective is worth 80 IQ points.” Framing a problem well is probably the most important part of the solution.
Within the context of building predictive services, defining the objective function with a clear metric that’s ideally back-testable is half the battle. It provides a foundation for the rest of the work, which entails applying the simplest approach, iterating until convergence to some threshold that’s set by business needs. The art comes in how to define the ML problem in a way that aligns with business outcome (ideally tracing all the way through to the top- or bottom-line).
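To make the idea of a back-testable metric concrete, here is a minimal sketch (not from the interview; the figures, the `threshold` value and the function names are all hypothetical). It walks forward through a series, scores two very simple forecasters by mean absolute error, and compares each against a “good enough” threshold that would in practice come from the business need:

```python
def backtest_mae(series, forecast, min_history=3):
    """Walk forward through the series, forecasting each point from its past only."""
    errors = []
    for t in range(min_history, len(series)):
        prediction = forecast(series[:t])
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


def last_value(history):
    """Simplest possible approach: tomorrow looks like today."""
    return history[-1]


def moving_average(history, window=3):
    """Slightly smoother alternative: average of the most recent values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Hypothetical weekly figures, made up for illustration.
weekly_sales = [100, 102, 101, 105, 107, 106, 110, 112, 111, 115]
threshold = 3.0  # hypothetical "good enough" error agreed with stakeholders

for name, model in [("last value", last_value), ("moving average", moving_average)]:
    mae = backtest_mae(weekly_sales, model)
    print(f"{name}: MAE={mae:.2f}, good enough: {mae <= threshold}")
```

Because each forecast only sees data before time t, the same metric can be recomputed for every new candidate model, which gives the iteration loop a fixed target.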
In terms of when is good enough, that depends on the need. Running a company is kind of like trying to solve an NP-hard resource optimization problem. We have to be rigorous about ROI for each initiative that we spend energy on.
In terms of managing expectations, it’s hard. It folds into longer term project and team planning. What is the business problem we’re trying to solve? What does success mean? Where do we need to be today, a quarter from now, a year from now? Who are the stakeholders? How should we provide updates and receive feedback?
7. You’ve spoken about people not needing to be constrained by titles. Could you expand a bit on that? What sort of skills should someone with ML skills be learning in your opinion? What have you learned working at Opendoor?
On Being Boxed in by Titles and Venn Diagrams
Titles should enable, not constrain. We are all problem solvers first. A title acknowledges that an individual is skilled in a certain area. But one shouldn’t let that define their boundaries. When misused, titles could be an escape hatch to avoid doing the things that matter. For instance, a misinformed data scientist may think of “productionizing their insight” as unnecessary implementation detail, while a misinformed software engineer may think of defining data quality SLA for predictive systems as esoteric. In practice, there’s a metric to move, a question to be answered. Titles endow neither immunity nor magical problem-solving powers. What matters is clarifying the job to be done.
In a PhD program, there’s a tendency to put blinders on and focus on one problem, specified by one professor in one department. In industry, solutions tend to be multi-disciplinary. A lot of what we do as data scientists is to take human intuitions and generalize them, seeing which withstand the backtest or an experiment. And to do this well, we need to be open to new ideas and continuously develop new skills. As alluded to earlier, some of the key ones are (a) business intelligence and (b) software engineering (including frontend!).
But I would be remiss not to mention the following:
  • Scrappy + Pragmatic + Business Acumen > Technical Expertise
The best data scientists are relentlessly resourceful and impact- / solution-oriented. Mindset shifts from “I need to gain skill x” to “I am going to solve problem y”; from “not my job” to “run towards where the impact is.”
Opendoor
It’s been an incredible journey thus far at Opendoor. We are on a mission to empower everyone with the freedom to move by building a seamless, end-to-end customer experience that makes buying and selling a home stress-free and instant. The experience of growing our team, scaling up as a leader and servicing thousands of customers has been really rewarding. It’s the perfect blend of crazy-hard technical challenges and creating positive impact in people’s lives.
We’re only getting started! If any of this seems exciting, check out http://opendoor.com/jobs, or email me at ian@opendoor.com.
These are great questions and I had a fun time (partially) answering them!
Bio
Ian is working on modernizing the residential real-estate industry. He is the technical co-founder of Opendoor, where he leads their engineering and data science team. He previously built fraud detection systems at Square as their first data scientist, and products at Prismatic. Ian received his BS in EE (Electrical Engineering), MS in EE and MS in Statistics from Stanford before dropping out of the PhD.

Interview with a Data Scientist: Mick Cooney


I’m delighted to feature my friend Mick Cooney here as an interviewee. Mick has many years of experience in Finance and more recently in Insurance. He co-ran the Dublin R meetup, which was very successful and helped foster a data science community in Dublin. More recently he’s been working in London at an actuarial consultancy, building out a data science practice.

q1. What project have you worked on do you wish you could go back to,
and do better?

I started my career as a quant in a small startup hedge fund. We
developed time series models to forecast short-term volatility in
equities and equity indices as part of an option trading strategy. It
is a fascinating topic and I still dabble in it. Thinking back on the
work done, I would re-engineer large portions of it. I made a ton of
mistakes on both the modelling and implementation side, and the R
language in particular has progressed in strides since I did the bulk
of the work.

For example, the system automatically generates PDF reports of the
forecasts but it does so by hand-creating LaTeX files that are then
compiled into PDF. One of the first things I would do is switch all
that over to use either ‘knitr’ or ‘rmarkdown’. I would also use more
‘reproducible research’ concepts.

That said, I had worked on the modeling for a long time, so I am
content with the basic model. There are many things still to
investigate or implement.

On the modeling side, I worked on a persistency model using survival
analysis, which is how I learned about the subject in the first
place. As a result, there are a lot of different things I would love
to return to and do differently. In retrospect, I was too quick to
move past the simpler models. We could see the assumptions were not
consistent with the data, and so did not fully explore simpler
approaches. I am now curious to learn what insights those simpler
approaches would yield.

Customer churn is such a universal problem I expect I will be working
on it again in the near future. Hopefully I can apply those lessons
then.

***
q2. What advice do you have to younger analytics professionals and in
particular PhD students in the Sciences?

I think the key advice I would give is the same for everyone – never
stop learning. This may be the availability heuristic at play with me,
but I have never seen a connection between qualifications and analyst
quality. All the good analysts I know have curiosity and
initiative. Academic achievements do not come into it at all.

Initiative manifests in many ways. First, when they encounter a
problem they learn what they need to do and get on with it. Second,
much of their knowledge is self-taught. Finally, and I believe most
importantly, they have an inherent curiosity – the best analysts I
know engage in the field in their own time, mainly because they want
to.

This brings up a related issue I have been pondering for some time. I
am ambitious. I want to be a top data scientist some day. I have no
academic ambition whatsoever, but my goal is to be able to hold my own
in any conversation with anyone in the field.

How do I achieve this? What do I need to do to get to that point?

While probably not as keen as the average fan, I love sport – soccer,
the NFL and Gaelic Football in particular. For anyone who has met me
in person, comparing me to a top athlete seems preposterous, but
there is a lot to be learned from top athletes if you want to excel
at your chosen field. Look at how they prepare and train. These
principles almost certainly apply to other professions too, but it is
more fun to talk about sport. 🙂

When I read about Lionel Messi, Tom Brady or Colm Cooper (for our
non-Irish readers the recently-retired ‘Gooch’ is arguably the
greatest GAA player to ever play the game – he was majestic to watch),
the one thing that always stands out for me is their fanatical
devotion to their chosen career, not their obvious talent. All their
team-mates mention how hard they worked despite their abundance of
natural advantages. Players with huge natural talent often coast, but
elite players are the opposite – they work as hard as the fringe
players slogging to just survive the cut.

In our field, we need to work constantly on improving – going to
Meetups, reading about new techniques, watching videos on YouTube and
looking to strengthen areas where you are weak. This is why a natural
interest and curiosity is so invaluable – it makes these necessary
tasks much less of a burden as they are things you would want to do
anyway.

Secondly, top players do the simple things well, almost never making a
mistake. They are fallible of course, and make mistakes, but almost
never on the basics. They are rigorous about practicing the basic
skills and principles, and that is why they are so good. The bread and
butter of their craft is second-nature to them.

This is why I focus so much on basic statistics classes and reread and
re-watch the books and lectures I find useful. I want these things to
be second nature and they are not.

Probability and statistics are so counter-intuitive that I almost
never get things right on gut feeling. I am almost always wrong. So
much so that I gave a talk about probabilistic graphical models about
a year ago and during the questions at the end made an off-hand joke
about going with the opposite of my intuition.

It was said in jest at the time but is sadly true!

One final piece of advice is to help as many people as you can. Help
people with their homework, with some programming, with their computer
problems and with data problems. You get exposed to all sorts of
topics and problems, most of which you will see again in your
career. You also get the added bonus of people thinking you are
selfless and altruistic, despite being self-serving in reality!

***
q3. What do you wish you knew earlier about being a data scientist?

I have two main things I wish I had learned early on in my career, and
both are connected philosophically. First, I wish I had learned about
probabilistic thinking, risk management, economics and statistics –
you can never learn enough about these fundamental topics. Secondly, I
wish I had learned it is okay to start working with a bad model that
you know is wrong but simple.

To that first point, I spent a long time fighting my natural desire
for a clean, elegant and correct answer to a problem. I would work on
a problem, get to a point that I was confident pointed us in the right
direction, but then realise that ‘proving’ this was right involved a
huge amount of time and effort, assuming it was possible.

I attributed my natural reluctance to pursue this ‘answer’ to
laziness, and felt guilty. I felt I was being unprofessional and
sloppy. But working on forecasting models for trading taught me that
this was not the case. Models are so imperfect, with so many
compromises, that it is often better to think about other things first
– what are the limitations of the model in practice, what is it
saying, how are you going to use it. Answer those questions first,
THEN worry about improving it.

This is why I always start with simple, stupid, wrong models. They are
quick to produce, they help you learn a lot about what you are doing,
they fail in spectacular ways and they are sometimes all you need. In
terms of costs and benefits, they are hard to beat.

***
q4. How do you respond when you hear the phrase ‘big data’?

I hate it. It has become a meaningless buzzword used as a means of
making sales.

My attitude to the term is best summarised by the interview you had
with Hadley Wickham: there are three categories of data size,
in-memory, on-disk and finally the truly ‘big data’ problems like
recommender systems. I believe the majority of problems can be solved
by appropriate sampling of your data down to a manageable size and
then analysing those subsets.

After all, the whole point of statistics is to make inferences about a
population from a sample of the data.
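
A minimal sketch of that idea follows (not from the interview; the synthetic
population, the sample size and the `estimate_mean` helper are all
assumptions made for illustration). It estimates a mean from a 1% simple
random sample and reports a standard error, so the sampled answer can be
compared with the full-data answer:

```python
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and its standard error.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    sample_mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return sample_mean, std_error


# Synthetic "large" dataset standing in for real data.
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(200_000)]

est, se = estimate_mean(population, sample_size=2_000)
print(f"sample estimate: {est:.2f} +/- {2 * se:.2f}")
print(f"full-data mean:  {statistics.fmean(population):.2f}")
```

Whether a sample is "appropriate" depends on the question: rare events or
heavy-tailed data may need stratified or much larger samples than this
simple random draw.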

Once decided on a solution, putting the model into production and
scaling it for your business is a major issue, but is a problem more
belonging to the realm of network and software engineering. That said,
it is important that people with a solid understanding of the
concepts stay involved, just in case some ‘optimisations’ ruin the
output.

***
q5. What is the most exciting thing about your field?

Robert McNamara in ‘The Fog of War’ mentioned that you should never
answer the question asked but instead answer the question you wanted
to be asked, so with your forbearance I will first answer a liberal
interpretation of that question: what work gets me excited?

The short answer to that question is all sorts of things do, but they
are often small things related to work I am doing. In the last few
months, I was excited to try out dataexpks (a data exploration package
I am co-creating) on a brand new data set to see what it showed me and
how well my code worked. I love thinking of ways to use Monte Carlo
simulation to test the output of various regression models, and over
Christmas I was fascinated by a short project trying out methods of
investigating differences between a subpopulation within a larger
population.
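
One simple form such a Monte Carlo check can take is sketched below (an
illustration under stated assumptions, not the method or code referred to
above; `fit_slope`, `simulate_slopes` and all parameter values are
hypothetical). Data are simulated repeatedly from a known linear
relationship, a least-squares slope is fitted each time, and the fitted
slopes are checked for whether they centre on the true value:

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def simulate_slopes(true_slope, noise_sd, n_points=50, n_sims=1000, seed=1):
    """Fit the slope on many datasets simulated from a known model."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(n_points)]
    slopes = []
    for _ in range(n_sims):
        ys = [true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


slopes = simulate_slopes(true_slope=2.0, noise_sd=1.0)
print(f"mean fitted slope: {statistics.fmean(slopes):.3f}")
print(f"spread (sd):       {statistics.stdev(slopes):.3f}")
```

If the fitted slopes did not cluster around the value used to generate the
data, that would point to a bug in the fitting code or a violated
assumption, which is the kind of thing this sort of simulation is good at
exposing.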

I am fascinated by new ways to learn the fundamentals – there are a
few excellent ones out there and I read them all the time. I can never
learn enough as in my experience reality tends to present us with
basic statistical problems in new and unusual ways.

Having multiple perspectives and multiple approaches is invaluable in
those situations.

Regarding your original question as I think you intended, I think the
advances in reinforcement learning techniques probably have the
biggest potential – some of the Atari gameplaying from DeepMind was
eye-opening. Sadly, if history is any guide, much of it will prove to
be hype, but I imagine some very interesting results will come from the
work.

***
q6. How do you go about framing a data problem – in particular, how do
you avoid spending too long, how do you manage expectations etc. How
do you know what is good enough?

Framing a data problem is a tough one to answer – I am not sure what I
do or how to articulate it. I have had the good fortune to help a lot
of people with their projects and problems, exposing me to a wide
variety of problems. I learned something from all of them and I rely
on that a lot.

I also read a lot of blogs, articles and subscribe to mailing
lists. While rarely having the time to read all this, often all you
need to get started on a problem is a vague memory of some technical
topic that may help and some terminology to Google.

As a result, the first thing I focus on is understanding the problem:
what is being asked? Do we have any data? What does it look like?
Are there other data available we can use to enrich or use as a
substitute?

Going through that process will suggest approaches to use, and at that
point I draw upon previous experience, however tangential to the
problem.

By keeping this focus, your other questions are straightforward to
answer: if the current model is not likely to improve the answer by an
amount relevant to the goal, it is not worth spending more time
on. Similarly, knowing what is needed will tell you if your current
model is good enough, or often if there is a model that is good enough
– it is possible the level of accuracy required is not feasible.

In the latter case, discovering that early is much better than later –
you know not to waste time, money and resources on a lost cause.

***
q7. You’ve spoken before about the ‘need for apprenticeships’ in Data
Science. Do you have any suggestions on what that would involve? Are
meetups and coaching a good first start?

To explain the point I was making on that note, I think there is a lot
of implicit knowledge in this field, and I have been told a number of
times by people looking for help that they feel overwhelmed by the
sheer amount of knowledge they think they need.

I do not think this is true, but I understand its origin: there are so
many different aspects to working with data it is tough to know where
to start. I always start very simple, but as I mentioned earlier, it
took a lot of time, thought and effort to get to that point, and it is
not easy to explain these ideas in theory – you have to work on a
number of different datasets to get a feel for how to do this.

As a result, I believe an approach such as mentoring or
apprenticeships is an effective way to teach people – more
experienced analysts can guide junior members around the various
pitfalls and traps that are easy to fall into. It allows us to
illustrate that fancy and sophisticated techniques and algorithms are
not needed to do interesting work – some of the most interesting work
I have seen involved little more than summary statistics along with
basic models like linear regression and decision trees.

This is hard to learn from a book – almost impossible. The closest
book I read that talks about this is “Data Analysis Using Regression
and Multilevel/Hierarchical Models” by Gelman and Hill, stressing the
importance of starting from simple models. I would love to know if
there are more.

That said, I could only appreciate the point because I was already
experienced; a younger version of myself would have missed the
point. It would not have occurred to me that the right way to do
something is to do the simple and obvious thing.

I am a firm believer in the KISS principle. Keep It Simple, Stupid.

Building Full-Stack Vertical Data Products

I’ve been in the Data Science space for a number of years now. I first got interested in AI/Machine Learning in 2009 and have a background typical of a number of people in my field – I come from Physics and Mathematics.

One trend I’ve run into both at Corporates and Startups is that there are many challenges to deploying Data Science in a bureaucratic organisation – or delivering Enterprise Intelligence. Running into this problem led me to be interested in building data products.

One of the first people I saw building AI startups was Bradford Cross – and he’s been writing lately about his predictions for 2017 in the Machine Learning startup space.

I agree with his precis that we’ll begin to see successful vertically-oriented AI startups solving full-stack industry problems that require subject matter expertise, unique data, and a product that uses AI to deliver its core value proposition.

At Elevate Direct we’re working on exactly this kind of problem: sourcing and hiring contractors. Hiring the best contractor talent out there is one of the fundamental problems that companies have.

So what are some of the reasons that it can be hard to deploy Data Science internally at a corporate organisation? I think a number of the patterns are related to other patterns we see in terms of software.

  1. Not being capable of building consumer facing software – Large (non-tech) organisations sometimes struggle to build and deliver software internally – I’ve seen a number of organisations fail to do this – their build process can be 6 months.
  2. Organisational anti-patterns – I’ve seen some organisations that inhibit the ability to deploy product rapidly. Some of these anti-patterns are driven by concerns about the risk of deploying software, and they often end up with diffuse ownership, where an R and D team can blame the operations team and vice versa.
  3. Building Data Products is risky – Building data products is hard and risky, and I think you really need to approach them in a lean-startup kind of way: deploy often; if it works, it works; if not, cut it. Sometimes the middle management of large corporates is risk-averse and so finds these kinds of projects scary. It also needs a lot of expertise: subject-matter expertise, software expertise and machine learning expertise.
  4. Not allowing talented technical practitioners to use Open Source/ pick the tools – I once worked at a FTSE 100 company where it took me about 6 weeks to be able to install Open Source software tools such as R and Python. It severely restricted my productivity; in that time at a startup, my team probably deployed about 1000 changes into production on a customer-facing app. This reminds me of the number 3 here. Don’t restrict the ability of your talented and well-trained people to deliver value. It makes no sense from a business point of view. Data Science produces value only when it produces products or insights for the business or the customers.
  5. Not having a Data Strategy – Data Science is most valuable when it aligns with the business strategy. Too often I’ve seen companies hiring data scientists before they have actual problems for them to work on. I’ve written about this before.
  6. Long term outsourcing deals – This is an insidious one, and one that came from a period of time when “IT didn’t matter”, before big Tech companies proved the value of, for example, e-commerce in the consumer space. It’s impossible to predict what will be the key tech for the next 10 years, so don’t lock yourself to a vendor for that period of time. Luckily this trend is reversing – we’re seeing the rise of agile, MVP, cloud computing, design thinking, getting closer to the customer. A great article on this re-shoring is here.

I think fundamentally a lot of these anti-patterns come from not knowing how to handle risk correctly. I like the idea in that RedMonk article that big outsourcing is a bit like CDOs in finance. Bundling the risk into one big lump doesn’t make the risk go away.

I learn this day after day working on building data products and tools at Elevate. Being honest about the risks and working hard to de-risk projects and drive down that risk in an agile way is the best we can do.

Finally, I think we’re just getting started building Data Products and deploying data science. It’ll be interesting to see what other anti-patterns emerge as we grow up as an industry. This is also one of the reasons I’ve joined a startup and why I’m very excited to work on an end-to-end Data Product that solves a real business problem.

What happens when you import modules in Python

I’ve been using Python for a number of years now, but like most things, I didn’t really understand imports until I investigated them.

Firstly, let’s introduce what a module is. Modules are one of Python’s main abstraction layers, and probably the most natural one.

Abstraction layers allow a programmer to separate code into
parts that hold related data and functionality.

In Python you use ‘import’ statements to use modules.

Importing modules

The `import modu` statement will look for the definition
of modu in a file called `modu.py` in the same directory as the caller,
if a file with that name exists.

If it is not found, the Python interpreter will search for `modu.py` in Python’s search path, and raise an ImportError if it still cannot be found.

Python’s search path can be inspected really easily:

`>>> import sys`
`>>> sys.path`

Here is mine for a conda env.

['', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pymc3-3.0rc1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpydoc-0.6.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/nbsphinx-0.2.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Sphinx-1.5a1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/recommonmark-0.4.0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/CommonMark-0.5.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/tqdm-4.8.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/joblib-0.10.3.dev0-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Theano-0.8.2-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/numpy-1.11.2rc1-py3.5-macosx-10.6-x86_64.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/imagesize-0.7.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/alabaster-0.7.9-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/Babel-2.3.4-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/snowballstemmer-1.2.1-py3.5.egg', '/Users/peadarcoyle/anaconda/envs/py3/lib/python35.zip', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/plat-darwin', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/lib-dynload', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages', '/Users/peadarcoyle/anaconda/envs/py3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg']
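
To see the lookup described above in action, here is a minimal sketch. It assumes nothing called `modu` is already installed; the temporary directory and the `answer` variable are made up purely for illustration. It writes a tiny `modu.py` somewhere Python does not normally look, shows that the import fails, and then shows that it succeeds once that directory is added to `sys.path`.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory containing a tiny module named modu.py.
# (Both the directory and the module contents are illustrative only.)
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "modu.py"), "w") as f:
    f.write("answer = 42\n")

# The directory is not on the search path yet, so the import fails.
try:
    importlib.import_module("modu")
    found_before = True
except ImportError:
    found_before = False

# Add the directory to the search path and try again.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()  # make sure the import system rescans the path
modu = importlib.import_module("modu")

print(found_before)    # False, assuming no other modu exists on the path
print(modu.answer)     # 42
print(modu.__file__)   # path of the modu.py file that was actually loaded
```

Printing `modu.__file__` is a handy way to check which file an import really picked up when two directories on the search path contain a module with the same name.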

What is a namespace?

We say that the module’s variables, functions, and classes will be available
to the caller through the module’s `namespace`, a central concept in programming that
is particularly helpful and powerful in Python. Namespaces provide a scope containing
named attributes that are visible to each other but not directly accessible outside of the namespace.
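
As a small illustration of the idea, here is a sketch that uses the standard library’s `math` module as a stand-in for `modu` (that choice is mine, not part of the original explanation). It shows that a module’s names are reached through the module object, that they are not visible as bare names in the caller, and that `from ... import ...` binds a name into the caller’s own namespace.

```python
import math

# Names defined by the module live in the module's namespace and are
# reached through the module object.
print(math.sqrt(16))   # 4.0

# The same name is not available as a bare name in the caller.
try:
    sqrt(16)
    bare_name_works = True
except NameError:
    bare_name_works = False
print(bare_name_works)  # False

# The namespace itself is a dictionary attached to the module.
namespace = vars(math)
print("sqrt" in namespace)             # True
print(namespace["sqrt"] is math.sqrt)  # True

# "from ... import ..." binds one name into the caller's own namespace.
from math import sqrt
print(sqrt(16))        # 4.0
```

Keeping the `math.` prefix makes it obvious where a name comes from, which is the main practical benefit of namespaces when a script imports many modules.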

So there you have it: an explanation of what happens when you import, and what a namespace is.

This is based on the Hitchhiker’s Guide to Python, which is well worth a read 🙂