Sexism in Tech conferences

Writing about sexism in tech conferences is hard, especially as a young white male. I can only speak anecdotally, but most women in the tech industry I speak to mention moments of subtle sexism or sometimes out-and-out harassment. As a member of the tech community I’m completely behind any promotion of minorities in the industry, and feel that more can be done. It is interesting that most men I speak to in the industry don’t notice any problem.

Two articles spring to mind:

http://womeninastronomy.blogspot.com/2014/11/its-not-about-that-damn-shirt.html

This was written about STEM but I feel the same rules apply to the Tech community (especially since I personally straddle both communities).

It’s “not a big deal” when someone tells you he came to your talk because you’re attractive.
It’s “not a big deal” when a coworker comments on your appearance.
It’s “not a big deal” when someone makes a “joke” at work demeaning women.
It’s “not a big deal” when you are asked in a job interview if you have or are planning to have kids.
It’s “not a big deal” that you were warned about what professor to avoid basically as soon as you got to school.
It’s “not a big deal” that that same professor was untouchable by the administration because he was too famous.
It’s “not a big deal” when someone assumes you are your own secretary on the phone.
It’s “not a big deal” when someone calls you “Miss” and your male colleague “Doctor.”
It’s “not a big deal” when going to parties at a conference comes with warnings of which of your fellow scientists are dangerous.
It’s “not a big deal” when your boss, adviser, or senior colleague asks you out.
All of this stuff IS a big deal. One of the things I hear about the tech industry, partly because of the passive aggression that hackers sometimes adopt, is that as a community we need to grow up and become more professional AND inclusive. I agree wholeheartedly with this and applaud the conferences that encourage more female participation and more female speakers. Diversity is a good thing and I think it makes us smarter :).
The other link I saw was http://adainitiative.org/2012/08/defcon-why-conference-harassment-matters/ about Defcon, a famous security conference. I found the following paragraph to be very powerful.
When you say, “Women shouldn’t go to DEFCON if they don’t like it,” you are saying that women shouldn’t have all of the opportunities that come with attending DEFCON: jobs, education, networking, book contracts, speaking opportunities – or else should be willing to undergo sexual harassment and assault to get access to them. Is that really what you believe?
I am glad things are getting better, but there are still a number of actions that we can all take. I think this is a subproblem of the larger problem that Pete Warden commented about. I consider his article to be self-recommending: http://petewarden.com/2014/10/05/why-nerd-culture-must-die/
Comments are welcome. The articles I linked to contain some excellent resources on how to come up with and enforce policies regarding harassment, which is a legal issue. Lots of us like to avoid legal issues like this, but an advantage of policies and ‘processes’ is that they are transparent and fair. Some of us consider these things to be too formal, but as I get older I see that some of these ‘formalities’ that we have in corporations and other organizations are useful and save a lot of hassle.

Book Review: Analytics in a Big Data World

So this is a quick review of a book that ended up in my mailbox a few months ago.

Firstly the good: this is a good academic introduction to a variety of techniques, all in one reference book. I particularly liked the discussion of process mining and survival analysis, as I feel these are techniques often neglected in the discussion of data science. I know that the author of the Lifelines library, Cameron Davidson-Pilon, has done some screencasts on Data Origami about survival analysis and its applications to, say, customer churn modelling, but this is the first time I have seen it in a book aimed at data scientists.

I believe that Bart is an expert in risk modelling, so there is a lot of discussion of financial services applications. This is fine and a good addition to the literature on data science, since a lot of the literature is focused on machine learning applications for social networking websites or the e-commerce sector. This focus on financial services may be due to the fact that Bart is based in Europe as opposed to the Bay Area.

An interesting addition to the data science literature is in his applications chapter, which includes Business Process Analytics. As someone who has worked on some Business Process Mining, I’ve not seen many references to this field in the literature, and certainly none in book form, so this is a worthy addition.

The bad: The print of the book is terrible and the paper is a colour that makes reading it extremely difficult. I also felt that the typeface for the mathematical equations was hard to read. This may not be Dr Baesens’s fault. I felt that some of the material was not new to me, but this is fine: I’ve probably got more experience in this sector than the target audience, which seems to be Masters or PhD students in STEM subjects who are close to finishing and are considering a career in Data Science.

I would also like more discussion on how to present your ideas to clients but I guess this is for a separate book or a book on ‘Creating Data Products’.

Nevertheless I would recommend the book to any MSc or PhD students interested in a career in data science, and to any analysts like myself who want a good reference for Survival Analysis and Process Mining. I think those chapters and subchapters make this a worthy addition to my own library. I also think that the discussion of risk modelling and customer churn modelling is excellent, as this is a bottom-up approach from the mathematical models and data processing to how a model could be produced and evaluated. Together with, say, a good Coursera course this could be an excellent preparation for interviews for Data Science roles.

Disclaimer: Dr Bart Baesens sent me a copy of this book for review but I have no stake in its success.

Business Analytics versus Data Science

When you work in the IT industry you often realize that a big challenge of getting to grips with the industry is learning the memes and buzzwords. One of the difficulties I came across recently was understanding the difference between Business Analytics and Data Science.

I won’t define data science here – primarily because it has no formal definition. Yet I will describe some of the differences that came up when I reached out to some luminaries on Social Media. 

I asked the question ‘What is the difference between Business Analytics and Data Science?’ These are some of the answers I got.

Andrew Clegg (@andrew_clegg): more fundamentally: testing hypotheses in controlled experiments (a/b etc) rather than just making changes and reporting results

reporting vs predicting/recommending/explaining maybe?

Andrew Clegg is a Director of Data Science at Pearson Education, and has academic experience in Natural Language Processing and a strong Engineering background. He clearly still thinks somewhat like a scientist, and this is obvious from his own blog and his own presentations. I think sometimes though people bash ‘reporting’ because it has become such a bane of our professional existence. I live in the Financial Centre that is Luxembourg, and regularly hear my friends complain about the ‘reports’ they have to produce for regulatory compliance. This can be very descriptive work – but I still think that a good ‘data product’ can be in report form. 

 

context :-) age and industry of practitioner, start-up vs. enterprise, etc. most of the time it’s a semantic difference 

John loves to cut through the bullshit – he has experience as an Analytics Consultant and now leads a team at Mailchimp. If I were willing to move to Atlanta I would probably pester him for a job interview :p I think he has a good point here about the semantic difference. At Amazon I met and worked with business analysts who were effectively doing data science, building predictive models of the changes in price elasticity of certain goods. That is not to say that they never had to build reports; a lot of their day-to-day work was ad-hoc analysis, which often means ad-hoc scripting to count something.

very similar. Typically biz analytics don’t build models. But they should. Too often biz analytics is report & slice/dice data

And the wonderful JD Long, whom I interviewed for this blog before (see the Data Artisan interview), makes a good point: Biz Analytics should often build models rather than just report and slice/dice data. Personally, I think this is where the scientific aspect comes in, and this is a corollary of one of Andrew Clegg’s comments above.

I am often surprised by how decisions are made in business – one of the best parts of being a consultant is learning how important culture is. I sometimes hear the job title ‘data strategy consultant’, which strikes me as a data scientist working with companies to create a data-driven culture. Culture is a hard thing to change in an organization even if you are a member of the C-suite. I will probably comment on that when I am old and wise enough to understand it.

So what do you do if you want to work on your predictive modelling skills?

Well, I can recommend one good set of lecture notes (a PDF) and one good book.

Applied Data Science Notes

Applied Predictive Modelling

Both are good hands-on introductions to the applied part of Data Science.

Which is thankfully more exciting than mere reporting.

P.S. Sometimes data scientists have to build the infrastructure themselves. I think that a data scientist should be capable of building a data mart for themselves, since one often doesn’t have the software engineering support to do that.

On the Education debate

Education reform is a politically sensitive issue.
Yet a few articles I came across recently made me think about the issue.

“As descriptions, both arguments—accountability and autonomy—contain a measure of truth. Teachers do lack some of the freedom they need to teach well, and they also lack adequate feedback. But as prescriptions, actual suggestions for how to improve teaching, the arguments fail. Neither change, on its own, will produce better teachers. Basic math makes the problem with accountability clear: Discard the bottom 10 percent and, as Obama said, that’s thirty thousand teachers who will need to be replaced. And that’s just in California. Nationally, the number is more than ten times that. Autonomy, meanwhile, is an experiment that many schools have tried for years, and still seen teachers struggle.”

Another article that talks about these ‘free market reflexes’ is the following http://baselinescenario.com/2013/12/11/free-market-reflexes/#more-10763

“That just doesn’t follow. And anyone who’s worked in an actual company should realize that. Yes, it’s always better to have better workers. One way to get better workers is to hire more effective people and to fire less effective people. But the other way—which, in most industries, is by far more important—is to make your current workforce more effective. You do that in part by figuring out what attributes or processes make people more effective, and in part by training people and implementing processes in ways that improve productivity.”

I think the myth of the ‘naturally born teacher’ leads to such logical absurdities as those above. Talent and skill need cultivation, and we rarely hear about the need for such improvements in Education. In software engineering – and I work for a company in that sector – there is feedback from other experts via, say, ‘code review’. But how much feedback is given to new teachers by their peers? Feedback from exam results is not necessarily correlated with good teaching skill.

Data Scientist is the hot new job title

All of us involved in Tech know that ‘Data Scientist’ is the hot new job title.
I recently saw an infodeck from Martin Fowler.

I reproduce part of it here.

“Data Scientist” will soon be the most over-hyped job title in our industry. Lots of people will attach it to their resumé in the hopes of better positions but despite the hype, there is a genuine skill set:

    The ability to explore questions and formulate them as hypotheses that can be tested with statistics.
    Business knowledge, consulting, and collaboration skills
    Understanding machine-learning techniques.
    Programming ability enough to implement the various models they are working with.

Although most data scientists will be comfortable using specialized tools, all this is much more than knowing how to use R. The understanding of when to use models is usually more important than being able to use them, as is how to avoid probabilistic illusions and overfitting.

Data Science as a Process

Hilary Mason, one of the shining lights of the world of data science, tweeted recently:

‘Data people: What is the very first thing you do when you get your hands on a new data set?’ 

‘What I do when I get a new dataset’, a recent article on the Simply Statistics blog, is a response to this.

I’ve been thinking about my own data science process. My academic background is in Physics and Mathematics, so I am influenced by those disciplines. This is a personal blog post, just to document my own Data Science Process. 

0) Try to understand the data set: I must admit there have been projects where I forgot to do this. I’ve been so eager to apply ‘Cool algorithm X’ or ‘Cool model Y’ to a problem that I’ve forgotten that Exploratory Data Analysis, which always strikes me as low tech, is extremely valuable. Now I love this period. I plot graphs, I look for trends, I clean out outliers. I try to figure out what sort of distribution the data follows. I try to figure out anything interesting or novel. This often highlights duplicates, or possible duplicates. And I find myself coming back to this period.

So I have three heuristics to follow which I find work:

  • Talk to the Business Analysts or domain experts: Never be so arrogant as a data scientist that you think that you can ignore an experienced Business Analyst. When I was at Amazon I learned countless things from chatting to Supply Chain Analysts or Financial Analysts. Some of these were pieces of business logic, or answers to ‘what does that acronym mean’, and some involved an explanation of a process that was new to me. Remember that data in lots of companies is designed for operations, not data analysis.
  • Import the data into a database: Building a database doesn’t take too long, and often you are given data in some sort of ugly csv or log format. Don’t underestimate the value of running a few group by and count statements to see if the data is corrupt.
  • Examine the heads and tails. In R this is sapply(df, class); head(df); tail(df). You can do similar things in Python with Pandas. This generally gives you a chance to look at outliers or NA’s.
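As a rough illustration of the second and third heuristics, here is a minimal sketch that uses only Python's built-in csv and sqlite3 modules. The table name `orders`, its columns, and the sample rows are all made up for the example; a real export would be read from a file rather than from a string.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file on disk.
RAW_CSV = """order_id,country,amount
1,LU,10.5
2,LU,7.0
2,LU,7.0
3,DE,
4,FR,3.2
"""


def load_into_sqlite(csv_text):
    """Load CSV text into an in-memory SQLite table called `orders`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)"
    )
    rows = [
        # Empty amounts are stored as NULL so they can be counted later.
        (int(r["order_id"]), r["country"],
         float(r["amount"]) if r["amount"] else None)
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


conn = load_into_sqlite(RAW_CSV)

# Row counts per group: a quick look at how the data is distributed.
rows_per_country = dict(
    conn.execute("SELECT country, COUNT(*) FROM orders GROUP BY country")
)

# Keys that appear more than once are possible duplicates.
duplicate_ids = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    )
]

# Missing values show up as NULL.
missing_amounts = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
).fetchone()[0]

# A poor man's head() and tail(): first and last rows in insertion order.
head = conn.execute("SELECT * FROM orders ORDER BY rowid LIMIT 2").fetchall()
tail = conn.execute(
    "SELECT * FROM orders ORDER BY rowid DESC LIMIT 2"
).fetchall()

print(rows_per_country, duplicate_ids, missing_amounts)
print(head, tail)
```

None of this is sophisticated, which is the point: a handful of counts is often enough to spot a duplicated key or a column full of missing values before any modelling starts.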

1) Learn about what kind of elephant you are trying to eat.

One of my old mentors at Amazon told me that all problems at Amazon are huge elephants. At a dynamic and exciting company like that, he was right. So we used to talk about ‘eating the elephant, piece by piece’. I would add a corollary: you need to know what kind of ‘elephant’ you are dealing with. For example, what data sources are you dealing with, and what is the business process? Are there quirks to this business process that you’ll need a domain-specific expert to help you understand?
Here chatting to business analysts helps as well. Unless I know the domain really well, I find this period takes some time. Sometimes you’ll have stakeholders who are impatient about this period. But you must be frank with them that this period is needed, and that it is valuable for the later analysis. And if you aren’t allowed to do this, you can just update your LinkedIn profile and find another job :)

I consider the outlier checks and reading column headings to be part of this as well.

You can learn a lot just by documenting what the column headers are.

The final part of this step is to generate a question about the data. Write this down in a document, with the steps you’re taking. It doesn’t have to be scientific but this is a good way to keep you focused on the business questions.

2) Clean the data 

I don’t necessarily enjoy this part. It is painful but necessary, and involves a lot of tedious and sometimes annoying work.

  • Look for encoding issues
  • Test boring hypotheses. For example, if you have telecommunication data, check that num_of_foreign_calls <= total_number_of_calls
  • Sanity checks
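A ‘boring hypothesis’ like the one above can be written as a small rule and run over every row. This is only a sketch in plain Python: the two column names come from the example in the list, while the `customer` field, the sample records, and the helper names are invented for illustration.

```python
# Hypothetical call records; the two call-count columns follow the
# example in the text, everything else is made up.
records = [
    {"customer": "a", "num_of_foreign_calls": 3, "total_number_of_calls": 10},
    {"customer": "b", "num_of_foreign_calls": 12, "total_number_of_calls": 9},
    {"customer": "c", "num_of_foreign_calls": 0, "total_number_of_calls": 0},
]


def find_violations(rows, rule):
    """Return the rows for which the sanity rule does not hold."""
    return [row for row in rows if not rule(row)]


def foreign_calls_within_total(row):
    """Foreign calls are a subset of all calls, so they cannot exceed them."""
    return row["num_of_foreign_calls"] <= row["total_number_of_calls"]


violations = find_violations(records, foreign_calls_within_total)
for row in violations:
    print("suspicious row:", row)
```

Keeping each rule as its own small function makes it cheap to add more checks as you learn the business logic, and the list of violating rows is a useful thing to take back to the domain experts.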

3) Plot the data

Now I plot the data and try to understand it. Ggplot and Matplotlib are your friends here – until they aren’t. I am currently investing time in learning about the visualization libraries in Python and R. I suggest you do the same. Yet histograms and simple plots work best here. The purpose of plotting data is to lead to insight. The sexy stuff can come later. 

Scatter plots, density plots, histograms, etc.

Here is an example of a scatter plot from a logistic regression model.

[Figure: Logistic Regression]

4) Answer the simple hypothesis you came up with in step 1, such as ‘correlation between X and revenue’

Now you should know enough about the data to do an analysis. It could be a simple machine learning model or a regression model.
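For a hypothesis of the form ‘correlation between X and revenue’, one simple starting point (my choice for this sketch, not the only option) is the Pearson correlation coefficient. The version below is written out by hand in plain Python so that each step is visible; the numbers are invented.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations of some variable X and the matching revenue.
x = [1, 2, 3, 4, 5]
revenue = [12.0, 15.0, 15.5, 21.0, 24.0]

r = pearson_correlation(x, revenue)
print(f"correlation between X and revenue: {r:.2f}")
```

A value close to 1 or -1 suggests a strong linear relationship and a value near 0 suggests little linear relationship. It says nothing about causation, so treat it as a first answer to the question from step 1 rather than a conclusion.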