# A short email from Marvin Minsky – RIP


As a data scientist I regularly use results based upon the work of Marvin Minsky.

This is an email exchange I had with him about 6 years ago, when I was working in education and deciding to go back to graduate school.

On Mon, Jun 21, 2010 at 10:53 AM, Peadar Coyle wrote:

Hi Marvin,
I shan’t bore you with how much of an inspiration and role model your work has been for me.
I’m a Mathematics and Physics Graduate student, with an interest in all sorts of problems.
I am particularly writing in regards your OLPC memos, I found them terribly interesting and important especially in regards the Linguistic desert in Mathematics.
I’ve taught Maths in High Schools, and do find that the richness of the subject is destroyed. ‘The National Curriculum’ is held up as some sort of Biblical text and subsequently many students leave without a sense of what a researcher does, nor that Mathematics is a beautiful art form in itself.
Another aspect: although I had the privilege to attend outstanding schools (Fieldston to 8th grade, and then Bronx Science and Andover) – I don’t recall having had the idea (until college) that it was still possible to invent new mathematics. (I did know that there still was progress in Physics, Chemistry and Biology – but didn’t have the clear idea that Mathematics was still Alive!)
I used to be taunted as a teenager for wanting to use words like ‘non-linear’ or ‘negative feedback’. This can be discouraging even for ambitious students like myself. I feel that things haven’t got much better. Seymour Papert was correct that we teach quadratic formulas due to technological constraints. Frank Quinn (a topologist) has written a book (on his website) about mathematics education and computers. With demonstrations and Mathematica and visualizations, there is no reason that students can’t learn something about Dynamics and Moments of Inertia. Yes, some of the integrals are terribly difficult – I even struggle with some of the algebra – but with facilities like Wolfram Alpha one can learn to check one’s work, and not be hindered by such algebraic manipulations.
I haven’t actually used it much, but it surely will be exciting to see what happens when it gets combined with systems (that don’t yet exist) which exploit large collections of common-sense knowledge.
Gian Carlo Rota pointed out that it is not enough to be computer literate, one should be computer literate squared.
Did you know Mr Rota? I believe he was at MIT as well.
Yes, Rota was a long-time friend.

I provide this without commentary, to just share how great it is that some of the most inspiring people in my world of Artificial Intelligence and Mathematics have responded to emails.
This link is to his Obituary.

# Interview with a Data Scientist: Ivana Balazevic


Ivana Balazevic is a Data Scientist at the Berkeley-based startup Wise.io, where she works in a small team of data scientists solving customer service problems for different clients. She did her bachelor’s degree in Computer Science at the Faculty of Electrical Engineering and Computing in Zagreb, and she recently finished her master’s degree in Computer Science, with a focus on Machine Learning, at the Technical University Berlin.

1. What do you think about ‘big data’?

I try not to think about it that much, although nowadays that’s quite hard to avoid. 🙂 It’s definitely an overused term, a buzzword.

I think that adding more and more data can certainly be helpful up to a point, but the outcome of the majority of the problems that people are trying to solve depends primarily on the feature engineering process, i.e. on extracting the necessary information from the data and deciding which features to create. I’m certain there are problems out there which require large amounts of data, but they are definitely not common enough for the whole world to obsess about.

2. What is the hardest thing for you to learn about data science?

I would say the hardest things are those which can’t be learned at school, but which you gain through experience. Coming out of school and having worked mostly on toy datasets, you are rarely prepared for the messiness of real-world data. It takes time to learn how to deal with it, how to clean it up, select the important pieces of information, and transform this information into good features. Although that can be quite challenging, it is a core part of the creative process in data science and one of the things that make data science so interesting.

3. What advice do you have for graduate students in the sciences who wish to become Data Scientists?

I don’t know if I’m qualified enough to give such advice, being a recent graduate myself, but I’ll try to write down things that I learned from my own experience.

Invest time in your math and statistics courses, because you’re going to need them. Take on a side project, which might give you a chance to learn some new programming concepts and introduce you to interesting datasets. Do your homework and don’t be afraid to ask questions whenever you don’t understand something in the lecture, since the best time to learn the basics is now and it’s much harder to fill those holes in knowledge later than to learn everything the right way from the beginning.

4. What project would you go back to and change? How would you change it?

Most of them! I often catch myself looking back at a project I did a couple of years ago and wishing I knew then what I know now. The most recent project is my master’s thesis; I wish I had tried out some things I didn’t have time for, but I hope I’ll manage to find some time to work on it further in the next couple of months.

5. How do you go about scoping a data science project?

Usually when I’m faced with a new dataset, I get very excited about it and can’t wait to dig into it, which gets in the way of all the planning that should have been done beforehand. I hope I’ll manage to become more patient about it with time and learn to do it the “right” way.

One of the things that I find a bit limiting about the industry is that you often have to decide whether something is worth the effort of trying it out, since there are always certain deadlines you need to hold on to. Therefore, it is very important to have a clear final goal right from the beginning. However, one needs to be flexible and take into account that things at the end user’s side might change along the way and be prepared to adapt to the user’s needs accordingly.

6. What do you wish you knew earlier about being a data scientist?

That you don’t spend all of your time doing the fun stuff! A lot of a data scientist’s work goes into getting the data, putting it into the right format, cleaning it up, battling different encoding issues, writing tests for the code you wrote, etc. When you sum everything up, you spend only a part of your time doing the actual “data science magic”.

7. What is the most exciting thing you’ve been working on lately?

We are a small team of data scientists at Wise who are working on many interesting projects. I am mostly involved with the natural language processing tasks, since that is the field I’m planning to do my PhD in starting this fall. My most recent project is on expanding the customer service support to multilingual datasets, which can be quite challenging considering the highly skewed language distribution (80% English, 20% all other languages) in the majority of datasets we are dealing with.

8. How do you manage learning the ‘soft’ skills and the ‘hard’ skills? Any tips?

Learning the hard skills requires a lot of time, patience, and persistence, and I highly doubt there is a golden formula for it. You just have to read a lot of books and papers, talk to people that are smarter and/or have more experience than you and be patient, because it will all pay off.

Soft skills, on the other hand, somehow come naturally to me. I’m quite an open person and I’ve never had problems talking to people. However, if you do have problems with it, I suggest you take a deep breath, try to relax, focus and tell yourself that the people you are dealing with are just humans like you, with their good and bad days, their strengths and imperfections. I believe that picturing things this way takes a lot of pressure off your chest and gives you the opportunity to think much more clearly.

# Interview with a Data Scientist Tool Developer

I interviewed one of the core members of the Pandas Python library, Masaaki Horikoshi (sinhrks). I was really happy to interview him, and glad to show that data science and software development are really global things 🙂 I lightly edited his answers at his request because English is not his native language.
My Biography:
I work as a data analyst in a Japanese company. I mostly use Python and R in my work.
Because I don’t disclose project details from my job publicly, allow me to answer as a tool developer. I contribute to some open source software such as pandas (a Python package for data analysis) in my private time; see https://github.com/sinhrks

1. What project have you worked on that you wish you could go back to, and do better?
I’ve learned a lot from the projects I’ve worked on, so I expect I could do better in most of them today. That’s because the most difficult part of a project is to clarify what the problem actually is, and for the previous ones I already know what it was, at least to some extent :)

2. What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD, so my point may be basic, and the requirements depend on what you’re working on.

I think it is a good learning experience to read the source code of popular OSS related to statistics / machine learning. I sometimes find that I can’t understand a subject just by reading a textbook. Reading source code and confirming each step sometimes reveals my misunderstandings. It can also improve your programming skills, because the software is mostly written in optimized and sophisticated ways.

3. What do you wish you knew earlier about being a data scientist/ data tool developer?
That communities are really important. It was only after I started attending some programming language conferences that I could meet a lot of skilled people in a broad range of fields, and communicating with them gives me a lot of knowledge in the fields I’m not familiar with. Also, feedback from tool users helps me to understand their needs and raises my motivation.

4. How do you respond when you hear the phrase ‘big data’?

I believe most of today’s companies have a lot of data. But whether we actually need all of it depends on the problem. Using ‘big data’ without any specific objective looks unprofitable.

Technically, I’m interested in processing and visualizing such data, and I use tools like Spark.

The popularity of data science and related programming languages (R and Python). I see many interesting news items and blog posts about data science almost every day, and small conferences are held a few times a month. It is a good opportunity to join the field. And we need more people; there is a lot of work to do!

6. How do you go about framing a software engineering problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I feel this is the most difficult question. The important thing is to clarify the target and goal first.
Then we can decide on a measurable indicator and consider an executable action / implementation. During discussions with end users, we can return to the target and goal once they are agreed, and judge whether the result is “good enough”.

7. You’re involved with some open source projects, can you comment on how important you feel these are and also what exciting new things you’ve worked on?
OSS is important for fulfilling my daily requirements; beyond that, it is a great place where we can learn more and give back. I appreciate all the users and great contributors whom I’ve got to work with!

Regards,

Masaaki Horikoshi (sinhrks)

# What I’ve been working on – late 2015 and early 2016


I find it useful for morale just to write up what I’ve been working on and what I’ve learned over the last few months.

PyMC3: Bayesian Logistic Regression and Model Selection – I wrote an example of how to use the Deviance Information Criterion for model selection in a Bayesian logistic regression. This example includes quite a few plots and visualisations in Seaborn.

Rugby Analytics: A Hierarchical Model of the Six Nations 2015 in PyMC3. This is based on the work I showcased at my talks; I finally got it into the PyMC3 Examples directory.

Comparison of Fibonacci functions – This is a classic interview question but I was interested in putting together an example comparing different data structures in Python. In particular this was a good exercise to make sure I understood lazy evaluation.

Hamiltonian Monte Carlo – I wrote up some notes on the Hamiltonian Monte-Carlo algorithm. This is used a lot in PyMC3 but I hadn’t gone through the theory before. The piece isn’t original but I thought it was worth putting on my blog.

Deep Learning – I wrote a short post based on a day’s work getting Deep Learning to work on AWS. My advice is: don’t re-invent the wheel, and be aware that some of the Nvidia drivers are incredibly difficult to install. I was finally able to get a GPU speedup and reproduce some examples from Tensorflow.

The Setup – I interviewed myself with my own version of the ‘Setup’, a noted website. This is just me talking about what tools I use, both software and hardware. I found it useful to think about how my tools affect my thought processes and creativity, so I recommend you do it too 🙂

Hacking InsideAirBnB – I was using AirBnB over the last few months, so I thought it would be good to look for examples of data sources. This isn’t a very complete Machine Learning project but I put it here anyway. I might fix it up and add some more feature extraction, visualisation and PCA/SVD type tools to this.

Image Similarity Database – I haven’t had the chance to work with image data much professionally. So when I came across this from my friend Thomas Hunger I forced myself to reproduce it. I used Zalando image data in this example.

Three Things I wish I learned earlier about Machine Learning – I first got interested in Machine Learning in 2009 when I was interning in Shanghai. I think the only notable work I did back then was using Matlab to run some simple clustering algorithms for customer segmentation. I don’t claim several years of professional data science or Machine Learning experience, but I’m not a complete neophyte, and this article is just about what I’ve learned. I republished it on Medium too, so pick whichever version you prefer.

Dataconomy – I interviewed Kevin Hillstrom, a consultant in Analytics. He discussed the need for accuracy and business acumen, which certainly applies to Data Analytics.

What does Big Data have to do with the Food Industry – I wrote a non-technical article on the opportunities for Data Science in the Food industry. This was the first time my commentary was featured on IrishTechNews.

There’ll be more stuff from me soon.

# A quick comparison of Fibonacci functions

A really common interview question is about Fibonacci functions.

In this short gist I compare two simple ways of doing this. This is mainly to highlight the power of generators. For those of us who want to see a good video on this – I recommend this video by the talented, charming James Powell.
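The gist itself is not reproduced here, so the following is only a minimal sketch of the kind of comparison described, not the original code. The function names `fib_list` and `fib_gen` are my own assumptions. The first builds the whole sequence eagerly in a list; the second is a generator that produces values lazily, one at a time, and only does work when a value is requested.

```python
from itertools import islice


def fib_list(n):
    """Eagerly build and return a list of the first n Fibonacci numbers."""
    result = []
    a, b = 0, 1
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result


def fib_gen():
    """Lazily yield Fibonacci numbers one at a time, without end."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# The eager version needs to know n up front and stores every value.
first_ten_eager = fib_list(10)

# The lazy version is unbounded; islice takes only as many values as needed.
first_ten_lazy = list(islice(fib_gen(), 10))

print(first_ten_eager)
print(first_ten_lazy)
```

Because the generator never materialises the full sequence, the caller decides how many values to consume, which is the essence of lazy evaluation.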

# What is a Hamiltonian Monte-Carlo method?


(Editor Note: These notes are not my own text; they are copied from MacKay and the Astrophysics source below. They’ll be gradually edited over the next few months. I provide them because writing them up was useful for me.)

Hamiltonian Monte Carlo is a Metropolis method, applicable to continuous state spaces, that makes use of gradient information to reduce random walk behaviour.
These notes are based on a combination of http://python4mpia.github.io/index.html and David MacKay’s book on inference and information theory.

For many systems whose probability $P(x)$ can be written in the form
$P(x) = \frac{e^{-E(x)}}{Z}$
not only $-E(x)$ but also its gradient with respect to $x$
can be readily evaluated. It seems wasteful to use a simple random-walk
Metropolis method when this gradient is available: the gradient indicates
which direction one should go in to find states that have higher probability!

Overview
In the Hamiltonian Monte Carlo method, the state space x is augmented by
momentum variables p, and there is an alternation of two types of proposal.
The first proposal randomizes the momentum variable, leaving the state x unchanged. The second proposal changes both x and p using simulated Hamiltonian dynamics as defined by the Hamiltonian
$H(\mathbf{x}, \mathbf{p}) = E(\mathbf{x}) + K(\mathbf{p})$,
where $K(\mathbf{p})$ is a ‘kinetic energy’ such as $K(\mathbf{p}) = \mathbf{p}^{\dagger}\mathbf{p} / 2$. These two proposals are used to create
(asymptotically) samples from the joint density
$P_{H}(\mathbf{x}, \mathbf{p}) = \frac{1}{Z_{H}} \exp{[-H(\mathbf{x}, \mathbf{p})]} = \frac{1}{Z_{H}}\exp{[-E(\mathbf{x})]} \exp{[-K(\mathbf{p})]}$

This density is separable, so the marginal distribution of $\mathbf{x}$ is the
desired distribution $\exp{[-E(\mathbf{x})]}/Z$. So, simply discarding the momentum variables, we obtain a sequence of samples $\lbrace\mathbf{x}^{(t)}\rbrace$ that asymptotically come from $P(\mathbf{x})$.
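The two alternating proposals can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from either source. The state is one-dimensional, the dynamics are simulated with the standard leapfrog scheme, and the target is a toy standard Gaussian with $E(x) = x^2/2$. The function name `hmc_sample_1d` and the default `step_size` and `n_leapfrog` values are my own choices.

```python
import math
import random


def hmc_sample_1d(energy, grad_energy, x0, n_samples=2000,
                  step_size=0.1, n_leapfrog=20, seed=0):
    """Draw samples from P(x) proportional to exp(-energy(x)) for scalar x."""
    rng = random.Random(seed)
    x = float(x0)
    samples = []
    n_accepted = 0
    for _ in range(n_samples):
        # Proposal 1: draw a fresh momentum, leaving x unchanged.
        p = rng.gauss(0.0, 1.0)
        h_old = energy(x) + 0.5 * p * p

        # Proposal 2: simulate Hamiltonian dynamics with leapfrog steps.
        x_new = x
        p_new = p - 0.5 * step_size * grad_energy(x_new)  # initial half step in p
        for i in range(n_leapfrog):
            x_new += step_size * p_new                     # full step in x
            if i < n_leapfrog - 1:
                p_new -= step_size * grad_energy(x_new)    # full step in p
        p_new -= 0.5 * step_size * grad_energy(x_new)      # final half step in p
        h_new = energy(x_new) + 0.5 * p_new * p_new

        # Metropolis rule: accept with probability min(1, exp(h_old - h_new)).
        if h_new <= h_old or rng.random() < math.exp(h_old - h_new):
            x = x_new
            n_accepted += 1

        # Discard the momentum and keep only the position.
        samples.append(x)
    return samples, n_accepted / n_samples


# Toy target (an assumption for illustration): E(x) = x^2 / 2, gradient x.
samples, acceptance_rate = hmc_sample_1d(lambda x: 0.5 * x * x,
                                         lambda x: x, x0=0.0)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance, acceptance_rate)
```

For this toy target the sample mean should be close to 0 and the variance close to 1. In practice the step size and the number of leapfrog steps need tuning for each problem.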

An algorithm for Hamiltonian Monte-Carlo
First, we need to compute the gradient of our objective function.
Hamiltonian Monte-Carlo makes use of the fact that we can write our likelihood as
$\mathcal{L} = \exp(\log \mathcal{L}) = \exp(-E)$
where $E = -\log\mathcal{L}$ is the “energy”. The algorithm then uses
Hamiltonian dynamics to modify how candidates are proposed.

 The true value we used to generate the data was $\alpha = 2.35$.
The Monte-Carlo estimate is $\hat{\alpha} = 2.3507$.

Mean: 2.35130302322
Sigma: 0.00147375480435
Acceptance rate = 0.645735426457