Recently I decided to do some quick data analysis of my interviews with data scientists.
Once you have collected a lot of data, it seems natural to explore it.
You can access the code here.
The code doesn’t go into much depth, but it is a simple example of how to use NLTK and a few other Python libraries to do some quick analysis of ‘unstructured’ data.
What does a word cloud of the data look like?
Here we can see that science, PhD, big, etc. all pop up a lot, which is not surprising given the subject matter.
Then I used NLTK to do some word frequency analysis. First I removed stop words and punctuation.
I got the following result. Unsurprisingly, the most common word was data, followed by science. The other words are of interest, however, since they indicate what professional data scientists talk about with regard to their work.
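For readers who want to try the stop-word and frequency step themselves, here is a minimal sketch using only the Python standard library. It is not the code behind this post. The `STOP_WORDS` set, the `word_frequencies` helper and the sample sentence are all made up for illustration. With NLTK installed, you would typically swap the small set for `nltk.corpus.stopwords.words("english")` and could use `nltk.FreqDist` in place of `Counter`.

```python
import string
from collections import Counter

# Hypothetical mini stop-word list, for illustration only. With NLTK you
# would normally use nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it",
    "i", "we", "with", "for", "on", "that", "this",
}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count words after lowercasing and dropping punctuation and stop words."""
    # Map every ASCII punctuation character to a space, so "data-driven"
    # becomes two tokens. Curly quotes are not covered by string.punctuation.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample text standing in for the interview transcripts.
sample = "Data science is the science of data. We work with big data!"
freq = word_frequencies(sample)
print(freq.most_common(2))
```

On real interview text you would pass in the full transcripts as one string and look at something like `freq.most_common(20)`.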
Source: All interviews I published on Dataconomy up to the end of last week, which was the end of September 2015.