I recently caught up with Alejandro, who co-organizes the Luxembourg Data Science Meetup. We’re friends and we regularly talk about Data Science. Alejandro will return to Colombia soon, once he obtains his PhD.
We recently spoke at the same event in Berlin. I recommend his talk because it tackles the ‘academic to industry’ divide, which a lot of us struggle with.
1. What project that you have worked on do you wish you could go back to and do better?
At the beginning of my PhD I spent about 12 months preprocessing credit card transactional data without any guidance. Even though I learned a lot, most of that time was spent trying different technologies to extract features (Octave, R, SQL, Python) without getting real insights from the data until the very end. Now, if that kind of problem arose again, I would know that there are helpful communities (PyData, RUsers, Stack Overflow, among others) that can quite easily give you very good starting points.
2. What advice do you have for younger analytics professionals, and in particular PhD students in the Sciences?
That one is easy. GET SOFT SKILLS!!!
Quite often I have found myself in highly technical discussions that are unrelated to the actual business realities. This leads to a focus on issues that may not be the most important ones for the customer or company. Also, extremely good hard skills (coding, statistics, software engineering) can only take you so far. In most cases, whether you’re working as a consultant or in a company, you’re going to be in a position in which you’re selling a data product to someone who doesn’t have any understanding of data science. That’s where the soft skills kick in. You must be able to clearly understand the customer’s needs, background and expectations. Most of the time, your customer will be happy with the results of a logistic regression, so all the time you spent tuning an SVM could have been used for other things.
3. What do you wish you knew earlier about being a data scientist?
To rely more on open-source software and platforms.
4. How do you respond when you hear the phrase ‘big data’?
I hate that name. It has become a buzzword with no meaning whatsoever.
As was noted recently by @mrocklin, 90% of databases are in gigabyte territory, 9% in terabyte territory and only 1% in petabyte territory. So unless you’re in that last 1%, you don’t really have to worry about using “big data” tools. Moreover, I think most of the struggle with larger datasets can be solved by making better use of traditional tools like SQL. I recently read this quote by @dbasch: “Many companies think they have a ‘big data’ problem when they really have a big ‘data problem’.” I don’t have more to say… 🙂
5. What is the most exciting thing about your field?
I would say the most interesting thing is to start seeing real commitment from industry leaders to actually get on board with data science.
6. How do you go about framing a data problem – in particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
In general I try to have a good first prototype as soon as possible, typically using a standard model such as a logistic regression or a random forest in the case of a classification problem. This gives me a baseline. Afterwards it really depends on the particular problem. More often than you might think, the result of a logistic regression is more than adequate for the problem at hand. I try to avoid spending too much time on feature selection, as it’s easy to lose a lot of time there.
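A minimal sketch of that baseline-first approach, assuming scikit-learn and a synthetic dataset standing in for real data. The helper name `baseline_scores`, the majority-class reference model and the choice of accuracy as the metric are illustrative assumptions, not something taken from the interview:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def baseline_scores(X, y, cv=5):
    """Return the mean cross-validated accuracy of a few standard models.

    Hypothetical helper: the model list and metric are illustrative choices.
    """
    models = {
        # Always predicts the most frequent class: the floor to beat.
        "majority_class": DummyClassifier(strategy="most_frequent"),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic data used purely as a stand-in for a real classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
scores = baseline_scores(X, y)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the logistic regression already clears the bar the customer cares about, that is a reasonable signal to stop tuning; a more complex model only has to justify itself against these numbers.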
7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?
I think both. It really depends on the context. I have seen a lot of people using the rebranding to sell more, but other than that they carry on with business as usual.
8. You worked as an Analytics professional in Colombia. Could you comment on the difference between a Data Scientist and an Analytics Professional?
In my experience, analytics consists of doing the analysis, modeling and predictions, and data science complements that by giving you more tools for data extraction and, finally, for implementing the different models. I think that for analytics you can rely on statistical and data mining skills, whereas in data science you must complement those with software engineering skills.
Bio: Alejandro Correa Bahnsen is currently working towards a Ph.D. in Machine Learning at Luxembourg University. His research area relates to cost-sensitive classification and its application in a variety of real-world problems such as fraud detection, credit risk, direct marketing and churn modeling. He also works part time as a fraud data scientist at CETREL, a SIX Company, applying his research to detecting fraud. Before starting his PhD, he worked for five years as a data scientist at GE Money and Scotiabank, applying data mining models in a variety of areas from advertisement to financial risk management. He has written and published many academic and industrial papers in leading peer-reviewed publications. Alejandro also has experience as an instructor of econometrics, financial risk management and machine learning.