Applied Data Science and Machine Learning — Why We Are Light-Years Away from the Singularity

Torsten Volk
4 min read · Mar 19, 2019


I remember my exasperation after a one-hour conversation with a data scientist who kept insisting that “neural networks” cannot be trained effectively unless the data is cleaned up, restructured, and annotated very well. “Then,” he said, “it takes experience to decide on the type of learning model and to configure its hyperparameters.”

Artificial Intelligence in 2019 — Not so Intelligent after All

While some are busy musing about the singularity (the point where AI surpasses human intelligence), I spent the last two weeks training models only to find that the results were garbage because I had not been diligent about cleaning up the input data. “Stupid stuff” like ‘https’, ‘http’, and full URLs skewing word clouds and charts was only the tip of a very large iceberg, but it already showed how much time and how many CPU/RAM cycles one can waste by not continuously testing intermediary results.
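A quick preprocessing pass can strip that “stupid stuff” before any word cloud is built. Here is a minimal sketch; the regular expressions are illustrative, not the exact rules used in the project:

```python
import re

def strip_urls(text: str) -> str:
    """Remove full URLs and leftover scheme tokens before building word clouds."""
    text = re.sub(r"https?://\S+", " ", text)        # full URLs
    text = re.sub(r"\b(https?|www)\b", " ", text)    # stray 'http'/'https'/'www' tokens
    return re.sub(r"\s+", " ", text).strip()         # collapse the gaps left behind

print(strip_urls("see https://example.com/docs and http raw token"))  # → "see and raw token"
```

Running a pass like this over the corpus before tokenization keeps scheme fragments from dominating term-frequency counts.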

Even Automatic Machine Learning Does not Automatically Yield Usable Results

The four exciting parts of automatic machine learning are automatic data cleansing, automatic model selection, automatic hyperparameter tuning, and running multiple models against each other in order to pick the one with the lowest error rate. While all of this is fantastic, without understanding your data sources, including their dependencies, specific column characteristics, and unnecessary data points that prolong the model-building process by hours or days, automatic machine learning is still not a turnkey, set-and-forget process. Instead, it provides guardrails that help experienced users, who do not have to be data scientists, achieve a good level of success.
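The “race multiple models and keep the winner” part can be sketched in a few lines with scikit-learn. The candidate list and the synthetic data set below are illustrative stand-ins for the much larger search space a real auto-ML tool would explore:

```python
# Sketch: score several candidate models and keep the one with the
# highest cross-validated accuracy (i.e., the lowest error rate).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Note what the loop does not do: it cannot tell you that a column is garbage, mis-scaled, or leaking the label, which is exactly where the human guardrail work remains.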

Example: The below data set shows the percentage of projects where data scientists leverage machine learning models. The problem with the underlying data, however, was the definition of “data scientist” versus “data engineer” versus “data-savvy developer” versus someone who merely considers themselves a data scientist but lacks the qualifications. In short, to arrive at the below chart we needed to go through a number of extra steps to eliminate about 40% of the rows in the data set, as they contained responses from irrelevant respondents. As it later turned out, removing those rows lowered the measured “machine learning adoption” by almost 20%.
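The row-culling step might look like this in pandas; the column name, role labels, and tiny sample frame are assumptions for illustration only:

```python
# Sketch: drop rows from respondents whose self-reported role does not
# qualify them as data scientists, then recompute the adoption metric.
import pandas as pd

df = pd.DataFrame({
    "role": ["data scientist", "data engineer", "data savvy developer",
             "data scientist", "self-described data scientist"],
    "uses_ml": [True, False, True, True, True],
})

qualified = df[df["role"] == "data scientist"]
adoption = qualified["uses_ml"].mean()
print(f"{len(df) - len(qualified)} rows removed; "
      f"ML adoption among qualified respondents: {adoption:.0%}")
```

In the real survey the filter was far messier than an exact string match, which is precisely why it took a number of extra steps.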

Here’s the Thing: You Can Do it, but How Long Will it Take?

Even training small models with only a few GB of training data often takes hours and multiple CPUs or GPUs. And if you do not wait for the training results, you may train multiple models in parallel only to realize that they all need retraining because you left in a garbage column, or because you forgot to change the scale or format of an input parameter. Or, just as bad, you falsely eliminate an input column and lose critical predictive power. The insidious part is that you may not actually notice this issue until much further down the road, when you arrive at contradictory conclusions and may not even remember the earlier culling of that column. This makes it easy to lose an entire week on simple issues that an experienced data scientist would, of course, have noticed and eliminated immediately.
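One cheap way to dodge that trap is a “smoke test” on a small subsample before committing hours to a full training run. This sketch flags constant and mostly-missing columns; the thresholds, column names, and toy frame are illustrative assumptions:

```python
# Sketch: inspect a cheap subsample for likely garbage columns before
# launching a long (and expensive) training run.
import pandas as pd

def smoke_test(df: pd.DataFrame, sample_size: int = 1000) -> list:
    """Return column names that look like garbage on a quick subsample."""
    sample = df.sample(min(sample_size, len(df)), random_state=0)
    suspects = []
    for col in sample.columns:
        if sample[col].nunique(dropna=True) <= 1:   # constant column
            suspects.append(col)
        elif sample[col].isna().mean() > 0.5:       # mostly missing values
            suspects.append(col)
    return suspects

df = pd.DataFrame({"feature": [1, 2, 3, 4],
                   "constant": [7, 7, 7, 7],
                   "mostly_missing": [None, None, None, 1.0],
                   "label": [0, 1, 0, 1]})
print(smoke_test(df))  # → ['constant', 'mostly_missing']
```

A check like this runs in seconds and catches exactly the kind of issue that otherwise surfaces a week later as contradictory conclusions.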

Example: Translation needed. The below chart shows the number of support questions for the major container management platforms. But why is Docker showing so low? The docker-ee tag on Stack Overflow is definitely the right one, but it turns out that people are not very good at using it, due to the somewhat awkward product name: Docker Enterprise Edition. We also cannot simply use the “docker” tag, as most people apply it to everything that has to do with containers, which in most cases has nothing to do with the commercial Docker Enterprise Edition. This means that we have to figure out how to get this information into our data table so that we can create a more realistic market chart. That involves finding and analyzing rows that are about Docker EE but did not apply the tag. Needless to say, this is a time-consuming procedure that pure-play data scientists will struggle with, as they may not know enough about the subject matter to develop rules to automatically tag records as “docker-ee”.
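A domain expert's re-tagging rules might start out like the sketch below. The phrase list is an assumption for illustration; a real project would refine it with people who know the product line (UCP and DTR are Docker EE components that rarely appear outside that context):

```python
# Sketch: re-tag rows as "docker-ee" when the question text mentions
# Docker Enterprise Edition but the author omitted the tag.
import re

DOCKER_EE_PATTERNS = [
    r"docker\s+enterprise(\s+edition)?",
    r"\bdocker[-\s]?ee\b",
    r"\bucp\b",   # Universal Control Plane, part of Docker EE
    r"\bdtr\b",   # Docker Trusted Registry, part of Docker EE
]
EE_REGEX = re.compile("|".join(DOCKER_EE_PATTERNS), re.IGNORECASE)

def add_docker_ee_tag(row: dict) -> dict:
    """Append 'docker-ee' to a question's tags when its text matches the rules."""
    text = f"{row.get('title', '')} {row.get('body', '')}"
    if "docker-ee" not in row["tags"] and EE_REGEX.search(text):
        row["tags"] = row["tags"] + ["docker-ee"]
    return row

question = {"title": "Upgrade Docker Enterprise Edition cluster",
            "body": "UCP nodes fail after upgrade", "tags": ["docker"]}
print(add_docker_ee_tag(question)["tags"])  # → ['docker', 'docker-ee']
```

This is exactly the subject-matter knowledge a pure-play data scientist typically lacks: which product names, abbreviations, and components imply the commercial edition.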

First Extracting, then Cleaning, then Analyzing — the Machine Learning / AI Model Comes Last

Many times I have written about the fact that 90% of relevant data is not accessible to the people who could build kick-ass predictive models with it. But continuously writing about this almost made me forget the actual pain involved in extracting data in a usable format. I will document my process in another post, but only this much for now: knowing how to get access to a data source is not the same as being able to extract the relevant parts of that data in a usable format, and then connecting it to the rest of your data sources in a manner that does not include implicit assumptions that, if wrong, can skew and invalidate the entire project.

This was just the tip of the iceberg. I will be sure to continue this series on a regular basis, next time with some screenshots and painful examples that can throw projects off schedule in a heartbeat.



Written by Torsten Volk

Industry analyst for application development and modernization at the Enterprise Strategy Group (by InformaTechTarget).
