What frustrates Data Scientists in Machine Learning projects?

Ganes Kesari
7 min read · May 14, 2018


There is an explosion of interest in data science today. One just needs to insert the tag-line ‘Powered-by-AI’, and anything sells.

But, that's where the problems begin.

Data science sales pitches often promise the moon. Then, clients raise the expectations a notch up and launch their moonshot projects. Ultimately, it’s left to the data scientists to take clients to the moon, or leave them marooned.

An earlier article, ‘4 Ways to fail a Data scientist Job interview’ looked at the key blunders candidates commit while pursuing a career in data science. Now, we wade into the fantasy world of expectations from data science projects and find out the top misconceptions held by clients.

Here we’ll talk about the 8 most common myths I’ve seen in machine learning projects, and why they annoy data scientists. Whether you’re just getting into data science or are already a practitioner, these are potential grenades that might be hurled at you, so it’s handy to know how to handle them.

“All models are wrong, but some are useful.” — George Box

Photo by Andre Hunter on Unsplash

Myth 1. “We want an AI model.. build one to solve THIS problem”

A large majority of industry problems in analytics can be solved with simple exploratory data analysis. If machine learning is overkill for these, let’s not even get started on why AI is futile here. Why use a cannon to kill a fly?

Yes, advanced analytics is cool. Every business likes to talk about being the first in their industry to deploy the latest technology. And which vendor doesn’t want to flaunt an AI project? But one needs to educate clients and call out the use cases that really warrant the heavy artillery from the ML armory. For all other needs, convince clients by showing business value with exploratory data analysis, statistics, or other such time-tested techniques.

“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky

Myth 2. “Take this data.. and come back with transformational insights”

Often, clients think that their responsibility ends with handing over the data. Some even stop at the problem definition, but we’ll see that in Myth 4! They ask analysts to take the data and come back with a deck of earth-shattering business insights that will change the organization overnight.

Unfortunately, unlike creative writing, one can’t think up actionable business recommendations in isolation. It calls for continuous iteration and productive dialogues with business users on what is pertinent and actionable for them. Plan for quality time with business folks periodically, throughout the project.

“If you do not know how to ask the right question, you discover nothing.” — W. Edwards Deming

Myth 3. “Build a model, and save time by skipping unnecessary analysis”

Many data scientists overlook the importance of performing data wrangling and exploratory analysis before opening their model toolbox. Hence they miss seeing the risk when clients ask for cutting out ‘unnecessary analysis’ from the critical path, in order to save precious project time.

Data exploration and analysis are mandatory pre-steps to machine learning and all other advanced techniques. Without getting a feel for the data, discovering outliers or spotting underlying patterns, models do nothing but shoot in the dark. Always earmark time for analysis, and onboard clients by sharing interesting findings.
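As a minimal sketch of the kind of pre-modeling check meant here, the classic 1.5×IQR rule flags outliers before any model ever sees the data. The function name and the revenue series are invented for illustration:

```python
import numpy as np

def find_outliers_iqr(values):
    """Flag points beyond 1.5 * IQR of the quartiles -- a classic
    first-pass exploratory check before any modeling starts."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# A revenue series with one suspicious spike
revenue = [100, 102, 98, 101, 99, 103, 97, 500]
print(find_outliers_iqr(revenue))  # -> [500]
```

Ten minutes of checks like this can save weeks of debugging a model that was quietly being trained on bad data.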

“The alchemists in their search for gold discovered many other things of greater value.” — Arthur Schopenhauer

Myth 4. “We have last week’s data, can you predict the next 6 months?”

This is a pet peeve of data scientists. Clients cobble together a few rows of data in spreadsheets. They then expect AI to do the magic of crystal ball gazing, deep into the future. At times this gets quite weird when clients confess to not having any data, and then genuinely wonder if machine learning can fill in the gaps.

Data quality and volume are non-negotiable. “Garbage in, garbage out” applies equally well to analytics. Statistical techniques come in handy when data is limited, helping you extract more from less. For instance, impute the missing points, use SMOTE to generate synthetic samples, or fall back to simpler models at low volumes. But this calls for toning down the client’s expectations on the model results and project outcomes.
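The simplest of those remedies, mean imputation, can be sketched in a few lines (in practice scikit-learn’s `SimpleImputer` and imbalanced-learn’s `SMOTE` are the usual library routes; the function name and sales figures here are made up for the example):

```python
import numpy as np

def impute_mean(column):
    """Replace NaNs with the column mean -- the simplest form of
    imputation. Library alternatives (sklearn.impute.SimpleImputer,
    imblearn's SMOTE for minority-class oversampling) go further."""
    col = np.asarray(column, dtype=float)
    mean = np.nanmean(col)            # mean of the observed values only
    return np.where(np.isnan(col), mean, col)

sales = [10.0, np.nan, 14.0, np.nan, 12.0]
print(impute_mean(sales))  # -> [10. 12. 14. 12. 12.]
```

Note that imputation fills gaps in otherwise-decent data; it cannot conjure signal out of a dataset that was never collected.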

Performance of analytics techniques with data volume: Source Andrew Ng

Myth 5. “Can you finish the modeling project in 2 weeks?”

In any business-critical project, the results are expected as of yesterday, even when the kickoff is planned today. In a rush to crash project timelines, a common casualty is the model engineering phase. With the free availability of model APIs and easy access to GPU computing, clients wonder what slows down the data scientists.

Despite advances in Auto-ML, there is an unmissable manual element in the modeling process. Data scientists must examine statistical results, compare models and check interpretations, often across painful iterations. This cannot be automated away. At least, not yet. It’s best to enlighten clients on the data science lifecycle by sharing examples and illustrating what could get missed out if steps are skipped.

Modeling is part experimentation and part art, so milestone-driven project plans may not be very precise.

Myth 6. “Can you replace the Outcome variable and just hit refresh?”

After data scientists crack the problem of modeling business behavior, new client requests often crop up as incremental changes. At times, they ask to replace the outcome variable and refresh results quickly by re-running the model. Clients don’t realize that such changes don’t merely move the goalposts, but switch the game from soccer to basketball.

While machine learning is highly iterative, the core challenge is to pick the right influencers for a given outcome variable and map their relationship. Clients must be educated upfront on how this works, and the levers that they can play with freely. They must also be cautioned on those parameters that need careful planning upfront, and how all hell will break loose if these get changed beyond defined milestones.
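A toy demonstration of why swapping the outcome isn’t a “refresh”: the features that matter depend entirely on which target you model. The helper below (a hypothetical stand-in for real feature selection, using plain correlation on synthetic data) picks a different top influencer for each outcome:

```python
import numpy as np

def top_feature(X, y, names):
    """Return the feature most correlated (in absolute value) with the
    target -- a toy stand-in for picking the right influencers."""
    corrs = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
    return names[int(np.argmax(corrs))]

rng = np.random.default_rng(0)
price = rng.normal(size=200)
ads = rng.normal(size=200)
X = np.column_stack([price, ads])

revenue = 3 * price + rng.normal(scale=0.1, size=200)  # driven by price
clicks = 2 * ads + rng.normal(scale=0.1, size=200)     # driven by ads

print(top_feature(X, revenue, ["price", "ads"]))  # -> price
print(top_feature(X, clicks, ["price", "ads"]))   # -> ads
```

Change the outcome and the whole chain of feature selection, model choice, and validation has to be redone, not merely re-run.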

Myth 7. “Can we have a model accuracy of 100%?”

People often get hung up on error rates. Quite like a blind pursuit of test grades, clients want the accuracy to be closest to 100%. This turns worrisome when accuracy becomes the singular focus, trumping all other factors. How useful is it to build a highly accurate model that’s too complex to be made live?

The model that won the million-dollar Netflix Prize with the highest accuracy never went live, since its extreme complexity meant heavy engineering costs; a simpler model with lower accuracy was adopted instead. Always balance accuracy with simplicity, stability, and business interpretability. This calls for decisive trade-offs and judgment calls, made by taking the client into confidence.
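One way to make that trade-off explicit rather than ad hoc is a simple selection rule: among candidate models, take the simplest one whose accuracy is within a tolerance of the best. The function, the tolerance, and the candidate scores below are all invented for illustration:

```python
def pick_model(candidates, tolerance=0.02):
    """Among (name, accuracy, complexity) tuples, pick the simplest
    model whose accuracy is within `tolerance` of the best score --
    one explicit way to encode the accuracy/simplicity trade-off."""
    best_acc = max(acc for _, acc, _ in candidates)
    eligible = [c for c in candidates if best_acc - c[1] <= tolerance]
    return min(eligible, key=lambda c: c[2])[0]

models = [
    ("blended ensemble", 0.91, 100),  # most accurate, heaviest to deploy
    ("gradient boosting", 0.90, 30),
    ("logistic regression", 0.86, 5),
]
print(pick_model(models))  # -> gradient boosting
```

Writing the rule down forces the conversation with the client: how much accuracy is one point of deployment complexity actually worth?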

Model Engineering: Achieving the fine balance and trade-off

Myth 8. “Can the trained model stay smart forever?”

After putting hard work into model building and testing, clients wonder whether the machine has learned all it ever needs to. A common question is whether it can stay smart and adapt to all future changes in business dynamics.

Unfortunately, machines don’t learn for life. Models need to be constantly and patiently taught. And they need a quick refresher session every few weeks or months, like that struggling student in school. More so when the context changes. That’s where the analytics industry is today, though it’s fast evolving. So, for now, do budget time and effort for model maintenance and patient updates.
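In operational terms, those refresher sessions are usually triggered by a monitoring check: compare live accuracy against the accuracy measured at deployment and retrain when it drifts too far. A minimal sketch, with an invented function name and an arbitrary 5-point tolerance:

```python
def needs_refresh(baseline_acc, recent_acc, max_drop=0.05):
    """Flag a model for retraining when live accuracy has drifted
    more than `max_drop` below the accuracy measured at deployment."""
    return (baseline_acc - recent_acc) > max_drop

print(needs_refresh(0.90, 0.88))  # -> False (within tolerance)
print(needs_refresh(0.90, 0.80))  # -> True  (time for a refresher)
```

Real monitoring setups track more than accuracy (input distributions, prediction drift), but even this crude check beats discovering six months later that the model went stale.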

Conclusion

We’ve looked at the 8 key misconceptions in projects, which can also be slotted into six phases of the ML modeling lifecycle, as shown below.

Machine learning project lifecycle

What fuels almost all of the above misconceptions is a lack of awareness and misplaced priorities within a project. After all, every client and business team works under stringent timelines, tight budgets, and not-so-perfect data streams. Data scientists should be able to empathize with clients and understand the real reasons for these disconnects. That will enable them to educate stakeholders and provide examples to drive home their point.

Data science teams should adopt a combination of gentle prodding and amicable trade-offs, making decisions that don’t compromise the final outcomes of their projects. Good luck handling these common issues in your next machine learning engagement!

Are there any other misconceptions that aren’t listed here? Leave your comments below to continue the conversation.


Passionate about data science? Feel free to add me on LinkedIn and subscribe to my Newsletter.



Co-founder & Chief Decision Scientist @Gramener | TEDx Speaker | Contributor to Forbes, Entrepreneur | gkesari.com