Dear Advisor: Why is my ML in Production doing WORSE than locally?
Post was originally published in January 2021, and updated for relevance in February 2021 and December 2023
First, congratulations on getting your model to production. That was no easy task!
Remember that models in production will typically perform (at least) slightly worse than offline models, even if you used the same metrics to evaluate them, because the model is seeing data it hasn't seen before. But what are some reasons the difference might be big?
As an aside: If the model performance was almost perfect even in production (without leakage), you may not need ML -- a rules-based approach may be just as good (and easier to maintain).
Part I: Possible Reasons
If you're seeing a large difference in model performance between what you saw when you ran it locally and what's now happening in production, here are some possible reasons:
Overfitting on the local version
Assessment: How did it perform locally on brand-new, out-of-sample data it had never seen before? (See the sketch after this list.)
Bug in data processing/ETL pipeline/etc.
Schema changed
Data sample/split used for model training misses seasonality that shows up in production
Metrics on the live model are different from what was optimized for locally
Assessment: Did the offline model optimize for the business use case?
Assessment: Did you evaluate at the same granularity (e.g., averages vs. individual predictions) locally and in production?
Model drift, for instance because customer behavior has changed or the training data was biased
Someone intervened to stop poor performance when the model went live
(Most likely cause) Leakage on local version
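Here's a minimal sketch of how you might start checking a few of these locally. It assumes a binary `target` column, illustrative file paths (`offline_extract.csv`, `fresh_extract.csv`), and a simple stand-in model -- none of these names come from your setup, so adapt as needed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

offline = pd.read_csv("offline_extract.csv")   # data the model was built on
fresh = pd.read_csv("fresh_extract.csv")       # recent, never-before-seen live data

# Schema check first: columns silently dropped or renamed by the production
# pipeline are a classic cause of a sudden performance gap.
missing = set(offline.columns) - set(fresh.columns)
if missing:
    print(f"Schema changed -- columns missing from the fresh extract: {missing}")

features = [c for c in offline.columns if c != "target" and c in fresh.columns]
X_tr, X_val, y_tr, y_val = train_test_split(
    offline[features], offline["target"], test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
for name, X, y in [("offline holdout", X_val, y_val),
                   ("fresh extract", fresh[features], fresh["target"])]:
    print(f"{name}: AUC = {roc_auc_score(y, model.predict_proba(X)[:, 1]):.3f}")
# A big gap between these two numbers points at overfitting, drift, or leakage
# rather than a serving-side bug.
```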
Part II: Diving Deeper into Leakage
What is leakage? Data leakage happens when the training data contains information about the target you're trying to predict [ref].
e.g. (Most common) Leaking information from the future into the past [ref]
Assessment: Are you seeing 99.9% accuracy in your offline model?
e.g. “Leaking data from the test set into the training set” [ref]
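To make these two patterns concrete, here is a minimal sketch of guarding against both -- a time-based split against future-into-past leakage, and a scikit-learn Pipeline so preprocessing is fit only on the training fold. The column names (`event_time`, `target`) and file path are illustrative assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("events.csv", parse_dates=["event_time"]).sort_values("event_time")
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# 1) Future-into-past: split on time, never randomly, so the model only ever
#    trains on data that strictly precedes the evaluation window.
cutoff = int(len(df) * 0.8)
X_train, X_test = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

# 2) Test-into-train: fit preprocessing (scaler, imputer, encoders) on the
#    training fold only. A Pipeline guarantees the scaler never sees X_test
#    at fit time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print("time-split test accuracy:", pipe.score(X_test, y_test))
```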
Examples of Leakage (a quick detection sketch follows this list):
Epic's AI algorithm to predict sepsis (~2017) suffered from data leakage because it used a physician's order for antibiotics, prescribed precisely to treat suspected sepsis, as one of the indicators of sepsis.
An ML model that predicts the existence of a condition from MRI images will also seem to do well, because to get an MRI a patient first needs a referral to a specialist and then to be seen by a radiologist who captures any “weird-looking” images. The population is so pre-filtered that a naive model predicting everyone has the condition will do very well (!).
Creating leaky variables by accident
ICML Whale challenge: the file size, timestamp, and clip order variables were leaky
Healthcare:
KDD Breast Cancer Identification: ID was leaky
Invalidation of Dozens of Medical Studies: leaked data from test set into training
Predicting outcomes from interventions that happened after admission into hospital
Detecting early Alzheimer's and using Clinical Dementia Rating as one of the features, as it's a test given to people suspected of early onset
Kaggle PetFinder: leakage used in cheating
Predicting likes from views of content
Including the completion date when forecasting whether a construction project will go over budget
Including the date of first payment in a loan repayment prediction model
Predicting churn while including time since last active, or number of times contacted by customer support, as a feature
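One cheap way to hunt for accidentally leaky variables like those above is to score each feature on its own: a single feature with near-perfect AUC deserves suspicion. A rough sketch, assuming a numeric feature table with a binary `target` column (the file and column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
y = df["target"]

# Fit a shallow tree on each feature alone; a lone feature that nearly solves
# the problem by itself is a leakage suspect (like ID or file size above).
for col in df.columns.drop("target"):
    X = df[[col]].fillna(-999)  # crude imputation; fine for a smoke test
    auc = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y,
                          scoring="roc_auc", cv=5).mean()
    if auc > 0.95:
        print(f"{col} is suspiciously predictive on its own (AUC={auc:.3f})")
```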
Part III: Recommended Next Steps
Depending on the cause, you may want to consider turning off the live model and exploring locally -- on both an older and a fresher data extract -- the potential reasons (above) for such different model performance live vs. offline.
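For that old-vs-fresh comparison, one rough approach is a per-feature Population Stability Index (PSI) to see which inputs have shifted. The file names and the 0.2 threshold below are illustrative rules of thumb, not hard requirements:

```python
import numpy as np
import pandas as pd

def psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """Population Stability Index between an old and a fresh numeric sample."""
    edges = np.quantile(expected.dropna(), np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the old range
    edges = np.unique(edges)                # guard against duplicate quantiles
    e = np.histogram(expected.dropna(), edges)[0] / max(len(expected.dropna()), 1)
    a = np.histogram(actual.dropna(), edges)[0] / max(len(actual.dropna()), 1)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

old = pd.read_csv("old_extract.csv")
fresh = pd.read_csv("fresh_extract.csv")
for col in old.select_dtypes("number").columns:
    score = psi(old[col], fresh[col])
    if score > 0.2:   # common rule-of-thumb threshold for a significant shift
        print(f"{col}: PSI = {score:.2f} -- distribution has drifted")
```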
Do you need an expert to help you figure out what happened? Please reach out.
Keywords: AI/ML in production, data products, customer understanding
You may also like:
More tips on leakage detection, by Abhay Pawar
Why Machine Learning Models Degrade In Production, by Alexandre Gonfalonieri
What’s your ML test score? A rubric for ML production systems, by Google (2016)
Introducing the 5 Pillars of Data Observability, by Barr Moses
Past Performance is Not Indicative of Future Results, by Cory Doctorow
Manipulating SGD with Data Ordering Attacks, by Ilia Shumailov, Zakhar Shumaylov, Dmitry Kazhdan, Yiren Zhao, Nicolas Papernot, Murat A. Erdogdu, Ross Anderson