Dear Advisor: Why is my ML in Production doing WORSE than locally?
Post was originally published in January 2021, and updated for relevance in February 2021 and December 2023
First, congratulations on getting your model to production. That was no easy task!
Remember that models in production will typically perform (at least) slightly worse than offline models, even if you used the same metrics to evaluate them, because the model is seeing data it hasn't seen before. But what are some reasons the difference might be big?
As an aside: If the model performance was almost perfect even in production (without leakage), you may not need ML -- a rules-based approach may be just as good (and easier to maintain).
Part I: Possible Reasons
If you're seeing a large difference in model performance between what you saw when you ran it locally and what's now happening in production, here are some possible reasons:
Overfitting on the local version
Assessment: How did it perform locally on brand-new, out-of-sample data it had never seen before? (See the sketch after this list.)
Bug in data processing/ETL pipeline/etc.
Schema changed
Data sample/split used for model training misses seasonality that shows up in production
Metrics on the live model are different from what was optimized for locally
Assessment: Did the offline model optimize for the business use case?
Assessment: Did you evaluate at the same granularity (e.g., averages vs. individual predictions) locally and in production?
Model drift, for instance because customer behavior has changed or the training data was biased
Someone intervened to stop poor performance when the model went live
(Most likely cause) Leakage on local version
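Here's a minimal sketch of how you might start checking a few of these locally. It assumes a binary `target` column, illustrative file paths (`offline_extract.csv`, `fresh_extract.csv`), and a simple stand-in model -- none of these names come from your setup, so adapt as needed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

offline = pd.read_csv("offline_extract.csv")   # data the model was built on
fresh = pd.read_csv("fresh_extract.csv")       # recent, never-before-seen live data

# Schema check first: columns silently dropped or renamed by the production
# pipeline are a classic cause of a sudden performance gap.
missing = set(offline.columns) - set(fresh.columns)
if missing:
    print(f"Schema changed -- columns missing from the fresh extract: {missing}")

features = [c for c in offline.columns if c != "target" and c in fresh.columns]
X_tr, X_val, y_tr, y_val = train_test_split(
    offline[features], offline["target"], test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
for name, X, y in [("offline holdout", X_val, y_val),
                   ("fresh extract", fresh[features], fresh["target"])]:
    print(f"{name}: AUC = {roc_auc_score(y, model.predict_proba(X)[:, 1]):.3f}")
# A big gap between these two numbers points at overfitting, drift, or leakage
# rather than a serving-side bug.
```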
Part II: Diving Deeper into Leakage
What is leakage? Data leakage happens when the training data contains information about the target you're trying to predict [ref].
e.g. (Most common) Leaking information from the future into the past [ref]
Assessment: Are you seeing 99.9% accuracy in your offline model?
e.g. “Leaking data from the test set into the training set” [ref]
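To make these two patterns concrete, here is a minimal sketch of guarding against both -- a time-based split against future-into-past leakage, and a scikit-learn Pipeline so preprocessing is fit only on the training fold. The column names (`event_time`, `target`) and file path are illustrative assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("events.csv", parse_dates=["event_time"]).sort_values("event_time")
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# 1) Future-into-past: split on time, never randomly, so the model only ever
#    trains on data that strictly precedes the evaluation window.
cutoff = int(len(df) * 0.8)
X_train, X_test = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

# 2) Test-into-train: fit preprocessing (scaler, imputer, encoders) on the
#    training fold only. A Pipeline guarantees the scaler never sees X_test
#    at fit time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print("time-split test accuracy:", pipe.score(X_test, y_test))
```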
Examples of Leakage (a quick detection sketch follows this list):
Epic's AI algorithm to predict sepsis (~2017) suffered from data leakage because it used a physician's order for antibiotics, prescribed precisely to treat suspected sepsis, as one of the indicators of sepsis.
An ML model that predicts the existence of a condition from MRI images will also seem to do well, because to get an MRI a patient first needs a referral to a specialist and then to be seen by a radiologist who captures any “weird-looking” images. The population is so pre-filtered that a naive model predicting everyone has the condition will do very well (!).
Creating leaky variables by accident
ICML Whale challenge: the file size, timestamp, and clip order variables were leaky
Healthcare:
KDD Breast Cancer Identification: ID was leaky
Invalidation of Dozens of Medical Studies: leaked data from test set into training
Predicting outcomes from interventions that happened after admission into hospital
Detecting early Alzheimer's and using Clinical Dementia Rating as one of the features, as it's a test given to people suspected of early onset
Kaggle PetFinder: leakage used in cheating
Predicting likes from views of content
Including the completion date when forecasting whether a construction project will go over budget
Including the date of first payment in a loan repayment prediction model
Predicting churn while including time since last active, or number of times contacted by customer support, as a feature
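One cheap way to hunt for accidentally leaky variables like those above is to score each feature on its own: a single feature with near-perfect AUC deserves suspicion. A rough sketch, assuming a numeric feature table with a binary `target` column (the file and column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
y = df["target"]

# Fit a shallow tree on each feature alone; a lone feature that nearly solves
# the problem by itself is a leakage suspect (like ID or file size above).
for col in df.columns.drop("target"):
    X = df[[col]].fillna(-999)  # crude imputation; fine for a smoke test
    auc = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y,
                          scoring="roc_auc", cv=5).mean()
    if auc > 0.95:
        print(f"{col} is suspiciously predictive on its own (AUC={auc:.3f})")
```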
Part III: Recommended Next Steps
Depending on the cause, you may want to consider turning off the live model and exploring locally -- on both an older and a fresher data extract -- the potential reasons (above) for such different model performance live vs. offline.
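For that old-vs-fresh comparison, one rough approach is a per-feature Population Stability Index (PSI) to see which inputs have shifted. The file names and the 0.2 threshold below are illustrative rules of thumb, not hard requirements:

```python
import numpy as np
import pandas as pd

def psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """Population Stability Index between an old and a fresh numeric sample."""
    edges = np.quantile(expected.dropna(), np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the old range
    edges = np.unique(edges)                # guard against duplicate quantiles
    e = np.histogram(expected.dropna(), edges)[0] / max(len(expected.dropna()), 1)
    a = np.histogram(actual.dropna(), edges)[0] / max(len(actual.dropna()), 1)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

old = pd.read_csv("old_extract.csv")
fresh = pd.read_csv("fresh_extract.csv")
for col in old.select_dtypes("number").columns:
    score = psi(old[col], fresh[col])
    if score > 0.2:   # common rule-of-thumb threshold for a significant shift
        print(f"{col}: PSI = {score:.2f} -- distribution has drifted")
```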
Do you need an expert to help you figure out what happened? Please reach out.
Keywords: AI/ML in production, data products, customer understanding
You may also like:
More tips on leakage detection, by Abhay Pawar
Why Machine Learning Models Degrade In Production, by Alexandre Gonfalonieri
What’s your ML test score? A rubric for ML production systems, by Google (2016)
Introducing the 5 Pillars of Data Observability, by Barr Moses
Past Performance is Not Indicative of Future Results, by Cory Doctorow
Manipulating SGD with Data Ordering Attacks, by Ilia Shumailov, Zakhar Shumaylov, Dmitry Kazhdan, Yiren Zhao, Nicolas Papernot, Murat A. Erdogdu, Ross Anderson