Dear Advisor: Why is my ML in Production doing WORSE than locally?
First, congratulations on getting your model to production. That was no easy task!
Remember that models in production typically perform (at least) slightly worse than they did offline, even if you used the same metrics to evaluate them, because the model is seeing data it hasn't seen before. But what are some reasons the difference might be large?
As an aside: If the model performance was almost perfect even in production (without leakage), you may not need ML -- a rules-based approach may be just as good (and easier to maintain).
Part I: Possible Reasons
If you're seeing a large difference between the model performance you saw locally and what's now happening in production, here are some possible reasons:
Overfitting on the local version
Assessment: How did it perform locally on brand-new, out-of-sample data that it has never seen before?
Bug in data processing/ETL pipeline/etc.
The data sample or split used for model training misses seasonality that is present in the production process
The metrics tracked on the live model differ from the ones that were optimized locally
Model drift, for instance because customer behavior has changed or the training data was biased
Someone intervened to stop poor performance when the model went live
(Most likely cause) Leakage on local version
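To check the drift hypothesis from the list above, one option is to compare how a feature is distributed in the training extract versus recent production data. The sketch below computes a Population Stability Index (PSI) using only the Python standard library. PSI is a technique brought in here for illustration, not something this post prescribes, and the function name `psi`, the bin count, and the sample values are all assumptions.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of `expected` (e.g. training data).
    Values in `actual` (e.g. production data) that fall outside that range
    are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clip out-of-range values
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / total, eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: the "live" sample is the training sample shifted by 3.
train_values = [i / 100 for i in range(1000)]
live_values = [v + 3 for v in train_values]

print("PSI, train vs. train:", psi(train_values, train_values))
print("PSI, train vs. live: ", psi(train_values, live_values))
```

A common rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a shift worth investigating, but those cutoffs are conventions rather than guarantees.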
Part II: Diving Deeper into Leakage
What is leakage? Data leakage happens when the training data contains information about the thing you’re trying to predict that would not be available at prediction time [ref].
e.g. (Most common) Leaking information from the future into the past [ref]
Assessment: Are you seeing 99.9% accuracy in your offline model?
e.g. “Leaking data from the test set into the training set” [ref]
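Both kinds of leakage above can often be reduced by splitting on time rather than at random, and then checking what the two sides of the split still share. Here is a minimal sketch, assuming each record is a dict with hypothetical `event_date` and `customer_id` fields; the helper names and the sample rows are illustrative only.

```python
from datetime import date

def time_based_split(rows, cutoff, time_key="event_date"):
    """Put everything before `cutoff` in train and everything else in test,
    so the model never trains on records that come after its test data."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

def shared_ids(train, test, id_key="customer_id"):
    """IDs that appear on both sides of the split. Whether this counts as
    leakage depends on the use case, but it is worth a look."""
    return {r[id_key] for r in train} & {r[id_key] for r in test}

# Hypothetical records
rows = [
    {"customer_id": 1, "event_date": date(2023, 1, 5), "label": 0},
    {"customer_id": 2, "event_date": date(2023, 2, 10), "label": 1},
    {"customer_id": 1, "event_date": date(2023, 3, 15), "label": 1},
    {"customer_id": 3, "event_date": date(2023, 4, 1), "label": 0},
]

train, test = time_based_split(rows, cutoff=date(2023, 3, 1))
print("train size:", len(train), "test size:", len(test))
print("IDs on both sides:", shared_ids(train, test))
```

The same idea applies to features: anything computed from data dated after the prediction point belongs on the "future" side and should not be available to the model.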
Examples of Leakage:
Creating leaky variables by accident
ICML Whale challenge: variables file size, timestamp, and clip order were leaky
KDD Breast Cancer Identification: ID was leaky
Invalidation of Dozens of Medical Studies: leaked data from test set into training
Predicting outcomes from interventions that happened after admission into hospital
Detecting early Alzheimer's while using the Clinical Dementia Rating as one of the features, even though it's a test given to people already suspected of early onset
Kaggle PetFinder: leakage used in cheating
Predicting likes from views of content
Including the completion date when forecasting whether a construction project will go over budget
Including the date of first payment in a loan repayment prediction model
Predicting churn and including time since last active, or number of times contacted by customer support, as a feature
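Many of the examples above share a symptom: a single feature predicts the target almost perfectly on its own. One rough way to screen for that is to score each feature individually against the label and flag anything suspiciously strong. The sketch below does this with a pairwise ROC AUC in plain Python; the function names, the 0.95 threshold, and the toy churn data are assumptions for illustration, and a flagged feature is a prompt to investigate, not proof of leakage.

```python
def auc(scores, labels):
    """ROC AUC computed as P(positive score > negative score), ties count half.
    Quadratic in the number of rows, so use it on a sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.95):
    """Return features whose single-feature AUC (in either direction)
    meets or exceeds `threshold`."""
    flagged = {}
    for name, values in columns.items():
        a = auc(values, labels)
        strength = max(a, 1 - a)  # a very low AUC is just as suspicious
        if strength >= threshold:
            flagged[name] = round(strength, 3)
    return flagged

# Hypothetical churn data: 1 = churned
labels = [1, 1, 1, 0, 0, 0, 1, 0]
columns = {
    "days_since_last_active": [40, 55, 61, 2, 1, 5, 47, 3],
    "tenure_months": [12, 3, 24, 10, 30, 5, 18, 7],
}

print(suspicious_features(columns, labels))
```

In this toy data, `days_since_last_active` separates churners from non-churners perfectly, which is the pattern you would expect if the feature was measured after the customer had already left.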
Part III: Recommended Next Steps
Depending on the likely cause, you may want to consider turning off the live model and exploring locally, on both an old and a fresher data extract, which of the potential reasons above explains the gap between live and offline performance.
Do you need an expert to help you figure out what happened? Please reach out.
Keywords: AI/ML in production, data products, customer understanding
You may also like:
More tips on leakage detection, by Abhay Pawar
What’s your ML test score? A rubric for ML production systems, by Google (2016)
Introducing the 5 Pillars of Data Observability, by Barr Moses
Past Performance is Not Indicative of Future Results, by Cory Doctorow