When high accuracy is not very accurate: advice for AI diligence

In a world where everyone pitches their “AI” start-ups and tries to prove their expertise in AI, the "accuracy" metric is often mentioned. But how accurate is this metric?

There are a few ways to (inadvertently) game the system, resulting in misleadingly high accuracy figures that overstate how well an ML/AI algorithm is actually doing.

Preliminaries: How do we measure accuracy in algorithms?

Suppose we have two possible outcomes we’re trying to predict: someone having an early onset of a condition – and not.

Accuracy Metric

Accuracy is the percentage of outcomes the algorithm predicted correctly (Wikipedia).

For example, suppose we had 10 patients who wanted to know if they had an early onset of a condition based on information from their smartwatches – without clinical input. If the algorithm's prediction matches the actual outcome for 9 of those 10 patients, its accuracy is 9/10 = 90%.
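To make that calculation concrete, here is a minimal sketch in Python (the patient labels and predictions below are made up purely for illustration):

```python
# Minimal accuracy calculation with made-up labels (1 = early onset, 0 = not).
from sklearn.metrics import accuracy_score

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # what actually happened
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # what the algorithm predicted

print(accuracy_score(y_true, y_pred))  # 9 of 10 correct -> 0.9
```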

Incorrectly Pitched as the Accuracy Metric: Area Under the ROC curve

Sometimes, though, people will (incorrectly) refer to the "Area Under the ROC curve" (AKA ROC-AUC) as a measure of accuracy. That may be because they want a colloquial way to talk about this metric – or because it tends to look better when there is a majority class. But it's not the same thing! So what is it?

Going back to our example of predicting the presence or absence of a condition: the algorithm typically returns a probability (between 0 and 1) that a patient has the condition, and we assumed that any probability greater than 50% means the patient has it. But what if we only predicted the condition when that probability was greater than 90%, because we want to be extra sure? Or, instead, set the cutoff at 25%, because we want people to get a second opinion and start treatment ASAP? This probability threshold changes who we predict will or won't have the condition – and therefore the percentage of outcomes the algorithm gets correct, i.e., the algorithm's accuracy!

If we choose, say, 11 different probability cutoffs between 0% and 100% (e.g., 0%, 10%, 20%, 30%, …, 90%, and 100%), make predictions for our set of patients at each cutoff, and plot the 11 resulting points – with the false positive rate on the x-axis and the true positive rate on the y-axis – we trace out the ROC curve. The area under that curve (a number between 0 and 1) is the ROC-AUC: it summarizes how well the algorithm ranks patients across all possible cutoffs, rather than its accuracy at any single cutoff.

(For more examples and visualizations of this metric, please see this blog post by Evidently AI.)
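To see the difference in code, here is a minimal sketch (with made-up probabilities): accuracy changes as the cutoff moves, while ROC-AUC is a single, threshold-free number.

```python
# Accuracy depends on the cutoff; ROC-AUC summarizes all cutoffs at once.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])                       # actual outcomes
y_prob = np.array([0.9, 0.4, 0.1, 0.3, 0.2, 0.6, 0.1, 0.8, 0.2, 0.3])   # predicted probabilities

for cutoff in [0.25, 0.50, 0.90]:
    y_pred = (y_prob >= cutoff).astype(int)
    print(f"cutoff={cutoff:.2f}  accuracy={accuracy_score(y_true, y_pred):.2f}")

print(f"ROC-AUC = {roc_auc_score(y_true, y_prob):.2f}")
```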

Ways to Have Misleadingly High Accuracy

Now that the definitions are out of the way, let’s talk about ways that an algorithm may have high accuracy when it’s actually not doing well – and why that is.

Scenario 1: Presenting ROC-AUC as "accuracy." 

Even if you see 90%+ "accuracy" – whether the pitch means accuracy in the strict sense or ROC-AUC – the following things may be happening under the hood to falsely inflate that metric.

Scenario 2: High accuracy when mispredicting the rare outcomes!
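When one outcome is rare, raw accuracy rewards a model for simply betting on the common outcome. A minimal sketch with a synthetic dataset where only about 5% of patients have the condition:

```python
# With a rare outcome, always predicting the majority class still looks "accurate."
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% of patients have the condition
y_pred = np.zeros_like(y_true)                  # "model" that always predicts no condition

print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")                  # ~0.95
print(f"recall on the rare outcome = {recall_score(y_true, y_pred):.2f}")  # 0.00
```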

Scenario 3: High accuracy due to data leakage!
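Data leakage means the model sees information at training time that it would not have at prediction time – for example, a feature that is effectively a copy of the label. A minimal sketch with synthetic data and a hypothetical "leaky" feature:

```python
# A feature that is essentially the label makes held-out accuracy look near-perfect.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.5).astype(int)
honest = rng.normal(size=(n, 1)) + 0.3 * y.reshape(-1, 1)       # weak, legitimate signal
leaky = y.reshape(-1, 1) + rng.normal(scale=0.01, size=(n, 1))  # essentially the label itself

for name, X in [("honest features only", honest),
                ("honest + leaky feature", np.hstack([honest, leaky]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: held-out accuracy = {acc:.2f}")
```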

Scenario 4: High accuracy due to seeing the data before!
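If records the model has already seen (or near-duplicates of them) end up in the evaluation set, the reported accuracy reflects memorization rather than prediction. A minimal sketch with synthetic, duplicated records:

```python
# Duplicated records that land on both sides of the split inflate "held-out" accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (rng.random(300) < 0.5).astype(int)   # labels are pure noise here

# Simulate duplicated records (e.g., repeated exports of the same patients).
X_dup, y_dup = np.vstack([X, X]), np.concatenate([y, y])

model = KNeighborsClassifier(n_neighbors=1)

# Clean split (no duplicates): accuracy on noise labels is ~chance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(f"clean split:        {model.fit(X_tr, y_tr).score(X_te, y_te):.2f}")

# Split after duplication: copies of the same record appear in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X_dup, y_dup, random_state=0)
print(f"contaminated split: {model.fit(X_tr, y_tr).score(X_te, y_te):.2f}")
```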

Scenario 5: High accuracy due to overfitting!
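Overfitting shows up as a large gap between accuracy on the training data and accuracy on genuinely held-out data. A minimal sketch with a synthetic, signal-free dataset:

```python
# An unconstrained tree memorizes the training set: perfect there, chance elsewhere.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 15))
y = (rng.random(400) < 0.5).astype(int)   # labels carry no real signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit

print(f"training accuracy: {tree.score(X_tr, y_tr):.2f}")  # 1.00 -- memorized
print(f"held-out accuracy: {tree.score(X_te, y_te):.2f}")  # ~0.50 -- chance
```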

Scenario 6: High accuracy due to multicollinearity!
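One quick diligence check for redundant (multicollinear) features is the variance inflation factor (VIF). A minimal sketch with made-up, smartwatch-style feature names, two of which are nearly copies of each other:

```python
# VIF check for redundant features (synthetic data; feature names are hypothetical).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
resting_hr = rng.normal(60, 8, n)                   # resting heart rate
steps = rng.normal(8000, 2000, n)                   # daily step count
avg_hr = 1.1 * resting_hr + rng.normal(0, 1, n)     # nearly a rescaled copy of resting_hr

X = sm.add_constant(np.column_stack([resting_hr, steps, avg_hr]))
for idx, name in zip([1, 2, 3], ["resting_hr", "steps", "avg_hr"]):
    # A VIF well above ~10 is a common rule of thumb for problematic redundancy.
    print(f"VIF({name}) = {variance_inflation_factor(X, idx):.1f}")
```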

How High is “Too High”?

Depending on the industry, the nature of the product, and the degree of imbalance between the (two) outcomes, consider further investigating any accuracy/ROC-AUC metrics of 80% and above!

Depending on your risk tolerance, you may want to lower this rule-of-thumb threshold to 75%, given how often pitch decks quote high metrics like this and how easily the risks above can hide behind the scenes!

Then, when any accuracy is too high – over 75%! – it becomes a potential yellow/red flag to dig into during diligence.

Questions to Ask in Diligence 

Here are some questions to ask during diligence when a high accuracy figure raises a potential yellow/red flag, to help you better understand whether that's the case.

Good luck! 

References