The ROC AUC score is a popular metric to evaluate the performance of binary classifiers. To compute it, you must measure the area under the ROC curve, which shows the classifier's performance at varying decision thresholds.
This chapter covers how to plot the ROC curve, compute the ROC AUC and interpret it. We will also showcase it using the open-source Evidently Python library.
Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
The ROC curve stands for the Receiver Operating Characteristic curve. It is a graphical representation of the performance of a binary classifier at different classification thresholds.
The curve plots the possible True Positive rates (TPR) against the False Positive rates (FPR).
Here is how the curve can look:
Each point on the curve represents a specific decision threshold with a corresponding True Positive rate and False Positive rate.
ROC AUC stands for Receiver Operating Characteristic Area Under the Curve.
ROC AUC score is a single number that summarizes the classifier's performance across all possible classification thresholds. To get the score, you must measure the area under the ROC curve.
ROC AUC score shows how well the classifier distinguishes positive and negative classes. It can take values from 0 to 1.
A higher ROC AUC indicates better performance. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.
To understand the ROC AUC metric, it helps to understand the ROC curve first.
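Let’s explain it step by step! We will cover the True Positive and False Positive rates, the classification threshold, how to plot the ROC curve, and how to compute and interpret the ROC AUC score.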
The ROC curve plots the True Positive rate (TPR) against the False Positive rate (FPR) at various classification thresholds. You can derive TPR and FPR from a confusion matrix.
A confusion matrix summarizes all correct and false predictions generated for a specific dataset. Here is an example of a matrix generated for a spam prediction use case:
You can calculate the True Positive and False Positive rates directly from the matrix.
TPR (True Positive rate, also known as recall) shows the share of detected true positives. For example, the share of emails correctly labeled as spam out of all spam emails in the dataset.
To compute the TPR, you must divide the number of True Positives by the total number of objects of the target class – both identified (True Positives) and missed (False Negatives).
In the example confusion matrix above, TPR = 600 / (600 + 300) = 0.67. The model successfully detected 67% of all spam emails.
FPR (False Positive rate) shows the share of objects falsely assigned a positive class out of all objects of the negative class. For example, the proportion of legitimate emails falsely labeled as spam.
You can calculate the FPR by dividing the number of False Positives by the total number of objects of the negative class in the dataset.
You can think of the FPR as a "false alarm rate."
In our example, FPR = 100 / (100 + 9000) = 0.01. The model falsely flagged 1% of legitimate emails as spam.
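The two calculations above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's API: the function names `true_positive_rate` and `false_positive_rate` are made up for this example, and the counts are the ones used in the spam example.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """TPR (recall): detected positives out of all actual positives."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR: false alarms out of all actual negatives."""
    return fp / (fp + tn)


# Counts from the spam example in the text.
tp, fn = 600, 300   # spam caught, spam missed
fp, tn = 100, 9000  # legitimate emails flagged as spam, legitimate emails passed

tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.01
```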
To create the ROC curve, you need to plot the TPR values against the FPR values at different decision thresholds.
You might ask, what do "different" TPR and FPR values mean? Did we not just calculate them once and for all?
In fact, we calculated the values for a given confusion matrix at a given decision threshold. But for a probabilistic classification model, these TPR and FPR values are not set in stone.
You can vary the decision threshold that defines how to convert the model predictions into labels. This, in turn, can change the number of errors the model makes.
A probabilistic classification model returns a number from 0 to 1 for each object. For example, for each email, it predicts how likely this email is spam. For a given email, it can be 0.1, 0.55, 0.99, or any other number.
You then have to decide at which probability you convert this prediction to a label. For instance, you can label all emails with a predicted probability of over 0.5 as spam. Or, you can only apply this decision when the score is 0.8 or higher.
This choice is what sets the classification threshold.
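Here is a small sketch of how a threshold turns predicted probabilities into labels. The helper `to_labels` is a hypothetical name, and treating a score exactly equal to the threshold as positive (`>=`) is an assumption: some tools use a strict `>` instead.

```python
def to_labels(probabilities, threshold):
    """Label an item positive (1) when its score is at or above the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Predicted spam probabilities for three emails (values from the text).
spam_probabilities = [0.1, 0.55, 0.99]

labels_at_05 = to_labels(spam_probabilities, 0.5)
labels_at_08 = to_labels(spam_probabilities, 0.8)

print(labels_at_05)  # [0, 1, 1]
print(labels_at_08)  # [0, 0, 1]
```

The email with a score of 0.55 is labeled as spam at the 0.5 threshold but not at 0.8: the same model output leads to a different decision.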
To better understand the impact of the decision threshold, explore the Classification Threshold chapter in the guide.
As you change the threshold, you will usually get new combinations of errors of different types (and new confusion matrices)!
When you set the threshold higher, you make the model "more conservative." It assigns the True label when it is "more confident." But as a consequence, you typically lower recall: you detect fewer examples of the target class overall.
When you set the threshold lower, you make the model "less strict." It assigns the True label more often, even when "less confident." Consequently, you increase recall: you will detect more examples of the target class. However, this may also lead to lower precision, as the model may make more False Positive predictions.
TPR and FPR change in the same direction. The higher the recall (TPR), the higher the rate of false positive errors (FPR). The lower the recall, the fewer false alarms the model gives.
In the example above, the recall (TPR) decreases as we set the decision threshold higher:
- 0.5 threshold: 800/(800+100)=0.89
- 0.8 threshold: 600/(600+300)=0.67
- 0.95 threshold: 200/(200+700)=0.22
The FPR also goes down:
- 0.5 threshold: 500/(500+8600)=0.055
- 0.8 threshold: 100/(100+9000)=0.01
- 0.95 threshold: 10/(10+9090)=0.001
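To double-check the arithmetic, you can recompute both rates from the confusion matrix counts at each threshold. This is only a sketch: the dictionary layout and the variable names are illustrative, while the counts are the ones listed above.

```python
# Confusion matrix counts at each decision threshold (from the example).
confusion_by_threshold = {
    0.5:  {"tp": 800, "fn": 100, "fp": 500, "tn": 8600},
    0.8:  {"tp": 600, "fn": 300, "fp": 100, "tn": 9000},
    0.95: {"tp": 200, "fn": 700, "fp": 10,  "tn": 9090},
}

roc_points = {}
for threshold, c in confusion_by_threshold.items():
    tpr = c["tp"] / (c["tp"] + c["fn"])
    fpr = c["fp"] / (c["fp"] + c["tn"])
    roc_points[threshold] = (fpr, tpr)  # (x, y) coordinates on the ROC curve
    print(f"threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.3f}")
```

Both rates fall together as the threshold rises, which is exactly the pattern the ROC curve captures.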
Now, let’s get back to the curve!
The ROC curve illustrates this trade-off between the TPR and FPR we just explored. Unless your model is near-perfect, you have to balance the two. As you try to increase the TPR (i.e., correctly identify more positive cases), the FPR may also increase (i.e., you get more false alarms).
For example, the more spam you want to detect, the more legitimate emails you falsely flag as suspicious.
The ROC curve is a visual representation of this choice. Each point on the curve corresponds to a combination of TPR and FPR values at a specific decision threshold.
To create the curve, you should plot the FPR values as the x-axis and the TPR values as the y-axis.
If we continue with the example above, here is how it can look.
Since our imaginary model does fairly well, most values are "crowded" to the left.
The left side of the curve corresponds to the more "confident" thresholds: a higher threshold leads to lower recall and fewer false positive errors. The extreme point is when both recall and FPR are 0. In this case, there are no correct detections but also no false ones.
The right side of the curve represents the "less strict" scenarios when the threshold is low. Both recall and False Positive rates are higher, ultimately reaching 100%. If you put the threshold at 0, the model will always predict the positive class: both the recall and the FPR will be 1.
When you increase the threshold, you move left on the curve. If you decrease the threshold, you move to the right.
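To make the "moving along the curve" idea concrete, here is a rough sketch that sweeps the threshold from the strictest to the loosest value and records one (FPR, TPR) point per step. The function name `compute_roc_points` and the toy labels and scores are invented for illustration, and a score equal to the threshold is assumed to count as positive.

```python
def compute_roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above the highest score, so that nothing is predicted positive.
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical true labels (1 = positive) and predicted scores.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]

points = compute_roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The first point is (0, 0), where the threshold is too high for any positive prediction, and the last is (1, 1), where everything is labeled positive.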
Now, let’s take a look at the perfect scenario.
If our model is correct in all the predictions, all the time, it means that it can reach a TPR of 1.0 while keeping the FPR at 0. It finds all the cases and never gives false alarms.
Here is how the ROC curve would look.
Now, let’s look at the worst-case scenario.
Let’s say our model is random. In other words, it cannot distinguish between the two classes, and its predictions are no better than chance.
A genuinely random model assigns scores that are unrelated to the true class: a positive example is no more likely to get a high score than a negative one.
The ROC curve, in this case, will look like a diagonal line connecting points (0,0) and (1,1). For a random classifier, the TPR is equal to the FPR because, at any threshold value, it flags the same share of actual positives and actual negatives as positive. As the classification threshold changes, the TPR goes up or down in the same proportion as the FPR.
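You can check this behavior with a quick simulation. The sketch below assumes scores drawn uniformly at random, independently of the labels; the sample size, the 30% positive rate and the seed are arbitrary choices made for this illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

n = 10_000
# Labels: roughly 30% positive. Scores: drawn independently of the labels.
y_true = [1 if random.random() < 0.3 else 0 for _ in range(n)]
y_score = [random.random() for _ in range(n)]

n_pos = sum(y_true)
n_neg = n - n_pos

rates = {}
for threshold in (0.2, 0.5, 0.8):
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    rates[threshold] = (tp / n_pos, fp / n_neg)
    print(f"threshold {threshold}: TPR = {tp / n_pos:.2f}, FPR = {fp / n_neg:.2f}")
```

At every threshold, the two rates come out nearly equal, so the points land close to the diagonal.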
Most real-world models will fall somewhere between the two extremes. The better the model can distinguish between positive and negative classes, the closer the curve is to the top left corner of the graph.
A ROC curve is a two-dimensional reflection of classifier performance across different thresholds. It is convenient to get a single metric to summarize it.
This is what the ROC AUC score does.
A ROC AUC score is a single metric to summarize the performance of a classifier across different thresholds. To compute the score, you must measure the area under the ROC curve.
There are different methods to calculate the ROC AUC score, but a common one is the trapezoidal rule. You sort the points on the ROC curve by FPR, connect each pair of neighboring points with a straight line, and drop vertical lines from the points to the x-axis. This splits the area under the curve into trapezoids. Then, you compute the ROC AUC by summing the areas of the trapezoids.
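As a rough sketch of the trapezoidal rule, the function below sums the trapezoid areas between neighboring ROC points. The name `trapezoid_auc` and the staircase example points are hypothetical; the function assumes the points are already sorted by increasing FPR and include (0, 0) and (1, 1).

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given as points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]             # step along the x-axis
        mean_height = (tpr[i] + tpr[i - 1]) / 2  # average of the two TPR values
        area += width * mean_height
    return area


# Perfect model: the curve goes straight up, then right.
print(trapezoid_auc([0, 0, 1], [0, 1, 1]))  # 1.0

# Random model: the diagonal.
print(trapezoid_auc([0, 1], [0, 1]))  # 0.5

# A hypothetical in-between "staircase" curve.
fpr = [0, 0, 0, 1/3, 1/3, 2/3, 1]
tpr = [0, 1/3, 2/3, 2/3, 1, 1, 1]
print(round(trapezoid_auc(fpr, tpr), 3))  # 0.889
```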
You can compute ROC AUC in Python using sklearn.
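For instance, a minimal sketch with scikit-learn might look like this. It uses `roc_curve`, `roc_auc_score` and `auc` from `sklearn.metrics`; the labels and scores are toy values invented for the example.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels (1 = positive) and predicted scores.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]

# Points of the ROC curve and the thresholds that produce them.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The score computed directly from labels and scores.
auc_score = roc_auc_score(y_true, y_score)

# The same area computed from the curve points with the trapezoidal rule.
area = auc(fpr, tpr)

print(f"ROC AUC = {auc_score:.3f}")  # ROC AUC = 0.889
```

Note that you pass the predicted scores or probabilities, not the thresholded labels: the metric needs the full ranking to sweep over thresholds.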
If we return to our extreme "perfect" and "random" example, computing the ROC AUC score is easy. In the perfect scenario, we measure the square area: ROC AUC is 1. In the random scenario, it is precisely half: ROC AUC is 0.5.
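The ROC AUC score can range from 0 to 1. A score of 0.5 indicates random guessing, and a score of 1 indicates perfect performance. A score below 0.5 means the model ranks negative instances above positive ones more often than not, i.e., it does worse than random guessing.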
A score slightly above 0.5 shows that a model has at least "some" (albeit small) predictive power. This is generally inadequate for any real applications.
As a rule of thumb, a ROC AUC score above 0.8 is considered good, while a score above 0.9 is considered great.
However, the usefulness of the model depends on the specific problem and use case. There is no standard. You should interpret the ROC AUC score in context, together with other classification quality metrics, such as accuracy, precision, or recall.
The intuition behind ROC AUC is that it measures how well a binary classifier can distinguish or separate between the positive and negative classes.
It reflects the probability that the model will correctly rank a randomly chosen positive instance higher than a random negative one.
For example, this is how the model predictions might look, arranged by the predicted output scores.
ROC AUC reflects the likelihood that a random positive (red) instance will be located to the right of a random negative (gray) instance.
It shows how well a model can produce good relative scores and generally assign higher probabilities to positive instances over negative ones.
In the above picture, the classifier is not perfect but "directionally correct." It ranks most negative instances lower than positive ones.
The ideal situation is to have all positive instances ranked higher than all negative instances, resulting in an AUC of 1.0.
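This ranking interpretation can be computed directly by comparing every positive instance with every negative one. The sketch below is an illustration, not an efficient implementation: the name `pairwise_auc` and the toy data are made up, and ties are counted as half a win, which is a common convention.

```python
def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # correctly ranked pair
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# "Directionally correct" model: one negative outranks one positive.
imperfect = pairwise_auc([0, 0, 1, 0, 1, 1], [0.1, 0.3, 0.35, 0.6, 0.8, 0.9])
print(round(imperfect, 3))  # 0.889

# Perfect ranking: every positive is above every negative.
perfect = pairwise_auc([0, 0, 0, 1, 1, 1], [0.1, 0.3, 0.35, 0.6, 0.8, 0.9])
print(perfect)  # 1.0
```

In the first case, 8 out of 9 positive-negative pairs are ordered correctly, so the score is 8/9.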
It’s worth noting that even a perfect ROC AUC does not mean the predictions are well-calibrated. A well-calibrated classifier produces predicted probabilities that reflect the actual probabilities of the events. Say, if it predicts that an event has a 70% chance of occurring, it should be correct about 70% of the time. ROC AUC is not a calibration measure.
ROC AUC score, instead, shows how well a model can produce relative scores that help discriminate between positive or negative instances.
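A small sketch can show why ROC AUC says nothing about calibration. Here, the scores are divided by 10: the ordering stays the same, so the AUC does not change, even though the predicted "probabilities" are now far too low. The helper `pairwise_auc` and the toy data are hypothetical, and comparing the mean score with the share of positives is only a crude stand-in for a real calibration check.

```python
def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 0, 1, 1]
original = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
squashed = [s / 10 for s in original]  # same order, much smaller values

auc_original = pairwise_auc(y_true, original)
auc_squashed = pairwise_auc(y_true, squashed)

positive_rate = sum(y_true) / len(y_true)    # actual share of positives
mean_original = sum(original) / len(original)
mean_squashed = sum(squashed) / len(squashed)

print(f"AUC: {auc_original:.3f} vs {auc_squashed:.3f}")
print(f"share of positives: {positive_rate:.2f}")
print(f"mean score: {mean_original:.2f} vs {mean_squashed:.2f}")
```

Both versions rank the examples identically, but only the original scores are in the same ballpark as the actual share of positives.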
Let’s sum up the important properties of the metric.
Here are some advantages of the ROC AUC score.
The metric also has a few downsides. As usual, a lot depends on the context!
Want to see an example of using ROC AUC? We prepared a tutorial on the employee churn prediction problem "What is your model hiding?". You will train two classification models with similar ROC AUC and explore how to compare them.
Considering all the above, ROC AUC is useful, but as usual, not a perfect metric.
However, there are limitations:
You can use ROC AUC during production model monitoring as long as you have the true labels to compute it.
However, a high ROC AUC score does not communicate all relevant aspects of the model quality. The score evaluates the degree of separability and does not consider the asymmetric costs of false positives and negatives. It captures, in one number, the quality of the model across all possible thresholds.
In many real-world scenarios, this overall performance is not relevant: you need to consider the costs of error and define a specific threshold to make automated decisions. Therefore, the ROC AUC score should be used with other metrics, such as precision and recall. You might also want to monitor precision and recall for specific important segments in your data (such as users in specific locations, premium users, etc.) to capture differences in performance.
However, having ROC AUC as an additional metric might still be informative. For example, in cases where a shift in the balance of classes might negatively impact precision, tracking ROC AUC might communicate whether the model itself remains reasonable.
To quickly calculate and visualize the ROC curve and ROC AUC score, as well as other metrics and plots to evaluate the quality of a classification model, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.
You will need to prepare a dataset that includes the true labels and the predicted values for each class, and pass it to the tool. You will instantly get an interactive report that includes ROC AUC, accuracy, precision, recall, and F1-score metrics, as well as other visualizations. You can also integrate these model quality checks into your production pipelines.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.