This article is a part of the Classification Metrics Guide.
A confusion matrix is easily the most popular method of visualizing the quality of classification models. You can also derive several other relevant metrics from it.
We will show how to build a confusion matrix using the open-source Evidently Python library.
Let’s dive in!
Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
A confusion matrix is a table that summarizes the performance of a classification model by comparing its predicted labels to the true labels. It displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of the model's predictions.
Here's an example of a confusion matrix:
Let's break it down step by step!
First, a quick reminder about the problem behind the matrix.
A classification model is a machine learning model that assigns predefined categories or classes (labels) to new input data. If you have only two classes, the classification is binary. If you have more, it is a multi-class problem.
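Here are some examples of binary classification problems: detecting whether a transaction is fraudulent, predicting whether a user will leave the service, and predicting whether a customer will make a purchase.
In all these examples, there are two distinct classes that the classification model needs to predict: the event either happens or it does not.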
To create a confusion matrix, you first need to generate the model predictions for the input data and then get the actual labels.
This way, you can judge the correctness of each model prediction. Was a transaction truly fraudulent? Did the user leave the service? Did a customer make the purchase?
Once you know the actual classes, you can count the number of times the model was right or wrong.
To make it more specific, you can also count the different types of errors.
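Let’s consider an example of a payment fraud detection model. There are two types of errors the model can make: it can flag a legitimate transaction as fraudulent, or it can miss a fraudulent transaction and label it as legitimate.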
The first type of error is called a false positive. The second is called a false negative.
The words “positive” and “negative” refer to the target and non-target classes. In this example, fraud is our target. We refer to transactions flagged as fraudulent as “positives.”
The distinction between false positives and negatives is important because the consequences of errors are different. You might consider one error less or more harmful than the other.
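When it comes to correct predictions, you also have two different things to be correct about: the model can correctly flag a fraudulent transaction, or it can correctly recognize a legitimate one.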
The first type is a true positive. The second type is a true negative.
Understanding the different types of “correctness” is also valuable. You are likely more interested in how well the model can identify fraudulent transactions rather than how often the model is right overall.
All in all, for every model prediction, you get one of 4 possible outcomes: a true positive, a true negative, a false positive, or a false negative.
A confusion matrix helps visualize the frequency of each of them in a single place. This way, you can grasp the number of correct predictions and errors of each type simultaneously.
To create the matrix, you simply need to draw a table. For binary classification, it is a 2x2 table with two rows and two columns.
Rows typically show the actual classes, and columns show the predicted classes.
Then, you populate the matrix with the numbers of true and false predictions on a given dataset, calculated as shown above.
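The counting step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper `count_outcomes`, the label names, and the toy data are all hypothetical.

```python
from collections import Counter


def count_outcomes(actual, predicted, positive="fraud"):
    """Count TP, TN, FP, and FN for a binary problem (hypothetical helper)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # The model predicted the target class: right (TP) or wrong (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The model predicted the non-target class: wrong (FN) or right (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Made-up labels, for illustration only.
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]

c = count_outcomes(actual, predicted)

# Rows are actual classes, columns are predicted classes.
matrix = [
    [c["TP"], c["FN"]],  # actual positive
    [c["FP"], c["TN"]],  # actual negative
]
print(matrix)
```

Placing the positive class in the first row and column is a choice made for this sketch; other tools may order the classes differently.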
Let’s look at a specific example!
To recap, we will go backward: look at an example of a pre-built confusion matrix and explain how to reach each element.
Let’s say we have an email spam classification model. It is a binary classification problem. The two possible classes are “spam” and “not spam.”
After training the model, we generated predictions for 10000 emails in the validation dataset. We already know the actual labels and can evaluate the quality of the model predictions.
Here is how the resulting matrix can look:
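True Positive (TP): 600 spam emails that the model correctly labeled as spam.
True Negative (TN): 9000 regular emails that the model correctly labeled as not spam.
False Positive (FP): 100 regular emails that the model incorrectly labeled as spam.
False Negative (FN): 300 spam emails that the model missed and labeled as not spam.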
Want to see a real example with data and code? Here is a tutorial on the employee churn prediction problem “What is your model hiding?”. You will train two different classification models and explore how to evaluate each model’s quality and compare them.
The confusion matrix shows the absolute number of correct and false predictions. It is convenient when you want to get a sense of scale (“How many emails did we falsely send to a spam folder this week?”).
However, it is not always practical to use absolute numbers. To compare models or track their performance over time, you also need some relative metrics. The good news is you can derive such quality metrics directly from the confusion matrix.
Here are some commonly used metrics to measure the performance of the classification model.
Accuracy is the share of correctly classified objects in the total number of objects. In other words, it shows how often the model is right overall.
You can calculate accuracy by dividing all true predictions by the total number of predictions. Accuracy is a valuable metric with an intuitive explanation.
In our example above, accuracy is (9000+600)/10000 = 0.96. The model was correct in 96% of cases.
However, accuracy can be misleading for imbalanced datasets when one class has significantly more samples. In our example, we have many non-spam emails: 9100 out of 10000 are regular emails. The overall model “correctness” is heavily skewed to reflect how well the model can identify those non-spam emails. The accuracy number is not very informative if you are interested in catching spam.
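To make the imbalance issue concrete, here is a small sketch that computes accuracy for the spam example and for a hypothetical baseline that labels every email as “not spam.” The `accuracy` helper and the baseline are assumptions added for illustration; the counts come from the example above.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Counts from the spam example in the text.
model_acc = accuracy(tp=600, tn=9000, fp=100, fn=300)

# Hypothetical baseline that never predicts "spam":
# all 9100 regular emails become true negatives,
# all 900 spam emails become false negatives.
baseline_acc = accuracy(tp=0, tn=9100, fp=0, fn=900)

print(f"model: {model_acc:.2f}, baseline: {baseline_acc:.2f}")
# prints "model: 0.96, baseline: 0.91"
```

The baseline catches no spam at all, yet its accuracy is only five percentage points lower than the model's.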
Precision is the share of true positive predictions in all positive predictions. In other words, it shows how often the model is right when it predicts the target class.
You can calculate precision by dividing the correctly identified positives by the total number of positive predictions made by the model.
In our example above, precision is 600/(600+100) = 0.86. When predicting “spam,” the model was correct in 86% of cases.
Precision is a good metric when the cost of false positives is high. If you prefer to avoid sending good emails to spam folders, you might want to focus primarily on precision.
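The same calculation as a short sketch. The `precision` helper is hypothetical, and returning 0.0 when the model makes no positive predictions is one possible convention, not the only one.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    predicted_positive = tp + fp
    # The ratio is undefined when nothing was predicted positive;
    # 0.0 is an assumed fallback for this sketch.
    return tp / predicted_positive if predicted_positive else 0.0


# Counts from the spam example: 600 true positives, 100 false positives.
spam_precision = precision(tp=600, fp=100)
print(round(spam_precision, 2))  # 0.86
```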
Recall, or true positive rate (TPR), is the share of true positive predictions out of all positive samples in the dataset. In other words, recall shows how many instances of the target class the model can find.
You can calculate the recall by dividing the number of true positives by the total number of positive cases.
In our example above, recall is 600/(600+300)= 0.67. The model correctly found 67% of spam emails. The other 33% made their way to the inbox unlabeled.
Recall is a helpful metric when the cost of false negatives is high. For example, you can optimize for recall if you do not want to miss any spam (even at the expense of falsely flagging some legitimate emails).
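Recall can be sketched the same way. The `recall` helper is hypothetical, and the 0.0 fallback for a dataset with no positive samples is an assumption.

```python
def recall(tp, fn):
    """Share of actual positives that the model found."""
    actual_positive = tp + fn
    # Undefined when the dataset has no positive samples;
    # 0.0 is an assumed fallback for this sketch.
    return tp / actual_positive if actual_positive else 0.0


# Counts from the spam example: 600 true positives, 300 false negatives.
spam_recall = recall(tp=600, fn=300)
missed_share = 1 - spam_recall  # spam that reached the inbox

print(round(spam_recall, 2), round(missed_share, 2))  # 0.67 0.33
```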
To better understand how to strike a balance between metrics, read a separate chapter about Accuracy, Precision, and Recall and how to set a custom decision threshold. You can also read about other classification metrics, such as F1-Score and ROC AUC.
You can use a confusion matrix in multi-class classification problems, too. In this case, the matrix will have more than two rows and columns. Their number depends on the number of labels the model is tasked to predict.
Otherwise, it follows the same logic. Each row represents the instances in the actual class, and each column represents the instances in a predicted class. Rinse and repeat as many times as you need.
Let’s say you are classifying reviews that users leave on the website into 3 groups: “negative,” “positive,” and “neutral.” Here is an example of a confusion matrix for a problem with 3 classes:
In this confusion matrix, each row represents the actual review label, while each column represents the predicted review label.
Conveniently, the diagonal cells show correctly classified samples, so you can take them in at a glance. The off-diagonal cells show model errors.
Here is how you read the matrix:
You can read more about how to calculate Accuracy, Precision, and Recall for multi-class classification in a separate chapter.
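As a rough sketch of the multi-class case, the snippet below builds a 3x3 matrix for the review example in plain Python. The data is made up for illustration and does not reproduce the matrix discussed above.

```python
labels = ["negative", "neutral", "positive"]

# Made-up review labels, for illustration only.
actual = ["negative", "negative", "neutral", "neutral",
          "positive", "positive", "positive", "neutral"]
predicted = ["negative", "neutral", "neutral", "positive",
             "positive", "positive", "negative", "neutral"]

index = {label: i for i, label in enumerate(labels)}

# Rows are actual classes, columns are predicted classes.
matrix = [[0] * len(labels) for _ in labels]
for a, p in zip(actual, predicted):
    matrix[index[a]][index[p]] += 1

# Diagonal cells hold correct predictions; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(labels)))
errors = len(actual) - correct

for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")
print(f"correct: {correct}, errors: {errors}")
```

Reading across a row shows where the samples of one actual class ended up. In this toy data, one of the two actual “negative” reviews was predicted as “neutral.”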
A confusion matrix is typically used in post-training model evaluation. You can also use it in the assessment of production model quality.
In this case, you can generate two side-by-side matrices to compare the latest model quality with some reference period: say, past month, past week, or model validation period.
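One way to sketch such a comparison is to build a matrix for each period and look at the cell-by-cell difference. The `build_matrix` helper and both label sets below are hypothetical.

```python
def build_matrix(actual, predicted, positive=1):
    """Return [[TP, FN], [FP, TN]]: rows are actual, columns are predicted."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return [[tp, fn], [fp, tn]]


# Made-up labels for a reference period and the latest period.
reference = build_matrix(actual=[1, 1, 0, 0, 0, 0], predicted=[1, 1, 0, 0, 0, 1])
current = build_matrix(actual=[1, 1, 0, 0, 0, 0], predicted=[1, 0, 0, 0, 1, 1])

# Positive values mean the cell grew compared to the reference period.
delta = [
    [cur - ref for ref, cur in zip(ref_row, cur_row)]
    for ref_row, cur_row in zip(reference, current)
]

print("reference:", reference)
print("current:  ", current)
print("delta:    ", delta)
```

If the two periods differ in size, comparing shares of the total rather than raw counts would be the fairer view.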
The main limitation of using the confusion matrix in production model evaluation is that you must get the true labels on every model prediction. This might be possible, for example, when subject matter experts (e.g., payment disputes team) review the model predictions after some time. However, often you only get feedback on some of the predictions or receive only partial labels.
Depending on your exact ML product, it might be more convenient to dynamically monitor specific metrics, such as precision. For example, in cases like payment fraud detection, you are more likely to send suspicious transactions for manual review and receive the true label quickly. This way, you can get the data for some of the confusion matrix's components faster than others.
Separately, it might also be useful to monitor the absolute number of positive and negative labels predicted by the model and the distribution drift in the model predictions. Even before you receive the feedback, you can detect a deviation in the model predictions (prediction drift): such as when a model starts to predict “fraud” more often. This might signal an important change in the model environment.
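A very simple version of this check compares the share of “fraud” predictions between two periods. Everything in this sketch is an assumption: the `positive_share` helper, the made-up predictions, and the 0.05 alert threshold. It is a crude heuristic, not a statistical drift test.

```python
def positive_share(predictions, positive="fraud"):
    """Share of predictions that belong to the target class."""
    if not predictions:
        return 0.0
    return sum(p == positive for p in predictions) / len(predictions)


# Made-up model outputs; no true labels are needed for this check.
reference_preds = ["ok"] * 95 + ["fraud"] * 5
current_preds = ["ok"] * 88 + ["fraud"] * 12

ref_share = positive_share(reference_preds)
cur_share = positive_share(current_preds)

# Hypothetical alert rule: flag a shift of more than 5 percentage points.
drift_detected = abs(cur_share - ref_share) > 0.05

print(f"reference: {ref_share:.2f}, current: {cur_share:.2f}, drift: {drift_detected}")
```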
If you want to generate a confusion matrix for your data, you can easily do this with tools like sklearn.
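For instance, a minimal sketch with scikit-learn's `confusion_matrix` function might look like this. The labels are made up, and the snippet assumes scikit-learn is installed.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = ["spam", "not spam", "not spam", "spam", "not spam", "spam"]
y_pred = ["spam", "not spam", "spam", "not spam", "not spam", "spam"]

# Rows are actual classes, columns are predicted classes,
# in the order given by `labels` (negative class first here).
cm = confusion_matrix(y_true, y_pred, labels=["not spam", "spam"])

# With this label order, the flattened 2x2 matrix reads TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Passing `labels` explicitly fixes the row and column order, which keeps the positions of TP and TN unambiguous.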
To get a complete classification quality report for your model, you can use Evidently, an open-source Python library that helps evaluate, test, and monitor ML models in production.
You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that includes a confusion matrix, accuracy, precision, recall metrics, ROC AUC score and other visualizations. You can also integrate these model quality checks into your production pipelines.
Evidently allows calculating various additional Reports and Test Suites for model and data quality. To start, check out Evidently on GitHub and go through the Getting Started Tutorial.