📚 LLM-as-a-Judge: a Complete Guide on Using LLMs for Evaluations. Get your copy

Classification metrics guide

Accuracy, precision, and recall in multi-class classification

Last updated:

January 9, 2025

contents‍

This article is a part of the Classification Quality Metrics guide.

There are different ways to calculate accuracy, precision, and recall for multi-class classification. You can calculate metrics by each class or use macro- or micro-averaging. This chapter explains the difference between the options and how they behave in important corner cases.

We will also show how to calculate accuracy, precision, and recall using the open-source Evidently Python library.

Before getting started, make sure you're familiar with how accuracy, precision, and recall work in binary classification. If you need a refresher, there is a separate chapter in the guide.

TL;DR

You can use accuracy, precision, and recall in multi-class classification.
Accuracy works the same way as in binary classification, but there are different ways to calculate precision and recall.
In a simple case, you can calculate precision and recall by class.
If you have a lot of classes, you can also macro- or micro-average precision and recall. Macro-averaging gives equal weight to each class, while micro-averaging gives equal weight to each instance.
When each data point is assigned a single class, micro-averaged precision and recall are the same and identical to accuracy.

Evidently Classification Performance Report

Start with AI observability

Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.

Start free ⟶Or try open source ⟶

Start with AI observability

Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.

Start free ⟶Or try open source ⟶

Binary vs. multi-class classification

Multi-class classification is a machine learning task that assigns the objects in the input data to one of several predefined categories.

In binary classification, you deal with two possible classes. For example, "spam" or "not spam" emails or "fraudulent" or "non-fraudulent" transactions.

In multi-class, you have multiple categories to choose from. Some example use cases:

Product categorization. You might want to automatically assign each product on an e-commerce website to a specific category, such as "clothing," "electronics," or "home and garden."
User segmentation. You might classify software product users into multiple known categories based on their usage patterns and upsell potential, such as "small business owners," "freelancers," and "enterprise."‍
Classification of support tickets. When users submit support requests, you can automatically tag tickets by the topic, such as "technical issue," "billing inquiry," or "product feedback."

Having multiple classes brings more complexity to evaluating the model quality. This is what this chapter is about!

‍Multi-class vs. multi-label. In this chapter, we focus on multi-class classification. It is different from multi-label. In multi-class classification, there are multiple classes, but each object belongs to a single class. In multi-label classification, each object might belong to multiple categories simultaneously. The evaluation then works differently.

Accuracy in multi-class

Accuracy is a popular performance metric in classification problems. The good news is that you can directly borrow the metric from binary classification and calculate it for multi-class in the same way.

Accuracy measures the proportion of correctly classified cases from the total number of objects in the dataset. To compute the metric, divide the number of correct predictions by the total number of predictions made by the model.

Accuracy formula for multi-class classification

Visual example

Let’s consider that we have a problem with 4 classes. Here is the distribution of the true labels (actual classes) on the validation dataset.

After we trained our model and generated the predictions for the validation dataset, we can evaluate the model quality. Here is the result we received:

Now, this colorful example might be mildly confusing because it shows all the model predictions and the actual labels. However, to calculate accuracy, we only need to know which predictions were correct and which were not. Accuracy is “blind” to specific classes.

We can simplify the illustration:

To calculate accuracy, divide all correct predictions by the total number of predictions.

In our case, the accuracy is 37/45 = 82%.

Pros and cons

Accuracy is straightforward to interpret. Did you make a model that classifies 90 out of 100 samples correctly? The accuracy is 90%! Did it classify 87? 87%!

However, accuracy has its downsides. While it does provide an estimate of the overall model quality, it disregards class balance and the cost of different errors.

Just like with binary classification, in multi-class, some classes might be more prevalent. It might be easier for the model to classify them – at the cost of minority classes. In this case, high accuracy can be confusing.

Say, you are dealing with manufacturing defect prediction. For every new product on a manufacturing line, you assign one of the categories: “no defect,” “minor defect,” “major defect,” or “scrap.” You are most interested in finding defects: the goal is to proactively inspect and take faulty products off the line.

The model might be mostly correct in assigning the “no defect” and “scrap” labels but be unreasonable when predicting actual defects. Meanwhile, the accuracy metric might be high due to the successful performance of the majority classes.

In this case, accuracy might not be a suitable metric.

In our visual example, the model did not do a very good job of predicting Class "B." However, since there were only 5 instances of this class, it did not impact the accuracy dramatically.

Accuracy for multi-class classification by class

To better understand the performance of the classifier, you need to look at other metrics like precision and recall. They can provide more detailed information about the types of errors the classifier makes for each class.

Precision and recall in multi-class

Precision and recall metrics are also not limited to binary classification. You can use them in multi-class classification problems as well.

However, there are different approaches to how to do that, each with its pros and cons.

In the first case, you can calculate the precision and recall for each class individually. This way, you get multiple metrics based on the number of classes you have in the dataset.
In the second case, you can "average" precision and recall across all the classes to get a single number. You can use different methods to average the precision, such as macro- or micro-averaging.

Let’s look at both approaches, starting with calculating the metrics by class.

Precision and recall by class

The most intuitive way is to calculate the precision and recall by class. It follows the same logic as in binary classification.

The only difference is that when computing the recall and precision in binary classification, you focus on a single positive (target) class. With multi-class, there are many classes to predict. To overcome this, you can calculate precision and recall for each class in the dataset individually, each time treating this specific class as "positive." Accordingly, you will treat all other classes as a single "negative" class.

Let's come up with definitions!

Precision for a given class in multi-class classification is the fraction of instances correctly classified as belonging to a specific class out of all instances the model predicted to belong to that class.

In other words, precision measures the model's ability to identify instances of a particular class correctly.

Recall in multi-class classification is the fraction of instances in a class that the model correctly classified out of all instances in that class.

In other words, recall measures the model's ability to identify all instances of a particular class.

Visual example

Let’s stick to the same example. Here is the reminder on how the model predictions look:

Precision and recall for multi-class classification

Say we want to calculate the precision and recall for Class “A.”

To calculate the recall, we divide the number of correct predictions of Class “A” by the total number of Class “A” objects in the dataset (both identified and not).

To calculate the precision, we divide the number of correct predictions of Class “A” by the total number of Class “A” predictions (true and false).

How to calculate precision and recall for multi-class classification

We can see that for Class “A,” the model is not doing badly.

The recall is 13/15=87%. The model correctly identified 13 out of 15 class “A” objects.
The precision is somewhat worse: 13/18=72%. The model predicted that a certain object belongs to class “A” 18 times but was correct only in 13 of them.

Now, let’s look at Class “B.” The results are much worse.

The recall in Class “B” is only 1/5 = 20%. The model correctly identified only 1 example out of 5.
The precision in Class “B” is 1/3=33%. The model predicted Class “B” 3 times but was correct only once.

For other classes, we follow a similar approach. We’ll skip the visuals, but here are the final results for all 4 classes:

Class “A” recall: 13/15=87%. Class “A” precision: 13/18=72%.
Class “B” recall: 1/5=20%. Class “B” precision: 1/3=33%.
Class “C” recall: 9/10=90%. Class “C” precision: 9/10=90%.
Class “D” recall: 14/15=93%. Class “C” precision: 9/9=100%.

Precision and recall calculation by class

Pros and cons

When is calculating precision and recall by class a good idea?

When you have specific classes of interest. Calculating the metrics by category is useful when you want to evaluate the performance of a particular class (or classes) and to know how well the classifier can distinguish this class from the others.
It can also be helpful when you deal with imbalanced classes. Calculating precision and recall for the minority class (or classes) will clearly expose any issues.
When you have a small number of classes. In this case, you can judge each metric individually and grasp them together. It is often the simplest solution.

However, there is a downside:

Calculating precision and recall for each class can result in a large number of performance metrics. The more classes, the more metrics you get. They can be challenging to interpret and grasp simultaneously.

When you have a lot of classes, you might prefer to use macro or micro averages. They provide a more concise summary of the performance.

Averaging precision and recall

The idea is simple: instead of having those many metrics for every class, let’s reduce it to one “average” metric. However, there are differences in how you can implement it. The two popular approaches are macro- and micro-averaging.

Macro-averaging

Here is how you compute macro-averaged precision and recall:

Calculate the number of true positives (TP), false positives (FP), and false negatives (FN) for each class.
Compute precision and recall for each class as TP / (TP + FP) and TP / (TP + FN).
Average the precision and recall across all classes to get the final macro-averaged precision and recall scores.

Here are the formulas to average precision and recall across all classes:

Macro-averaging for precision and recall

where N is the total number of classes, and Precision1, Precision2, ..., PrecisionN and Recall1, Recall2, ..., RecallN are the precision and recall values for each class.

In short, first, you measure the metric by class, then you average it across classes.

Micro-averaging

As an alternative, you can calculate micro-average precision and recall.

In this case, you must first calculate the total number of true positives (TP), false positives (FP), and false negatives (FN) predictions across all classes:

Total True Positive is the sum of true positive counts across all classes;
Total False Positive is the sum of false positive counts across all classes;
Total False Negative is the sum of false negative counts across all classes.

Then, calculate the precision and recall using these total counts.

To calculate the precision, divide the total True Positives by the sum of total True Positives and False Positives.
To calculate the recall, divide the total True Positives by the sum of total True Positives and False Negatives.

The formulas for micro-average precision and recall are:

Micro-averaging for precision and recall

In short, you sum up all TP, FP, and FN predictions across classes and calculate precision and recall jointly.

Now, if you look at the last two formulas closely, you will see that micro-average precision and micro-average recall will arrive at the same number.

The reason is every False Positive for one class is a False Negative for another class. For example, if you misclassify Class “A” as Class “B,” it will be a False Negative for Class “A” (a missed instance) but a False Positive for Class “B” (incorrectly assigned as Class “B”).

Thus, the total number of False Negatives and False Positives in the multi-class dataset will be the same. (It would work differently for multi-label!).

Macro vs. micro-average

There is a principal difference between macro and micro-averaging in how they aggregate performance metrics.

Macro-averaging calculates each class's performance metric (e.g., precision, recall) and then takes the arithmetic mean across all classes. So, the macro-average gives equal weight to each class, regardless of the number of instances.

Micro-averaging, on the other hand, aggregates the counts of true positives, false positives, and false negatives across all classes and then calculates the performance metric based on the total counts. So, the micro-average gives equal weight to each instance, regardless of the class label and the number of cases in the class.

Visual example - Macro

To illustrate this difference, let’s return to our example. We have 45 instances and 4 classes. The number of instances in each class is as follows:

Class “A”: 15 instances
Class “B”: 5 instances
Class “C”: 10 instances
Class “D”: 15 instances

Predicted labels for multi-class classification

We already estimated the recall and precision by class, so it will be easy to compute macro-average precision and recall. We sum them up and divide them by the number of classes.

Using macro-averaging, the average precision and recall across all classes would be:

Macro-average precision = (0.87 + 0.33 + 0.9 + 0.93) / 4 = 0.76
Macro-average recall = (0.72 + 0.2 + 0.9 + 1) / 4 = 0.71

Each class equally contributes to the final quality metric.

Macro-average precision and recall calculation

Visual example - Micro

Now, let’s look at micro-averaging. In this case, you need first to calculate the total counts of true positives, false positives, and false negatives across all classes. Then, you compute precision and recall using the total counts.

We already claimed that precision and recall would be the same in this case. Let’s visually demonstrate it.

Let’s start at the same point and follow the formulas. Here are the model predictions:

We first need to calculate the True Positives across each class. Since we arranged the predictions by the actual class, it is easy to count them visually.

The model correctly classified 13 samples of Class “A,” 1 sample of Class “B,” 9 samples of Class “C,” and 14 samples of Class “D.” The total number of True Positives is 37.

To calculate the recall, we also need to look at the total False Negatives number. To count them visually, we need to look at the “missed instances” that belong to each class but were missed by the model.

Here is how they are split across classes: the model missed 2 instanced of class “A,” 4 instances of class “B,” and 1 instance of class “C” and “D” each. The total number of False Negatives is 8.

Now, what about False Positives? A false positive is an instance of incorrect classification. The model said it was “B,” but was wrong? This is a False Positive for Class “B.”

This is a bit more complex to grasp visually, since we need to look at the color-coded predicted labels, and the errors are spread across classes.

However, what’s important is that we look at the same erroneous predictions as before! Each class’s False Positive is another class’s False Negative.

They are distributed differently: for example, our model often erroneously assigned class “A” but never class “D.” But the total number of False Negatives and False Positives is the same: 8.

As a result, both micro-average precision and recall are the same: 0.82. Here is how to compute it:

Micro-average precision and recall calculation

What’s even more interesting, this number is the same as accuracy. What we just did was divide the number of correct predictions by the total number of the (right and wrong) predictions. This is the accuracy formula!

For multi-class classification, micro-average precision equals micro-average recall and equals accuracy.

To sum up, how did micro- and macro-averaging work out for our examples? The results were different:

Macro-average precision is 76%.
Macro-average recall is 71%.
Micro-average precision and recall are both 82%.

Macro-averaging results in a “worse” outcome since it gives equal weight to each class. 1 out of 4 classes in our example has very low performance. This significantly impacts the score since it constitutes 25% of the final evaluation.

Micro-averaging leads to a “better” metric. It gives equal weight to each instance, and the number of objects in the worse-performing class is low. It only has 5 examples out of 45 total. In this case, their contribution to the overall score was lower.

Pros and cons

A suitable metric depends on the specific problem and the importance of each class or instance.

Macro-averaging treats each class equally.

It can be useful when all classes are equally important, and you want to know how well the classifier performs on average across them.
It is also useful when you have an imbalanced dataset and want to ensure each class equally contributes to the final evaluation.

However, macro averaging can also distort the perception of performance.

For example, it can make the classifier look “worse” due to low performance in an unimportant and small class since it will still contribute equally to the overall score.
In the opposite scenario, it can disguise poor performance in the critical minority class when the overall number of classes is large. In this case, the “contribution” of each class is diluted. The classifier may still achieve high macro-averaged precision and recall by performing well on the majority classes but poorly on the minority class.

If classes have unequal importance, measuring precision and recall by class or weighing them by importance might be helpful.

Micro-averaging can be more appropriate when you want to account for the total number of misclassifications in the dataset. It gives equal weight to each instance and will have a higher score when the overall number of errors is low. (If this sounds like accuracy, it is because it is!)

However, micro-averaging can also overemphasize the performance of the majority class, especially when it dominates the dataset. In this case, micro-averaging can lead to inflated performance scores when the classifier performs well on the majority class but poorly (or very poorly) on the minority classes. If the class is small, you might not notice!

As a result, there is no single best metric. To choose the most suitable one, you need to consider the number of classes, their balance, and their relative importance.

Weighted averaging

Here is extra: in some scenarios, it might be appropriate to use weighted averaging. This approach takes into account the balance of classes. You weigh each class based on its representation in the dataset. Then, you compute precision and recall as a weighted average of the precision and recall in individual classes.

Simply put, it would work like macro-averaging, but instead of dividing precision and recall by the number of classes, you give each class a fair representation based on the proportion it takes in the dataset.

This approach is useful if you have an imbalanced dataset but want to assign larger importance to classes with more examples.

Recap

You have different options when calculating quality metrics in multi-class classification.

Calculating precision and recall by class is useful when you want to evaluate the performance of a classifier for a specific class of interest or when dealing with imbalanced classes, but it can result in a large number of performance metrics.
When you have a large number of classes or want a more concise summary of overall performance, using macro or micro averages can be a better option.
Macro-averaging shows average performance across classes, treating each class as equally important.
Micro-averaging gives equal weight to every instance and shows average performance across all predictions. In the case of multi-class classification, micro-averaged precision, recall, and accuracy are the same.
You might also consider using weighted averaging.
You might prefer one metric over another depending on the class balance and their relative importance.

Accuracy, precision, and recall in Python

To quickly calculate and visualize accuracy, precision, and recall for your machine learning models, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.

You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that shows accuracy, precision, recall, ROC curve and other visualizations of the model’s quality. You can also integrate these model quality checks into your production pipelines.

Multi-class classification metrics in Python using the Evidently library

Evidently allows calculating various additional Reports and Test Suites for model and data quality. Check out Evidently on GitHub and go through the Getting Started Tutorial.

[fs-toc-omit]Get started with AI observability

Try our open-source library with over 25 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.

Sign up free ⟶

Or try open source ⟶

Start testing your AI systems today

Get demo

Try open source

Start testing your AI systems today

Book a personalized 1:1 demo with our team or sign up for a free account.

Get demo

No credit card required

Classification metrics guide

Accuracy, precision, and recall in multi-class classification

TL;DR

Binary vs. multi-class classification

Accuracy in multi-class

Visual example

Pros and cons

Precision and recall in multi-class

Precision and recall by class

Visual example

Pros and cons

Averaging precision and recall

Macro-averaging

Micro-averaging

Macro vs. micro-average

Visual example - Macro

Visual example - Micro

Pros and cons

Weighted averaging

Recap

Accuracy, precision, and recall in Python

[fs-toc-omit]Get started with AI observability

Read next

Accuracy, Precision, Recall

Classification Threshold

Start testing your AI systems today

Classification metrics guide

Accuracy, precision, and recall in multi-class classification

TL;DR

Binary vs. multi-class classification

Accuracy in multi-class

Visual example

Pros and cons

Precision and recall in multi-class

Precision and recall by class

Visual example

Pros and cons

Averaging precision and recall

Macro-averaging

Micro-averaging

Macro vs. micro-average

Visual example - Macro

Visual example - Micro

Pros and cons

Weighted averaging

Recap

Accuracy, precision, and recall in Python

[fs-toc-omit]Get started with AI observability

Evidently AI Team

Read next

Accuracy, Precision, Recall

Classification Threshold

Start testing your AI systems today