

What Is Your Model Hiding? A Tutorial on Evaluating ML Models

Last updated:

April 9, 2025

Published:

March 26, 2021


Imagine you trained a machine learning model, or maybe a couple of candidates to choose from.

You ran them on the test set and got some quality estimates. Models are not overfitted. Features make sense. Overall, they perform as well as they can, given the limited data at hand.

Now, it is time to decide if any of them is good enough for production use. How do you evaluate and compare your models beyond the standard performance checks?

In this tutorial, we will walk through an example of how to assess your model in more detail.

Code example: if you prefer to head straight to the code, open this example Jupyter notebook.

⚠️ Disclaimer:
This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this example. For updated and new examples, visit our documentation.


Case in point: predicting employee attrition

We will be working with a fictional dataset from a Kaggle competition. The goal is to identify which employees are likely to leave the company soon.

The idea sounds straightforward: with an early warning, you might stop the person from leaving. A valuable expert stays with the company, and there is no need to search for a new hire and wait until they learn the ropes.

Let us try to predict those who are at risk in advance!

Employee attrition: some leave the company

To start, we examine the training data. It was conveniently collected for us. A seasoned data scientist would get suspicious!

Let's take it for granted and skip the hard part of constructing the dataset.

We have data on 1,470 employees.

A total of 35 features describe things like:

employee background (education, marital status, etc.)
details of the job (department, job level, need for business travel, etc.)
employment history (years with the company, last promotion date, etc.)
compensation (salary, stock options, etc.)

and some other characteristics.

There is also a binary label to see who left the company. Exactly what we need!

We frame the problem as a probabilistic classification task. The model should estimate the likelihood of belonging to the target "attrition" class for each employee.

How likely is a person to leave the company?

When working on the model, we do the usual split into training and test datasets. We use the first to train the model. We hold the rest to check how it performs on the unseen data.

We will not detail the model training process. That is the data science magic we are sure you know!

Let's assume we ran our fair share of experiments. We tried out different models, tuned hyperparameters, and made interval assessments in cross-validation.

We ended up with two technically sound models that look equally good.

Next, we checked their performance on the test set. Here is what we got:

A Random Forest model with a ROC AUC score of 0.795
A Gradient Boosting model with a ROC AUC score of 0.803

ROC AUC is a standard metric to optimize for in the case of probabilistic classification. If you look at the numerous solutions to this Kaggle use case, that is what the majority optimize for.

Both our models seem fine. They are much better than random guessing (which would score about 0.5), so we definitely have some signal in the data.

The ROC AUC scores are close. Given that it is just a single-point estimate, we can assume the performance is about the same.
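To make the metric concrete: ROC AUC can be read as the probability that a randomly chosen leaver gets a higher score than a randomly chosen stayer. Below is a minimal pure-Python sketch of that pairwise definition. The function name and the toy scores are made up for illustration and do not come from the notebook.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison: the share of (positive, negative)
    pairs where the positive is scored higher. Ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = left the company, 0 = stayed.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.5, 0.1, 0.7]

auc = roc_auc(y_true, y_score)
print(round(auc, 3))  # 8 of the 9 pairs are ordered correctly -> 0.889
```

Note that the metric looks only at how the scores are ranked. It says nothing about which label we assign at a given threshold, which is why two models with similar ROC AUC can behave very differently in use.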

Which of the two should we pick?

Same quality, different qualities

Let's look at the models in more detail.

We will use the Evidently open-source library to compare the models and generate the performance reports.

If you want to follow it step by step, here is a complete Jupyter notebook.

First, we trained the two models and evaluated their performance on the same test dataset.

Next, we prepared the performance logs from both models as two pandas DataFrames. Each includes the input features, predicted class, and true labels.

We specified column mapping to define the location of our target, predicted class, as well as categorical and numerical features.

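Then, we call the evidently report and include the classification preset. It shows the performance of the two models in a single dashboard so we can compare them.

# Imports for the Evidently API as available in version 0.6.7 or lower
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

classification_performance_report = Report(metrics=[
    ClassificationPreset(),
])

# Reference = Random Forest log, current = Gradient Boosting log, both on the test set.
# gb_merged_test is a placeholder name for the second model's DataFrame.
classification_performance_report.run(reference_data=rf_merged_test, current_data=gb_merged_test, column_mapping=column_mapping)

classification_performance_report

We treat our simpler Random Forest model as a baseline. For the tool, it becomes the "Reference." The second, Gradient Boosting model is denoted as the "Current" model under evaluation.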

We can quickly see the summary of performance metrics for both models on the test set.

Summary of metrics from the Evidently report.

Real life is not Kaggle, so we do not always focus on the second digit. Had we looked only at accuracy and ROC AUC, the performance of the two models would look pretty close.

We might even have our reasons to favor a simpler Random Forest model. For example, because it is more interpretable or has better computational performance.

But the difference in F1-score hints that there might be more to the story. The inner workings of the models vary.

A refresher on problems with imbalanced classes

A savvy machine learner knows the trick. The sizes of our two classes are far from equal. In this case, the accuracy metric is of little use, even though the numbers might look good "on paper."

The target class is often a minor one. We want to predict some rare but important events: fraud, churn, resignations. In our dataset, only 16% of the employees left the company.

If we make a naive model that just classifies all employees as "likely to stay," our accuracy is an all-star 84%!

Constant prediction: model is right 8 out of 10
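The arithmetic behind this naive baseline is easy to check. The sketch below uses an illustrative sample of 100 employees with the 16% attrition rate quoted above (the sample size is an assumption chosen for round numbers), and also computes recall for the "left" class.

```python
# Illustrative sample: 100 employees, 16 of whom left (the 16% rate from the dataset).
y_true = [1] * 16 + [0] * 84

# Naive model: predict "likely to stay" (0) for everyone.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.84, recall=0.00
```

The accuracy looks respectable, yet the model does not find a single resignation.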

The ROC AUC does not give us a complete picture. Instead, we have to find the metric that better fits the intended model use.

What does it mean to have a "good" model?

You know the answer: it depends.

It would be great if a model simply pinpoints those about to resign and is always right. Then we could do absolutely anything! An ideal model fits any use case—and does not occur in reality.

Instead, we deal with imperfect models to make them useful for our business processes. Depending on the application, we might pick different criteria for model evaluation.

No single metric is ideal. But models don't exist in a vacuum—we hope you started with the why!

Let us consider different application scenarios and evaluate the model in this context.

Example 1: labeling each employee

In practice, we will likely integrate the model into some existing business processes.

Suppose our model is used to display a label in the interface of an internal HR system. We want to highlight each employee that has a high risk of attrition. When a manager logs into the system, they will see a "high risk" or "low risk" label for each person in the department.

Interface, employees labeled as high or low risk

We want to display the label for all employees. We need our model to be as "right" as it can. But we already know that the accuracy metric hides all the important details. How would we evaluate our models instead?

Beyond accuracy

Let's go back to the evidently report and analyze the performance of both models in more depth.

We can quickly notice that the Confusion Matrices for the two models look different.

Our first model has only 2 false positives. Sounds great? Indeed, it does not give us too many wrong alerts on potential resignations.

But, on the other hand, it correctly identified just 6 resignations. The other 53 were missed.

The second model wrongly labeled 12 employees as high-risk. But it correctly predicted 27 resignations. It only missed 32.

The plot with Quality Metrics by Class sums this up. Let's look at the "yes" class.

Precision is about the same: when the model predicts resignation, it is right in 69-75% of cases.

But the second model wins in recall! It discovered about 46% of people who left the company (27 of 59) versus only 10% for the first model (6 of 59).
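These per-class numbers follow directly from the confusion matrix counts quoted above. Here is a small sketch that recomputes them. The helper name is ours, and the F1 values are derived from the same counts rather than copied from the report.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive ("yes") class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Test-set counts quoted in the text.
first_model = class_metrics(tp=6, fp=2, fn=53)
second_model = class_metrics(tp=27, fp=12, fn=32)

for name, (p, r, f1) in [("first", first_model), ("second", second_model)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# first: precision=0.750 recall=0.102 f1=0.179
# second: precision=0.692 recall=0.458 f1=0.551
```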

Which model would you pick?

Most probably, the one with the higher recall in the target "resignation" class would win. It helps us discover more of those likely to leave.

We can tolerate some false positives since it is the manager who interprets the prediction. The data that is already in the HR system also provides additional context.

Even more likely, it would be essential to add explainability to the mix. It could help the user interpret the model prediction and decide when and how to react.

To sum up, we would evaluate our models based on the recall metric. As a non-ML criterion, we would add the usability testing of the feature by the manager. Specifically, to consider explainability as part of the interface.

Example 2: sending proactive alerts

Let's imagine that we expect a specific action on top of the model.

It might still integrate with the same HR system. But now, we will send proactive notifications based on the prediction.

Maybe an email to the manager that prompts them to schedule a meeting with an at-risk employee? Or a specific recommendation of the possible retention steps, such as additional training?

In this case, we might have additional considerations about these false positives.

If we send the emails to managers too often, they are likely to be ignored. Unnecessary intervention might also be seen as a negative outcome.

What should we do?

If we do not have any new valuable features to add, we are left with the models we have. We cannot squeeze out more accuracy. But we can limit the number of predictions we act on.

The goal is to focus only on those employees where the predicted risk is high.

Precision-recall trade-off

The output of a probabilistic model is a number between 0 and 1. To use the prediction, we need to assign the label on top of these predicted probabilities. The "default" approach for binary classification is to cut at 0.5. If the probability is higher, the label is a "yes."

Instead, we can pick a different threshold. Maybe, 0.6 or even 0.8? By setting it higher, we will limit the number of false positives.

But it comes at the cost of recall: the fewer false alarms we raise, the fewer correct predictions we make, too.
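A short sketch can make the trade-off tangible. The labels and probabilities below are toy values invented for illustration, not outputs of either model, and the helper name is ours.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Label as positive when the probability exceeds the threshold,
    then compute precision and recall for the positive class."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (illustrative only): 4 leavers, 6 stayers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.55, 0.3, 0.65, 0.52, 0.4, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.6, 0.8):
    p, r = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
# threshold=0.5: precision=0.60 recall=0.75
# threshold=0.6: precision=0.67 recall=0.50
# threshold=0.8: precision=1.00 recall=0.25
```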

This Class Separation plot from the evidently report makes this idea very visual. It shows the individual predicted probabilities alongside the actual labels.

We can see that the first model makes a few very confident predictions. Adjusting the threshold slightly "up" or "down" would not make a big difference in absolute numbers.

However, we might appreciate a model's ability to pick a few cases with high conviction, for example, if we consider the cost of false positives to be very high. Making a cut-off at 0.8 would give a precision of 100%. We would make only two predictions, but both would be right.

If that is a behavior we like, we can design such a "decisive" model from the very beginning. It will strongly penalize false positives and make fewer predictions in the middle of the probability range. (To be honest, that is exactly what we did for this demo!)

The second model has more scattered predicted probabilities. Changing the threshold would create different scenarios. We can make a ballpark estimate just by looking at the image. For example, if we set a threshold at 0.8, it would leave us with just a couple of false positives.

To be more specific, let's look at the precision-recall table. It aims to help with the choice of threshold in similar situations. It shows different scenarios for top-X predictions.

For example, we can act only on the top-5% predictions for the second model. On the test set, it corresponds to the probability threshold of 66%. All employees with a higher predicted probability are considered likely to leave.

In this case, only 18 predictions remain. But 14 of them will be correct! The recall decreases to only 23.7% (14 of the 59 actual resignations), but the precision is now 77.8% (14 of 18). We might prefer it to the original 69% precision to minimize the false alarms.

To simplify the concept, we can imagine a line on the Class Separation plot.

In practice, we might make a limit in one of the two ways:

by acting only on top-X predictions, or
by assigning all predictions with a probability higher than X to the positive class.

The first option is available for batch models. If we generate predictions for all employees at once, we can sort them and take, say, the top-5%.

If we make individual predictions on request, picking a custom probability threshold makes sense.

If predicted probability is over 0.65 then label as true

Either of the two approaches can work depending on the use case.
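The two ways of limiting predictions can be sketched as follows. The function names, the scores, and the 10% share are all illustrative assumptions, not values from the notebook.

```python
def top_percent(probs, percent):
    """Batch mode: indices of the highest-scored `percent` of predictions."""
    k = max(1, len(probs) * percent // 100)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def above_threshold(probs, threshold):
    """Per-request mode: indices whose probability exceeds a fixed cut-off."""
    return [i for i, p in enumerate(probs) if p > threshold]


# Toy scores for 20 employees (illustrative only).
probs = [0.91, 0.12, 0.66, 0.05, 0.48, 0.73, 0.30, 0.09, 0.55, 0.21,
         0.02, 0.61, 0.17, 0.84, 0.33, 0.07, 0.44, 0.26, 0.15, 0.38]

top = top_percent(probs, 10)                   # top 10% of 20 -> 2 employees
implied_cutoff = min(probs[i] for i in top)    # lowest score that still made the cut
flagged = above_threshold(probs, 0.65)

print(top, implied_cutoff, flagged)  # [0, 13] 0.84 [0, 2, 5, 13]
```

The top-X rule keeps the workload fixed (always two alerts here), while the fixed threshold keeps the required confidence constant and lets the number of alerts vary from batch to batch.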

We might also decide to visualize labels differently. For example, to label each employee as high, medium, or low risk of attrition. It would require multiple thresholds based on predicted probabilities.

In this case, we would pay additional attention to the quality of model calibration as seen on the Class Separation plot.
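A three-level label could be as simple as the sketch below. The cut-offs of 0.3 and 0.6 are hypothetical placeholders. In practice, they would come from the precision-recall analysis and from checking how well the probabilities are calibrated.

```python
def risk_label(prob, medium_from=0.3, high_from=0.6):
    """Map a predicted probability to a three-level risk label.
    The default cut-offs are placeholders, not recommended values."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if prob >= high_from:
        return "high"
    if prob >= medium_from:
        return "medium"
    return "low"


labels = [risk_label(p) for p in (0.05, 0.30, 0.59, 0.60, 0.95)]
print(labels)  # ['low', 'medium', 'medium', 'high', 'high']
```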

To sum up, we would consider the precision-recall trade-off to evaluate our models and pick the application scenario. Instead of displaying a prediction for everyone, we choose a threshold. It helps us focus only on the employees with the highest risk of attrition.

Example 3: apply the model selectively

We might also take a third approach.

When looking at the different plots from the two models, an obvious question comes up. Who are the specific employees behind the dots on the plots? How do the two models differ in predicting resignees from different roles, departments, experience levels?

This sort of analysis might help us decide when to apply the model and when not. If there are apparent segments where the model fails, we can exclude them. Or, in reverse, we can only apply the model where it performs well.

In the interface, we can show something like "not enough information." It might be better than being consistently wrong!

Interface, employees labeled as high or low risk, some as n/a

Segments of low performance

To get more insight on underperforming segments, let's analyze the Classification Quality table. For each feature, it maps the predicted probabilities alongside feature values.

This way, we can see where the model makes mistakes and if they are dependent on the values of individual features.

Let's take an example.

Here is a Job Level feature, which is a specific attribute of the seniority of the role.

If we are most interested in the employees from Level 1, the first model might be a good choice! It makes a few confident predictions with high probabilities. For example, at the 0.6 threshold, it has only one false positive in this group.

If we want to predict resignations in Level 3, the second model looks much better.

If we want our model to work for all levels, we would probably pick the second model again. On average, it has acceptable performance for Levels 1, 2, and 3.

But what is also interesting is how both models perform on Levels 4 and 5. For all predictions made for employees in these groups, the probabilities are visibly lower than 0.5. Both models always assign a "negative" label.

If we look at the distribution of the true labels, we can see that the absolute number of resignations is pretty low in these job levels. Likely it was the same in training, and the model did not pick up any useful patterns for the segment.
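The same check can be done numerically by counting, per segment, how many actual resignations there are and how many of them the model catches. The sketch below is one way to do it. The rows are toy values invented for illustration, and the helper name is ours.

```python
from collections import defaultdict


def recall_by_segment(rows, threshold=0.5):
    """rows: (segment, true_label, predicted_probability) tuples.
    Returns {segment: (n_positives, n_caught, recall)}; recall is None
    when the segment has no positives at all."""
    positives = defaultdict(int)
    caught = defaultdict(int)
    segments = set()
    for segment, y, prob in rows:
        segments.add(segment)
        if y == 1:
            positives[segment] += 1
            if prob > threshold:
                caught[segment] += 1
    result = {}
    for s in sorted(segments):
        n_pos = positives[s]
        result[s] = (n_pos, caught[s], caught[s] / n_pos if n_pos else None)
    return result


# Toy rows (illustrative only): (job_level, left_company, predicted_probability)
rows = [
    (1, 1, 0.8), (1, 1, 0.6), (1, 1, 0.3), (1, 0, 0.2), (1, 0, 0.55),
    (2, 1, 0.7), (2, 1, 0.4), (2, 0, 0.1),
    (4, 1, 0.2), (4, 0, 0.1), (4, 0, 0.05),
    (5, 0, 0.1),
]

by_level = recall_by_segment(rows)
for level, (n_pos, n_caught, recall) in by_level.items():
    print(level, n_pos, n_caught, recall)
```

Reporting the raw counts next to the recall matters: a segment with one or zero positives tells us very little, whatever the ratio says.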

If we were to deploy a model in production, we could construct a simple business rule and exclude these segments from the application.
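Such a rule might look like the sketch below. The set of excluded levels mirrors the Levels 4 and 5 finding above, while the function name, label texts, and default threshold are assumptions made for illustration.

```python
# Segments where neither model produced a useful signal in this analysis.
EXCLUDED_JOB_LEVELS = {4, 5}


def display_label(prob, job_level, threshold=0.5):
    """Return the label to show in the interface, or an explicit
    'not enough information' for segments we chose not to score."""
    if job_level in EXCLUDED_JOB_LEVELS:
        return "not enough information"
    return "high risk" if prob > threshold else "low risk"


print(display_label(0.72, job_level=1))  # high risk
print(display_label(0.10, job_level=2))  # low risk
print(display_label(0.10, job_level=5))  # not enough information
```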

We can also use the results of this analysis to put our model on a "performance improvement plan." Maybe, we can add more data to help the model?

For example, we might have "older" data that we initially excluded from training. We can selectively augment our training dataset for the underperforming segments. In this case, we would add more old data on resignations from employees of Levels 4 and 5.

To sum up, we can identify specific segments where our model fails. We still show the prediction for as many of our employees as possible. But knowing that the model is far from perfect, we apply it only for those parts of the workforce where it performs best.

What does the model know?

This same table can also help us understand the model's behavior in more detail. We can explore the errors, outliers, and get a feeling of what the models learned.

For example, we've already seen that the first model predicts only a few resignations with confidence. The second model "catches" more useful signals from our data. Where does it come from?

If we look through our features, we can get a hint.

For example, the first model successfully predicts resignations only for those relatively new to the company. The second model can detect potential leavers with up to 10 years of experience. We can see it from this plot:

We can see a similar thing with the stock options level.

The first model only successfully predicts those with Level 0, even though we have quite a few resignees at Level 1 as well! The second model catches more of those leaving with higher levels.

But if we look at salary hike (i.e., a recent increase in salary), we will notice no clear segments where either of the models performs better or worse.

There is no specific "skew" beyond the first model's general trait to make fewer confident predictions.

Similar analysis can help choose between models or find ways to improve them.

Like with the example of JobLevel above, we might have ways to augment our dataset. We might add data for other periods or include more features. In the case of imbalanced segments, we can experiment with giving more weight to specific examples. As a last resort, we can add business rules.
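As one possible way to "give more weight" to rare examples, here is a sketch of a common inverse-frequency heuristic: each example is weighted by n / (k * count of its group), where n is the number of examples and k the number of groups. The helper name is ours, and the 16/84 split is the illustrative class balance from earlier. Whether such weights actually help is something to verify in experiments.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """One weight per example, inversely proportional to the size of its group.
    `groups` can hold class labels or segment identifiers."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    per_group = {g: n / (k * c) for g, c in counts.items()}
    return [per_group[g] for g in groups]


# Illustrative class balance: 16 leavers, 84 stayers.
labels = [1] * 16 + [0] * 84
weights = inverse_frequency_weights(labels)

print(weights[0], round(weights[-1], 4))  # 3.125 for leavers, 0.5952 for stayers
```

Many training APIs accept per-example weights of this kind, so the resulting list can usually be passed along at fit time.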

We have a winner!

Getting back to our example: the second model is a winner for most scenarios.

But who would swear by it just by looking at ROC AUC?

We had to go beyond singular metrics to evaluate the models in depth.

It applies to many other use cases. There is more to performance than accuracy. And it is not always possible to assign straightforward "cost" to each error type to optimize for it. Treating models like a product, the analysis has to be more nuanced.

It is critical not to lose sight of the use case scenario and tie our criteria to it. Visualizations might help communicate with business stakeholders who do not think in ROC AUC terms.

Small print

A few disclaimers.

This tutorial is less about resignation prediction and more about model analytics!

If you look to solve a similar use case, let us point to at least a few limitations in this toy dataset.

We lack a critical data point: the type of resignation. People can leave voluntarily, get fired, retire, move across the country, and so on. These are all different events, and grouping them together might create ambiguous labeling. It would make sense to focus on a "predictable" type of resignation or solve a multi-class problem instead.

There is not enough context about the work performed. Some other data might indicate churn better: performance reviews, specific projects, promotion planning, etc. This use case calls for careful construction of the training dataset with domain experts.

There is no data about time and resignation dates. We cannot account for the sequence of events and relate to specific periods in the company history.

Last but not least, a use case like this can be highly sensitive.

You might use a similar model to predict the turnover of the front-line personnel. The goal would be to predict the workload of the recruitment department and related hiring needs. Incorrect predictions can lead to some financial risks, but those are easy to factor in.

But if the model is used to support decisions about individual employees, the implications can be more critical. Consider bias in allocating training opportunities, for example. We should evaluate the ethics of the use case and audit our data and model for bias and fairness.

Can I do the same for my model?

If you want to walk through the tutorial example, here is the Jupyter notebook. It includes all the steps to train two models using the employee attrition dataset from Kaggle and generate evidently reports.

If you want to perform a similar diagnostic check for your model, go to GitHub, pip install evidently, and choose a suitable classification or regression preset. There is more!