August 28, 2020
Last Updated: August 17, 2023

Machine Learning Monitoring, Part 2: Who Should Care, and What We Are Missing

ML Monitoring

This publication is a part of the Machine Learning Monitoring series. Read our first introductory blog on the "Why" behind model monitoring.

Who should care about machine learning monitoring?

The short answer: everyone who cares about the model's impact on business.

Of course, data scientists are on the front line. But once the model leaves the lab, it becomes a part of the company's products or processes. Now, this is not just some technical artifact but an actual service with its users and stakeholders.

The model can present outputs to external customers, such as a recommendation system on an e-commerce site. Or it can be a purely internal tool, such as sales forecasting models for your demand planners. In any case, there is a business owner (a product manager or a line-of-business team) who relies on it to deliver results. And a handful of other people are involved, with roles ranging from data engineering to support.

Both data and business teams need to track and interpret model behavior.

[Image: questions from different roles. Data scientist: "Why is my model drifting?" Data science manager: "Is it time to retrain?" Product manager: "What are the model limitations?" Compliance: "Is this model safe to use?"]

For the data team, this is about efficiency and impact. You want your models to make the right call, and the business to adopt them. You also want maintenance to be hassle-free. With adequate monitoring, you can quickly detect, resolve, and prevent incidents, and refresh the model as needed. Observability tools help keep the house in order and save you time to build new things.

For business and domain experts, it ultimately boils down to trust. When you act on model predictions, you need a reason to believe they are right. You might want to explore specific outcomes or get a general sense of the model's weak spots. You also need clarity on the ongoing model value and peace of mind that risks are under control.

If you operate in healthcare, insurance, or finance, this supervision gets formal. Compliance will scrutinize the models for bias and vulnerabilities. And since models are dynamic, it is not a one-and-done sort of test. You have to continuously run checks on the live data to see how each model keeps up.
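
To make "continuously run checks on the live data" concrete, here is a minimal sketch of one such check in plain Python. It compares each numeric feature in a fresh batch of production data against a reference sample using a two-sample Kolmogorov-Smirnov test. This is a generic illustration, not Evidently's API; the feature names, the synthetic data, and the 0.05 significance threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(reference: pd.DataFrame, current: pd.DataFrame,
                 alpha: float = 0.05) -> pd.DataFrame:
    """Flag numeric features whose live distribution differs from the reference."""
    rows = []
    for col in reference.select_dtypes(include=np.number).columns:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        rows.append({"feature": col, "ks_stat": stat,
                     "p_value": p_value, "drift_detected": p_value < alpha})
    return pd.DataFrame(rows)

# Illustrative data: the "age" feature shifts in production, "income" does not.
rng = np.random.default_rng(42)
reference = pd.DataFrame({"age": rng.normal(40, 10, 1000),
                          "income": rng.normal(50_000, 8_000, 1000)})
current = pd.DataFrame({"age": rng.normal(48, 10, 1000),
                        "income": rng.normal(50_000, 8_000, 1000)})
print(drift_report(reference, current))
```

Scheduled on every new batch of live data, even a simple report like this turns "check on my models" from an ad-hoc task into a routine, auditable step.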

We need a complete view of the production model.

Proper monitoring can provide this and serve each party the right metrics and visualizations.

[Image: different monitoring visualizations for different roles. Data scientist: drifting features. Product manager: underperforming segments. Model user: key decision factors. Business: savings over time.]

Let's face it. Enterprise adoption can be a struggle. And it often only starts after model deployment. There are reasons for that.

In an ideal world, you can translate all your business objectives into an optimization problem and reach the model accuracy that makes human intervention obsolete.

In practice, you often get a hybrid system and a bunch of other criteria to deal with: stability, ethics, fairness, explainability, user experience, or performance on edge cases. You can't simply blend them all into your error minimization goal. They need ongoing oversight.

A useful model is a model used.

Fantastic sandbox accuracy makes no difference if the model never makes it into the production system.

Beyond "quick win" pilot projects, one has to make the value real. For that, you need transparency, stakeholder engagement, and the right collaboration tools.

Visibility pays off.

This shared context improves adoption. It also helps when things go off track.

Suppose a model returns a "weird" response. It is the domain experts who help you decide whether you can dismiss it. Or say your model fails on a specific population. Together, you can brainstorm new features to address this.

Want to dig into the emerging data drift? Adjust the classifier decision threshold? Figure out how to compensate for model flaws by tweaking product features?

All this requires collaboration.

Such engagement is only possible when the whole team has access to relevant insights. A model should not be an obscure black-box system. Instead, you treat it as a machine learning product that one can audit and supervise in action.

When done right, model monitoring is more than just technical bug-tracking. It serves the needs of many teams and helps them collaborate on model support and risk mitigation.

The monitoring gap

In reality, there is a painful mismatch. Research shows that companies monitor only one-third of their models. As for the rest? We seem to be in the dark.

This is how the story often unfolds.

At first, a data scientist baby-sits the model. Immediately after deployment, you often need to collect feedback and iterate on the details, which keeps you occupied. Then the model is deemed fully operational, and its creator leaves for a new project. The monitoring duty is left hanging in the air.

Some teams routinely revisit their models for a basic health check but miss anything that happens in between. Others only learn about issues from their users and then rush to put out the fire.

The solutions are custom and partial.

For the most important models, you might find a dedicated home-grown dashboard. Often it grows into a Frankenstein of custom checks, one added after each successive failure the team encounters. To complete the picture, each model monitor tends to have its own custom interface, while business KPIs live in separate, siloed reports.

If someone on a business team asks for deeper model insights, this means custom scripts and time-consuming analytical work. Or, quite often, the request is simply written off.

It is hard to imagine critical software that relies on spot-checking and manual review. But these disjointed, piecemeal solutions are surprisingly common in the modern data science world.

Why is it so?

One reason is the lack of clear responsibility for the deployed models. In a traditional enterprise setting, you have a DevOps team that takes care of any new software. With machine learning, this is a grey zone.

Sure, IT can watch over service health. But when the input data changes—whose turf is it? Some aspects concern data engineering, while others are closer to operations or product teams.

Everybody's business is nobody's business.

The data science team usually takes up the monitoring burden. But they juggle way too many things already and rarely have incentives to put maintenance first.

In the end, we often drop the ball.

[Image: a data scientist's to-do list. Design A/B test. Clean up data. Define problem. Try new tool. Set up SHAP demo: 20 days late. Check on my models: 30 days late.]
A day in the life of an enterprise data scientist.

Keep an eye on AI

We should urgently address this gap with production-focused tools and practices.

As applications grow in number, holistic model monitoring becomes critical. You can hand-hold one model, but not a dozen.

It is also vital to keep the team accountable. We deploy machine learning to deliver business value, and we need a way to show it clearly in production. We also need to make the cost of downtime visible, along with the importance of support and improvement work.

Of course, the data science process is chaotic all over.

We log experiments poorly. We mismanage deployments. Machine learning operations (aka MLOps) is a rising practice that tackles this mess step by step, and monitoring sits somewhere at the very end of it. Yet we'd argue that we should solve it early. Ideally, as soon as your first model gets shipped.

When a senior leader asks you how the AI project is doing, you don't want to take a day to respond. Nor do you want to be the last to know about a model failure.

Seamless production, visible gains, and happy users build the reputation machine learning needs to scale. Unless you work in pure research, that is where we should aim.

Summing up

Monitoring might be boring, but it is essential to success.
Do it well, and do it sooner.

Next, we take a deep dive into diverse monitoring needs, metrics, and strategies. We start with data quality and integrity.

Elena Samuylova, Co-founder and CEO, Evidently AI
https://www.linkedin.com/in/elenasamuylova/
