Observing the model in production has straightforward goals. We want to detect if something goes wrong, ideally in advance.
We also want to diagnose the root cause and quickly understand how to address it. Maybe the model degrades too fast, and we need to retrain it more often? Perhaps the error is too high, and we need to adapt the model and rebuild it? Which new patterns are emerging?
In our case, we simply start by checking how well the model performs outside the training data. Our first week in production becomes what would otherwise have been a holdout dataset.
We continue working with the sample Jupyter notebook. For demonstration purposes, we generated all predictions for several weeks ahead in a single batch. In reality, we would run the model sequentially as the data comes in.
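As a minimal sketch of that batch step, assuming hourly data and a scikit-learn-style regressor (the feature names, dates, and toy data below are placeholders, not the notebook's actual ones):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the demand data: 35 days of hourly rows.
rng = np.random.default_rng(0)
idx = pd.date_range("2011-01-01", periods=35 * 24, freq="h")
features = ["temp", "humidity", "windspeed"]  # placeholder feature set
df = pd.DataFrame(rng.random((len(idx), 3)), columns=features, index=idx)
df["target"] = 100 * df["temp"] + rng.normal(0, 5, len(idx))

# Fit on the first 28 days, then score the whole period in one batch,
# mimicking the "several weeks ahead at once" shortcut.
train = df.loc[:"2011-01-28"]
model = RandomForestRegressor(random_state=0).fit(train[features], train["target"])
df["prediction"] = model.predict(df[features])
```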
To choose the period for analysis, we simply point to the corresponding rows in the DataFrame.
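Concretely, on the toy frame from above, a period is just a row slice; with a datetime index, label-based `.loc` slicing works, and positional `.iloc` ranges would do the same job:

```python
# For example, indicate the rows for one calendar week:
first_week = df.loc["2011-01-29":"2011-02-04"]
```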
Let's start by comparing the performance in the first week to what we have seen in training. The first 28 days are our Reference dataset; the next 7 are the Production dataset.
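The notebook renders this comparison visually; as a rough numeric stand-in, the same check can be run with scikit-learn metrics over the two slices (again on the toy frame, so the numbers are illustrative only):

```python
from sklearn.metrics import mean_absolute_error

# Reference: the 28 training days. Production: the following 7 days.
reference = df.loc["2011-01-01":"2011-01-28"]
production = df.loc["2011-01-29":"2011-02-04"]

for name, part in (("Reference", reference), ("Production", production)):
    mae = mean_absolute_error(part["target"], part["prediction"])
    bias = (part["prediction"] - part["target"]).mean()
    print(f"{name}: MAE = {mae:.1f}, mean error = {bias:.1f}")
```

A noticeably higher MAE or a mean error drifting away from zero on the Production slice is the first signal that the model is not holding up outside the training period.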