Book: Machine Learning System Design
Previous: Part 3 Intermediate steps
Next: 10 Training pipelines

9 Error analysis

This chapter covers

  • Learning curve analysis
  • Residual analysis
  • Finding commonalities in residuals

Once we’ve assembled the initial building blocks, which include gathering the first dataset, picking the metrics, defining the evaluation procedure, and training the baseline, we are ready to start the iterative adjustment process. Just as backpropagation in neural networks calculates the direction of the fastest loss reduction and passes it backward from layer to layer, error analysis finds the fastest improvement for the whole system.

Error analysis steps up as the compass that guides the iterative updates of your system. It helps you understand the error dynamics during the training phase (learning curve analysis) and the distribution of errors after the prediction phase (residual analysis). By analyzing these errors, you can identify commonalities, trends, and patterns that inform improvements to your machine learning (ML) system. In this chapter, we will examine its crucial stages and types and provide examples that we hope will give you a better understanding of the subject.

Error analysis is often skipped when ML systems are designed for a reason that seems somewhat legit at first glance—this step is not part of building the system per se. However, time spent on error analysis is always a good investment as it reveals weak spots and suggests ways of improving the system. Leaving this step out of this book would be a huge mistake from our side.

9.1 Learning curve analysis

Learning curve analysis evaluates the learning process by plotting and analyzing the learning curve, showing the relationship between the model’s training performance and the amount of training data used. Learning curve analysis is designed to answer two vital questions:

If both questions lead to negative answers, there is no need for the rest of the analysis.

Before we get into details, what is a learning curve? The term was coined in behavioral psychology, where it is used to display the learning progress of a person or animal observed over time (see figure 9.1). For instance, we may analyze the number of mistakes made by a subject in every new test iteration or study how much time it takes for a mouse to find a path through the labyrinth compared to the trial number.

figure
Figure 9.1 Basic representation of a learning curve

In ML, a learning curve is essentially a graphical representation that shows the dependency of a chosen metric on a specific numerical property, such as the number of iterations, dataset size, or model complexity. Let’s do a brief breakdown of all three properties:

These properties reveal the three most common types of learning curves. Before diving into each, we should recall “the quest for the grail of machine learning,” the overfitting and underfitting problem, sometimes referred to as the bias–variance tradeoff (see figure 9.2).

figure
Figure 9.2 With an increasing number of model parameters, training error tends to become lower and lower while minimizing bias. At the same time, the model variance increases, providing us with a U-shaped validation error.

9.1.1 Overfitting and underfitting

Overfitting happens when a model shows great performance on training data but poor performance on unseen data. Usually, this happens when it learns the training data so well that it becomes too specialized and fails to generalize, focusing too much on insignificant details and patterns that are not present in the new data.

On the other hand, underfitting occurs when the model is too simple and misses some important relationships between features and the target variable, resulting in poor performance on both the training data and new data.

Both are strongly related to the bias–variance tradeoff, which is the balance between the model’s complexity and the amount of input data. The greater the model’s capacity to capture useful signals from data, the lower the bias and the higher the risk of overfitting. On the other hand, reducing variance requires decreasing complexity, thus leading to a more biased model.

Bias is an error caused by the low capacity of the model to capture useful signals in the data. In other words, the model is biased toward its simplified assumptions about the data. When the model is biased, we call it underfitting.

Variance is an error caused by the model’s high sensitivity to small fluctuations in the training set. The model is poorly generalized on new data, which, in terms of the model’s parameters, is highly varied from what it has seen in the training set. Often, high variance is a primary reason behind overfitting (assuming nothing is broken in other parts of the system).

A good learning algorithm is expected to minimize both bias and variance simultaneously. The bias–variance tradeoff, however, vividly demonstrates that reducing variance often involves increasing bias and vice versa. The quest here is to find the right balance between the two. This is when learning-curve analysis can guide us.

Keep in mind that the model’s redundant complexity (i.e., high variance) is not the only reason behind overfitting. Some other possible scenarios include

Regardless of a given case, learning curves are an effective tool to detect underfitting and overfitting. Armed with the knowledge about overfitting and underfitting, we are ready to go through different types of curves and the hints they give us in this quest.

9.1.2 Loss curve

A loss curve (also referred to as a convergence curve or learning curve) based on learning iterations is the first thing that comes to mind for ML engineers when they hear the term “learning curve.” It shows how much the algorithm is improving as it puts more and more learning effort into the task.

The curve plots the loss (or metric) on the vertical axis and the number of training iterations (or epochs) on the horizontal axis. As the model trains, the loss should decrease, ideally forming a downward slope toward the bottom of the curve.

In contrast to learning curves where the x-axis is the dataset size or the model complexity (which we will talk about soon), the iteration-wise curve requires only one training run, which makes tracking it practical even for large datasets where a single training run takes hours or even days. The loss curve helps keep your finger on the pulse all the way until the end of training. If you’ve run just 10 training epochs out of 200, you can already get insights on whether the loss value is progressing as expected or if there are issues that make further training pointless.

Be sure you track loss curves, collect them for all conducted experiments, and make them available for future analysis. Incorporating loss-curve monitoring into the training pipeline in the early stages of its building is a valuable one-time effort because you will need these insights for all future experiments, and it is helpful for the overall pipeline’s reproducibility (we’ll have an in-depth look into this subject in chapter 10).
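As a sketch of what such tracking can look like, the `LossTracker` class below is our own illustration, not part of any particular framework; real pipelines would typically delegate this to an experiment-tracking tool:

```python
import csv

class LossTracker:
    """Records per-epoch train/validation loss so curves can be
    plotted and compared across experiments later."""

    def __init__(self):
        self.history = []  # list of (epoch, train_loss, val_loss)

    def log(self, epoch, train_loss, val_loss):
        self.history.append((epoch, train_loss, val_loss))

    def is_diverging(self, window=3):
        """Heuristic: validation loss rose over each of the last
        `window` epochs while training loss kept falling."""
        if len(self.history) <= window:
            return False
        recent = self.history[-(window + 1):]
        val_up = all(a[2] < b[2] for a, b in zip(recent, recent[1:]))
        train_down = all(a[1] > b[1] for a, b in zip(recent, recent[1:]))
        return val_up and train_down

    def save(self, path):
        """Persist the curve as a CSV artifact next to the model."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "train_loss", "val_loss"])
            writer.writerows(self.history)
```

Saving the history as an artifact for every run keeps the curves of past experiments comparable, which is exactly the reproducibility payoff mentioned above.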

9.1.3 Interpreting loss curves

There are several main patterns in the behavior of loss curves that deviate from what is expected by design. Let’s briefly analyze each pattern we may encounter while debugging the system: what conclusions can be drawn and what steps should be taken to fix the detected issues.

Pattern 1

Pattern 1 shows that the loss curve diverges (not converging to the desired loss value and oscillating instead; see figure 9.3). How can we try to make the training process more stable? Consider the following:

figure
Figure 9.3 Loss is oscillating, which demonstrates a lack of convergence.

Pattern 2

Pattern 2 shows a loss explosion or trending toward NaN (not a number) (see figure 9.4). This behavior indicates computational problems: either the gradient explodes (in which case solutions like gradient clipping, a lower learning rate, or different weight initialization techniques may help) or mathematical problems emerge (e.g., division by zero, the logarithm of zero or negative numbers, or NaNs in data), which usually indicates an error in implementation or a lack of data preprocessing.

figure
Figure 9.4 A model was converging until something went wrong.
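Both remedies are easy to sketch outside of any specific framework; the function names below are our own, and a real training loop would use its framework’s built-in equivalents (e.g., native gradient clipping):

```python
import numpy as np

def guard_loss(loss):
    """Fail fast instead of silently training on NaN/inf losses."""
    if not np.isfinite(loss):
        raise FloatingPointError(f"non-finite loss encountered: {loss}")
    return loss

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm,
    a standard remedy for exploding gradients."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Checking the loss every step costs almost nothing and turns a silent NaN explosion into an immediate, debuggable failure.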

Pattern 3

Pattern 3 shows that the loss decreases, but the metric is contradictory (see figure 9.5). If the model continues improving based on the loss but the metric is stuck, it may signal that the chosen metric is inadequate for the problem or poorly implemented. Typically, it happens in classification and other related tasks where we use metrics that include a certain threshold.

figure
Figure 9.5 Loss decreases while the metric stays constantly low.

Pattern 4

Pattern 4 shows a converging training curve with unexpected loss values. While the shape of the training curve appears promising, the observed values are perplexing. To identify such anomalies in advance, it is advisable to do a sanity check by running a simple unit test on a single batch that asserts that the loss falls within the expected range. Often, the reason behind such an issue lies in scaling transformations (e.g., normalization of an image or a mask in segmentation).
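For instance, a K-class classifier at random initialization should be close to guessing uniformly, so its cross-entropy on a single batch should be near ln K. A sketch of such a sanity check (the helper names are ours):

```python
import math
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def test_initial_loss_in_expected_range(num_classes=10, batch_size=32):
    """At initialization a classifier should be close to guessing
    uniformly, so its loss should be near ln(num_classes)."""
    rng = np.random.default_rng(0)
    # Stand-in for an untrained model: uniform predicted probabilities.
    probs = np.full((batch_size, num_classes), 1.0 / num_classes)
    labels = rng.integers(0, num_classes, size=batch_size)
    loss = cross_entropy(probs, labels)
    expected = math.log(num_classes)
    assert abs(loss - expected) < 0.1, f"loss {loss:.3f}, expected ~{expected:.3f}"
    return loss
```

If the real model’s first-batch loss is far from this value, a scaling or preprocessing bug is a likely suspect.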

Pattern 5

figure
Figure 9.6 Loss is decreasing for training but not validation, reflecting potential overfit.

Pattern 5 shows that the training loss decreases while the validation loss increases (see figure 9.6). This is a classic textbook example of overfitting due to high variance. In these cases, you should restrict the model’s capacity, either by reducing its complexity directly or by increasing regularization.

9.1.4 Model-wise learning curve

After we ensure the model converges and the training loss reaches the plateau with no drastic overfitting or underfitting, we can wrap up our learning-curve analysis and move on. This is especially relevant at the stage of initial deployment.

However, if we face overfitting/underfitting issues or there is enough time to experiment with the optimal model size, that is when the second type of learning curve comes into play (see figure 9.7):

  1. First, we pick a hyperparameter that represents a varying model complexity. Again, it may be the tree depth in gradient boosting, a regularization strength, a number of features, or a number of layers in a deep neural network.
  2. We define a grid for this hyperparameter (e.g., 2, 3, 4, …, 16 for the tree depth; 10⁻², 10⁻¹, 1, 10, 10², 10³ for the regularization term).
  3. We train each model until convergence and capture the final loss/metric values.
figure
Figure 9.7 Finding the optimal model complexity based on learning curves

Now we map these values on the vertical axis and corresponding hyperparameter values on the horizontal axis. This learning curve helps us easily see what range of model complexity (determined by this hyperparameter) is optimal for the given data.
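The whole procedure can be sketched end to end with polynomial degree standing in for model complexity (a toy substitute for tree depth or network size; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: quadratic signal plus noise.
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, size=200)
x_tr, y_tr = x[:15], y[:15]        # small train set makes overfitting visible
x_va, y_va = x[15:], y[15:]

degrees = list(range(1, 13))       # the "model complexity" grid
train_err, val_err = [], []
for d in degrees:
    coef = np.polyfit(x_tr, y_tr, d)   # "train until convergence" (closed form)
    train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    val_err.append(np.mean((np.polyval(coef, x_va) - y_va) ** 2))

# Training error only falls with capacity; validation error is U-shaped.
best_degree = degrees[int(np.argmin(val_err))]
```

Plotting `train_err` and `val_err` against `degrees` yields exactly the model-wise learning curve of figure 9.7; the validation minimum marks the optimal complexity range.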

9.1.5 Sample-wise learning curve

Finally, let’s vary the dataset size. We discussed this technique in detail in chapter 6, section 6.4. In short, we keep the validation set unchanged and probe different numbers of samples in the training set: 100, 1,000, 10,000, and so on. As in the model-wise learning curve, we train the model until it reaches convergence and plot training and validation learning curves.

If we extrapolate the validation metric, we can estimate how much new data we need to increase the metric by 1%; conversely, if we expect to gather N more samples of data, we can forecast what metric gain they’ll give.

Besides this extrapolation, the sample-wise learning curve also serves the purpose of revealing overfitting and underfitting. Specifically, what insights do we get by analyzing training and validation curves?

A sample-wise learning curve indicates whether the current bottleneck in the system is the amount of data or not. Understanding the metric dependency on dataset size guides our next steps in improving the system, which could include a combination of gathering more data and investing effort in feature engineering and model hyperparameter tuning.
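A minimal sketch of the sample-wise procedure on synthetic data (closed-form linear regression stands in for the real model):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy linear problem: y = Xw + noise (noise variance = 0.25).
n_total, n_features = 5000, 20
X = rng.normal(size=(n_total, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + rng.normal(0, 0.5, size=n_total)

# Fixed validation set; growing training subsets from the remaining pool.
X_va, y_va = X[:1000], y[:1000]
X_pool, y_pool = X[1000:], y[1000:]

sizes = [50, 100, 200, 500, 1000, 2000, 4000]
val_mse = []
for n in sizes:
    w_hat, *_ = np.linalg.lstsq(X_pool[:n], y_pool[:n], rcond=None)
    val_mse.append(np.mean((X_va @ w_hat - y_va) ** 2))

# With more data, validation MSE approaches the irreducible noise level (~0.25).
```

Plotting `val_mse` against `sizes` gives the sample-wise learning curve; once it flattens, the amount of data is no longer the bottleneck.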

9.1.6 Double descent

The bias–variance tradeoff runs like clockwork for classical ML models and deep neural networks of moderate size. However, for modern overparametrized deep neural networks, things get more perplexing.

There is a phenomenon called double descent, where the test error first gets better, then worse, and then better again. Surprisingly, researchers found different regimes of double descent corresponding to all three learning curves: epoch-wise, model-wise, and sample-wise.

It is still an open question what the mechanism behind the double descent is, why it occurs, and whether it means that overfitting for large deep neural networks is not an issue. The common hypothesis behind the double descent is the following:

For further details, we recommend reading “Deep Double Descent” by OpenAI ().

The modern scaling laws of large neural networks redefine our modeling strategies. The double-descent phenomenon may surprise those who are only familiar with the classical bias–variance tradeoff, and it is crucial to consider it when selecting a model for the system (especially when working with large convolutional networks and transformers), as well as when determining training and debugging procedures.

We only mention this here to highlight that most of the heuristics described previously are not solid laws set in stone. As with many things in ML design, they reveal signals—often useful ones—but may not be an exact fit for a particular problem.

9.2 Residual analysis

Certainly, coming up with new ideas is important, but even more important, to understand the results.
— Ilya Sutskever

Once we’ve ensured that an ML model has converged and is not plagued by underfitting or overfitting, the next step in model debugging is to perform residual analysis. This involves studying individual predictions made by the model compared to their corresponding true labels. Residual analysis involves calculating the differences between the predicted and actual values, known as residuals (see figure 9.8):

residual_i = y_pred_i – y_true_i

First, what exactly are residuals? In the narrow sense, residuals are simply the differences between predicted and true values in regression. In a wider sense, residuals can be any sample-wise differences or errors between model predictions and ground truth.

figure
Figure 9.8 Basic case: single regressor x. Each vertical line represents a residual error (e) or, simply, residual. Lines above the regressor have negative signs (the actual value is higher than the predicted value), and lines below the regressor have positive signs.

Thus, to align residuals with a specific loss, you may prefer using a single term of the loss sum as a pseudo-residual instead of raw differences: a squared error for mean squared error, a squared logarithmic error for root mean squared logarithmic error, or the negative log-probability of the true class for LogLoss.
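As a sketch, per-sample pseudo-residuals for a few common losses can be computed as follows (the function names are ours; the binary LogLoss term is the standard per-sample negative log-likelihood):

```python
import numpy as np

def residuals_mse(y_pred, y_true):
    """Per-sample squared errors: the terms of the MSE sum."""
    return (y_pred - y_true) ** 2

def residuals_rmsle(y_pred, y_true):
    """Per-sample squared log errors: the terms under RMSLE's root."""
    return (np.log1p(y_pred) - np.log1p(y_true)) ** 2

def residuals_logloss(p_pred, y_true):
    """Per-sample negative log-likelihood terms of binary LogLoss."""
    p_pred = np.clip(p_pred, 1e-15, 1 - 1e-15)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
```

Averaging any of these arrays recovers the corresponding loss, so sorting samples by pseudo-residual directly ranks their contribution to it.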

Often, residuals are associated only with regression and classification tasks. But what about residuals outside of regression and classification tasks? Looking more broadly, we will find equivalent tools in almost any ML-related task.

For instance, in the search engine context, true labels are often mappings from a search query to the top N most relevant documents or products. To calculate residuals in this context, we can measure the difference between the rank predicted by the model and the true rank of each item in the top N list (see figure 9.9).

figure
Figure 9.9 Example of residuals for the image segmentation problem (image source: )

In image segmentation, we can compute the differences between predicted and ground truth masks for each image, which yields 2D residuals that highlight which parts of an object are not covered by the mask or are covered incorrectly.

9.2.1 Goals of residual analysis

Residual analysis helps identify patterns in the errors made by the model so that we can detect clear directions for improving the system. Whereas the overall error of a model is usually represented by a single number, such as a loss or metric, the residual analysis does the opposite. It examines the raw differences between predictions and true labels, providing a more fine-grained diagnostic of the model’s performance.

Along with that, there are other main purposes of residual analysis:

These questions are the essence of the residual analysis, and finding answers to them closes the feedback loop of offline evaluation. The earlier we start capturing hard samples (and loss curves) for conducted experiments, the better. It is a good practice to collect 10 to 20 objects with the largest residuals after training as attendant artifacts. It is difficult to overestimate the value of thinking through such steps in a training pipeline in a design document.

In the project’s later phases, we transform it into a part of automatic reports for every trained model, along with model drift monitoring and data quality reports. Let’s say the metric increased in most cases but dropped a bit in a crucial segment. Depending on our policy, we may either reject this version or simply pay extra attention to this change in forthcoming iterations.

9.2.2 Model assumptions

Whichever model we train, we have prior assumptions about its predictions, biases, and residual distribution. The assumption check helps us ensure that we picked the right model, collected enough data, and engineered the right features. If the assumptions reveal an unexpected pattern, it may prompt the exploration of alternative solutions.

Again, from the design perspective, we need to figure out in advance what we assume to be true about the model’s predictions or, specifically, residuals and express it via corresponding unit tests. It will prevent unexpected model behavior after the next deployment. In the following chapter, we will dive into a more holistic overview of tests and their role in the training pipeline. Let’s explore two different examples to see how assumptions can be applied.

Example 1. Linear regression assumptions

Suppose we solve the demand forecasting problem using a simple linear regression. What are the key assumptions we make here?

After fitting our model, we check whether these assumptions hold true. Potential problems include

To check regression assumptions, we’ll examine the distribution of residuals. For this purpose, we plot residuals in four different ways and build so-called diagnostic plots (see figures 9.10 and 9.11):

figure
Figure 9.10 Four diagnostic plots for linear regression’s residual analysis in both cases (Case 1: Assumptions are met)
figure
Figure 9.11 Four diagnostic plots for linear regression’s residual analysis in both cases (Case 2: Assumptions are not met)
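The quantities behind such diagnostic plots can be computed without any plotting library. Below is a sketch for the residuals-vs-fitted data and a crude normal Q-Q check; the sampling-based normal quantiles are a shortcut to avoid implementing an inverse normal CDF in pure NumPy:

```python
import numpy as np

def diagnostics(y_pred, y_true):
    """Inputs for two classic diagnostic plots: residuals vs. fitted
    values, and a normal Q-Q comparison of standardized residuals."""
    resid = y_pred - y_true
    std_resid = (resid - resid.mean()) / resid.std()

    # Q-Q data: sorted standardized residuals vs. theoretical normal quantiles,
    # approximated here by quantiles of a large seeded normal sample.
    n = len(resid)
    probs = (np.arange(1, n + 1) - 0.5) / n
    rng = np.random.default_rng(0)
    theo = np.quantile(rng.normal(size=100_000), probs)
    qq_corr = np.corrcoef(np.sort(std_resid), theo)[0, 1]
    return resid, std_resid, qq_corr
```

A `qq_corr` close to 1 supports the normality assumption; a visibly lower value signals skewness or heavy tails worth plotting in full.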

Sometimes it is beneficial to force the model to follow our assumptions more strictly by incorporating our prior knowledge directly into the model.

Example 2. Attention plot

Imagine you’re the product owner of a banking application, and your next big update is to add a voice assistant. After looking into your “inner circle” of specialists, you hire Stacy, a world-class master in text-to-speech (TTS) systems. After several weeks of work, Stacy builds a first version of a voice synthesis system.

In the realm of TTS tasks, there exists a fundamental assumption: the order of characters in a text should progress linearly over time in the corresponding audio segments. When we read a text, it’s natural to assume that the text’s position aligns closely with the audio we hear. This stands in contrast to other sequence-to-sequence tasks, like machine translation, where an attention module is necessary to resolve word alignment between languages with different syntax or token ordering, such as English and Chinese.

To assess the validity of this assumption, Stacy employs an attention plot—a visual representation that depicts the activation map between audio frames (x-axis) and characters (y-axis). By observing this plot at regular intervals during training, Stacy aims to evaluate how closely it resembles a nearly diagonal matrix.

To force the attention matrix to exhibit a near-diagonal pattern, Stacy employs a technique known as guided attention. Whenever the attention matrix deviates significantly from the diagonal, it is penalized using an auxiliary loss. This heuristic not only accelerates the training process but also steers the model toward meaningful solutions that align with the underlying assumption from the start (see figure 9.12).

figure
Figure 9.12 Attention plot evolution through training without (left) and with (right) guided attention loss

The deviation of the attention plot from the diagonal matrix is nothing but residuals. Residuals of moderate size are appropriate: people speak some characters more quickly than others; therefore, the plot will not be a perfectly straight line. However, large residuals reveal the specific sounds or character combinations that the model can’t learn well.

Armed with this knowledge, Stacy could shape a forthcoming data-gathering strategy to address the model’s difficulties and improve its performance.

To learn more about the architecture of a typical TTS network, including details on the construction of the attention module in this case, we recommend studying the paper “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention” ().

9.2.3 Residual distribution

If none of the model’s assumptions are met, it is a sign to adjust the training pipeline, collect more data, engineer new features, or explore alternative models and losses. But how do we determine the necessary improvement steps? The guess-and-check approach may seem tempting, but we don’t recommend it for exploring the solution space.

Let’s, for example, violate the normality assumption for the same linear regression (figures 9.13–9.15).

figure
Figure 9.13 In Case 1, we observe a normal residual distribution when linearity assumptions are met.
figure
Figure 9.14 Case 2 displays nonnormal residual distribution when linearity assumptions are not met due to the log-normal distribution of the target (there is a clear skewness in the distribution).
figure
Figure 9.15 In Case 3, there’s a nonnormal residual distribution for linear regression when linearity assumptions are not met due to the nonmonotonic dependence of the target from the regressors.

In this example, we are lucky and instantly see what’s wrong with the model:

9.2.4 Fairness of residuals

In ML, fairness is an indicator of inequality among data. How do individual samples contribute to a loss or metric? By what cost do we increase the metric? Does the new model add inequality among residuals or reduce it? Basically, fairness is another term for defining the skewness of the residual distribution.

Not every metric change is meant to be equal. Some improvements are distributed uniformly among all samples, while others add significant growth in one stratum and provoke a decrease in the rest. The concept of “fairness” in residual analysis pushes us toward a more holistic model evaluation procedure, far beyond estimating single-valued metrics.

To get a better understanding of what fairness is, consider figure 9.16.

figure
Figure 9.16 Fair vs. unfair residual distribution

In this simplified example, we have two models. Both decrease the mean absolute error (MAE) by 20%. However, we prefer to deploy the first model because it reduces absolute residuals uniformly among all 10 samples. In contrast, the second model drastically improves metrics on one half and reduces them on the other half. In this case, we add inequality to the residual distribution, so we call this distribution unfair.

One way to assess fairness quantitatively, instead of relying purely on visualizations, is to use the Gini index from economics (see figure 9.17). To compute it, the residuals are sorted by their absolute values, and the cumulative proportion of the total absolute error is plotted against the cumulative proportion of the number of residuals, forming a Lorenz curve.

figure
Figure 9.17 Graphical representation of the Gini coefficient: the graph shows that the Gini coefficient equals the area marked A divided by the sum of the areas marked A and B—that is, Gini = A/(A + B). It is also equal to 2A and to 1 − 2B because A + B = 0.5 (since the axes scale from 0 to 1).

For total fairness (Gini = 0.0), all residuals contribute equally to the total error. For total inequity (Gini = 1.0), a single residual is stealing the covers. These are two extremes that you will rarely face in real life; the common values are always somewhere in between.
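A sketch of the computation on absolute residuals, using the standard discrete Gini formula:

```python
import numpy as np

def gini(residuals):
    """Gini coefficient of absolute residuals: 0 means every sample
    contributes equally to the total error; values near 1 mean a few
    samples dominate it."""
    x = np.sort(np.abs(np.asarray(residuals, dtype=float)))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    # Equivalent to A / (A + B) for the Lorenz curve of sorted |residuals|.
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)
```

Tracking this number across model versions makes "did we add inequality to the residuals?" a one-line check rather than a visual judgment call.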

There are two main reasons to care about fairness. First, we want the model to have high performance among all users, items, or other entities it will be deployed on, not just a fraction of them. This also includes reducing overall inequality in residuals step by step in each iteration of the system.

Second, we want to not only improve the overall error numbers but also avoid degrading individual residuals. If we increase overall search engine quality by 5% while reducing prediction quality by 20% for some segment of users, we damage those users’ experience. The gains we get by improving the average quality may easily be neutralized by the increased churn of these users.

In the long term, we chase close-to-equal growth among all strata. There are exceptions to every rule, and fairness is not that critical for every single project and every metric. We should pay attention to the error cost produced by residual distribution tails. This should define our tradeoff between average and sample-wise improvements.

9.2.5 Underprediction and overprediction

In regression tasks, we often split residuals by sign—positive residuals indicate overprediction (predicted values are greater than true values), while negative residuals indicate underprediction.

Depending on the problem we are set to solve, one or another bias of the model is preferred. For instance, if we are building a demand forecasting system, missed profit is a less desirable outcome than moderate overstocks. On the other hand, if we predict a client’s creditworthiness for a bank, we are better off underestimating it than overestimating it.

Therefore, the cost of an error is often asymmetric, and it should tell us residuals of which sign and size we should pay attention to the most.
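One common way to encode such asymmetric costs is the pinball (quantile) loss; the sketch below uses q = 0.8 purely as an illustration of punishing underprediction harder than overprediction:

```python
import numpy as np

def pinball_loss(y_pred, y_true, q=0.8):
    """Charges q per unit of underprediction and (1 - q) per unit of
    overprediction; q > 0.5 punishes underprediction more, e.g., when
    missed sales hurt more than moderate overstock."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))
```

Training or evaluating against this loss instead of symmetric MAE bakes the asymmetric error cost directly into the model’s bias.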

9.2.6 Elasticity curves

One of the demand-specific error analysis tools is the elasticity plot. It is not a universal tool applicable for any ML system, but because it is crucial for pricing-related applications, it is worth our attention. Although we discussed most of the curve-shaped ways of analysis earlier, this example belongs here as it can be seen as a special case of residual analysis.

One of the core model assumptions behind demand forecasting is price–demand dependency—the higher the price, the lower the demand, and vice versa. This is true for almost all kinds of products (except some special cases like Veblen and Giffen goods, if you recall Microeconomics 101).

An elasticity plot is a special case for a more generic concept called a partial dependence plot, in which we vary some features and analyze how the predicted outcome changes. It is used for model interpretability (which we will cover in chapter 11).

There are two ways to plot the elasticity curve for our model:

As we mentioned before, in a perfect case, the plot demonstrates an inverse dependency: the higher the price, the lower the sales. However, if this plot demonstrates the opposite, is noisy (partly or fully nonmonotonic), or has any other controversial patterns, it will signal one of the following:

The “better” the plot (the more monotonic in the negative direction), the more we can rely on the model’s predictions for these SKUs. If the elasticity is “bad,” it signals that the forecast is not reliable for these “hard” SKUs and should be further investigated rather than deployed (see figures 9.18 and 9.19).

figure
Figure 9.18 Theoretical example of elastic demand
figure
Figure 9.19 Theoretical example of inelastic demand

It is also helpful to plot elasticity not only for predicted sales but for actual sales as well, to see whether this particular SKU reveals distinct elasticity in the first place. If it doesn’t, we should not expect it for the predicted demand.

If you feel like you need more context on the price elasticity concept, we recommend the article “Forecasting with Price Elasticity of Demand” ().
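An elasticity sweep can be sketched as a partial-dependence-style loop over the price feature, holding all other features fixed; `demand_model` below is a made-up stand-in for a trained model, not a real one:

```python
import numpy as np

def elasticity_curve(predict_fn, features, price_grid):
    """Re-predict demand for each candidate price while all other
    features stay fixed (a partial-dependence-style sweep)."""
    preds = []
    for price in price_grid:
        f = dict(features, price=price)
        preds.append(predict_fn(f))
    return np.array(preds)

def is_monotonic_decreasing(values, tol=1e-9):
    """Sanity check: does predicted demand fall as price rises?"""
    return bool(np.all(np.diff(values) <= tol))

# Hypothetical stand-in for a trained demand model, for illustration only.
def demand_model(f):
    return 100.0 * np.exp(-0.05 * f["price"]) * (1.0 + 0.1 * f["is_weekend"])
```

SKUs whose curve fails the monotonicity check are exactly the “hard” ones whose forecasts deserve investigation before deployment.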

A friend of ours recently told his own campfire story about using an elasticity curves application. He had been working on a pricing problem, and their solution was effectively a glorified elasticity plot. They built a gradient-boosting model that predicted the number of sold items using various features, including price-based ones and estimated sales for different possible prices. Their first model revealed a surprising pattern: instead of having a smooth look, the plot had a visible “ladder” of steps. After deeper analysis, they realized that the origin of the steps was related to the features they used; continuous variables were split into buckets with low cardinality (e.g., only 256 buckets for all possible prices across all the items on the marketplace), and it limited the model’s sensitivity. After the number of buckets was increased, the model was able to capture more detailed patterns, and the elasticity curve became smooth, improving the overall system performance.

9.3 Finding commonalities in residuals

Now that we’ve examined residual distribution as a whole, it’s time to investigate the patterns and trends in residual subgroups. To do that, we approach the problem from both ends (see figure 9.20):

figure
Figure 9.20 Two approaches to finding commonalities in residuals: picking subsets by residuals rank or by sample attributes

Grouping residuals by their values includes best-case and worst-case analysis; it also covers overprediction and underprediction problems. Grouping residuals by their properties produces group analysis and corner case analysis.

9.3.1 Worst/best-case analysis

The goal of worst/best-case analysis is to define typical cases where the model works fine and where we should avoid making decisions based on this model. What do residuals have in common in these extreme cases?

9.3.2 Adversarial validation

If manual worst-case analysis doesn’t provide new insights, a “machine learning model for analyzing machine learning models” could help. In chapter 7, we discussed a concept called adversarial validation. It was derived from ML competitions and is used to check whether the distributions of two datasets differ: often we concatenate train and test datasets, labeling their samples 0 and 1.

Adversarial validation can be easily transferred to the rails of residual analysis: we set 0 for “good” samples of data and 1 for “bad” samples (e.g., taking the top-N% biggest residuals in the second case). We should try different thresholds for our particular set.

The rest of the algorithm is similar: we fit a simple classifier (e.g., logistic regression) on these labels and calculate the receiver operating characteristic area under the curve (ROC AUC). If two classes are separable (area under the curve is significantly greater than 0.5), then we analyze the model’s weights, and this gives us a hint of which exact features best distinguish our “worst” cases from others.
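A self-contained sketch of the whole procedure, with plain-NumPy logistic regression and rank-based AUC standing in for library implementations:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; returns weights."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def adversarial_validation(X, residuals, top_frac=0.1):
    """Label the top-N% largest |residuals| as 1, the rest as 0,
    fit a classifier, and report AUC plus feature weights."""
    thresh = np.quantile(np.abs(residuals), 1 - top_frac)
    y = (np.abs(residuals) >= thresh).astype(float)
    w = fit_logreg(X, y)
    scores = np.hstack([X, np.ones((len(X), 1))]) @ w
    return roc_auc(scores, y), w[:-1]  # drop bias from reported weights
```

An AUC well above 0.5 means the “worst” samples are distinguishable, and the largest weights point at the features that distinguish them.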

Sometimes we find no easily defined patterns during worst-case analysis. This is fine; it means we have already picked the low-hanging fruit in the model improvement space.

9.3.3 Variety of group analysis

Group analysis enables the identification of distinct patterns and trends within the residuals of various groups, segments, classes, cohorts, or clusters. For instance, in both binary and multiclass classification scenarios, one effective approach involves splitting the residuals by classes (i.e., by target variable), allowing for separate analysis of the residuals in each group.

Many typical applications processing tabular data, such as fraud detection systems, rely on grouping samples based on specific characteristics, such as geography or traffic source. By analyzing the residuals within each segment, it becomes possible to uncover common biases present in the model’s predictions. These insights can then guide further model refinement by incorporating more relevant features or adjusting the weights of existing features.

When dealing with the data in the form of text, images, or videos, there may be no distinct cohorts or groups by default. In such cases, an alternative method involves manually classifying a set of N residuals and assigning labels to each issue encountered (e.g., identifying images that are too dark or blurry or flagging texts with specific wording or style). This process allows for the discovery of specific problematic clusters where the model underperforms. Consequently, it provides guidance on what type of data should be collected next to improve the system.

9.3.4 Corner-case analysis

Corner-case analysis aims to test the model in rare circumstances. Typically, we would like to have a benchmark, a fixed set of already-captured corner cases, to quickly examine the behavior of each new model.

Here are some ideas for what we can check during corner-case analysis:

While the best/worst-case analyses ask which data our model digests excellently or badly, the corner-case analysis and cohort analysis ask about the model’s performance on predefined subsets of data.

9.4 Design document: Error analysis

Because we’re convinced that error analysis should be among the essential elements of ML system design, we will include this phase in both our design documents.

9.4.1 Error analysis for Supermegaretail

We start off with Supermegaretail, where we will suggest the approach that will help the company achieve its main goal—to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding out-of-stock situations.

9.4.2 Error analysis for PhotoStock Inc.

Now we get back to PhotoStock Inc., which requires a modern search tool able to find the most relevant shots based on customers’ text queries while providing excellent performance and displaying the most relevant images in stock.

Summary
