Book: Machine Learning System Design
Previous: Part 3 Intermediate steps
Next: 10 Training pipelines

9 Error analysis

This chapter covers

  • Learning curve analysis
  • Residual analysis
  • Finding commonalities in residuals

Once we’ve assembled the initial building blocks, which include gathering the first dataset, picking the metrics, defining the evaluation procedure, and training the baseline, we are ready to start the iterative adjustment process. Just as backpropagation in neural networks calculates the direction of the fastest loss reduction and passes it backward from layer to layer, error analysis finds the fastest improvement for the whole system.

Error analysis steps up as the compass that guides the iterative updates of your system. It helps you understand the error dynamics during the training phase (learning curve analysis) and the distribution of errors after the prediction phase (residual analysis). By analyzing these errors, you can identify commonalities, trends, and patterns that inform improvements to your machine learning (ML) system. In this chapter, we will examine its crucial stages and types and provide examples that we hope will give you a better understanding of the subject.

Error analysis is often skipped when ML systems are designed for a reason that seems somewhat legit at first glance—this step is not part of building the system per se. However, time spent on error analysis is always a good investment as it reveals weak spots and suggests ways of improving the system. Leaving this step out of this book would be a huge mistake from our side.

9.1 Learning curve analysis

Learning curve analysis evaluates the learning process by plotting and analyzing the learning curve, showing the relationship between the model’s training performance and the amount of training data used. Learning curve analysis is designed to answer two vital questions:

If both questions lead to negative answers, there is no need for the rest of the analysis.

Before we get into details, what is a learning curve? The term was coined in behavioral psychology, where it is used to display the learning progress of a person or animal observed over time (see figure 9.1). For instance, we may analyze the number of mistakes made by a subject in every new test iteration or study how much time it takes for a mouse to find a path through the labyrinth compared to the trial number.

figure
Figure 9.1 Basic representation of a learning curve

In ML, a learning curve is essentially a graphical representation that shows the dependency of a chosen metric on a specific numerical property, such as the number of iterations, dataset size, or model complexity. Let’s do a brief breakdown of all three properties:

These properties reveal the three most common types of learning curves. Before diving into each, we should recall “the quest for the grail of machine learning,” the overfitting and underfitting problem, sometimes referred to as the bias–variance tradeoff (see figure 9.2).

figure
Figure 9.2 With an increasing number of model parameters, training error tends to become lower and lower while minimizing bias. At the same time, the model variance increases, providing us with a U-shaped validation error.

9.1.1 Overfitting and underfitting

Overfitting happens when a model shows great performance on training data but poor performance on unseen data. Usually, this happens when it learns the training data so well that it becomes too specialized and fails to generalize, focusing too much on insignificant details and patterns that are not present in the new data.

On the other hand, underfitting occurs when the model is too simple and misses some important relationships between features and the target variable, resulting in poor performance on both the training data and new data.

Both are strongly related to the bias–variance tradeoff, which is the balance between the model’s complexity and the amount of input data. The greater the model’s capacity to capture useful signals from data, the lower the bias and the higher the risk of overfitting. On the other hand, reducing variance requires decreasing complexity, thus leading to a more biased model.

Bias is an error caused by the low capacity of the model to capture useful signals in the data. In other words, the model is biased toward its simplified assumptions about the data. When the model is biased, we call it underfitting.

Variance is an error caused by the model’s high sensitivity to small fluctuations in the training set. The model is poorly generalized on new data, which, in terms of the model’s parameters, is highly varied from what it has seen in the training set. Often, high variance is a primary reason behind overfitting (assuming nothing is broken in other parts of the system).

A good learning algorithm is expected to minimize both bias and variance simultaneously. The bias–variance tradeoff, however, vividly demonstrates that reducing variance often involves increasing bias and vice versa. The quest here is to find the right balance between the two. This is when learning-curve analysis can guide us.

Keep in mind that the model’s redundant complexity (i.e., high variance) is not the only reason behind overfitting. Some other possible scenarios include

Regardless of a given case, learning curves are an effective tool to detect underfitting and overfitting. Armed with the knowledge about overfitting and underfitting, we are ready to go through different types of curves and the hints they give us in this quest.

9.1.2 Loss curve

A loss curve (also referred to as a convergence curve or learning curve) based on learning iterations is the first thing that comes to mind for ML engineers when they hear the term “learning curve.” It shows how much the algorithm is improving as it puts more and more learning effort into the task.

The curve plots the loss (or metric) on the vertical axis and the number of training iterations (or epochs) on the horizontal axis. As the model trains, the loss should decrease, ideally forming a downward slope toward the bottom of the curve.

In contrast to learning curves where the x-axis is the dataset size or the model complexity (which we will talk about soon), the iteration-wise curve requires only one training run, which makes tracking it practical even for large datasets where a single training run takes hours or even days. The loss curve helps keep your finger on the pulse all the way until the end of training. If you’ve run just 10 training epochs out of 200, you can already get insights on whether the loss value is progressing as expected or if there are issues that make further training pointless.

Be sure you track loss curves, collect them for all conducted experiments, and make them available for future analysis. Incorporating loss-curve monitoring into the training pipeline in the early stages of its building is a valuable one-time effort because you will need these insights for all future experiments, and it is helpful for the overall pipeline’s reproducibility (we’ll have an in-depth look into this subject in chapter 10).
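As a sketch of what such tracking can look like, the `LossTracker` class below is our own illustration, not part of any particular framework; real pipelines would typically delegate this to an experiment-tracking tool:

```python
import csv

class LossTracker:
    """Records per-epoch train/validation loss so curves can be
    plotted and compared across experiments later."""

    def __init__(self):
        self.history = []  # list of (epoch, train_loss, val_loss)

    def log(self, epoch, train_loss, val_loss):
        self.history.append((epoch, train_loss, val_loss))

    def is_diverging(self, window=3):
        """Heuristic: validation loss rose over each of the last
        `window` epochs while training loss kept falling."""
        if len(self.history) <= window:
            return False
        recent = self.history[-(window + 1):]
        val_up = all(a[2] < b[2] for a, b in zip(recent, recent[1:]))
        train_down = all(a[1] > b[1] for a, b in zip(recent, recent[1:]))
        return val_up and train_down

    def save(self, path):
        """Persist the curve as a CSV artifact next to the model."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "train_loss", "val_loss"])
            writer.writerows(self.history)
```

Saving the history as an artifact for every run keeps the curves of past experiments comparable, which is exactly the reproducibility payoff mentioned above.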

9.1.3 Interpreting loss curves

There are several main patterns in the behavior of loss curves that deviate from what is expected by design. Let’s briefly analyze each pattern we may encounter while debugging the system: what conclusions can be drawn and what steps should be taken to fix the detected issues.

Pattern 1

Pattern 1 shows that the loss curve diverges (not converging to the desired loss value and oscillating instead; see figure 9.3). How can we try to make the training process more stable? Consider the following:

figure
Figure 9.3 Loss is oscillating, which demonstrates a lack of convergence.

Pattern 2

Pattern 2 shows a loss explosion or trending toward NaN (not a number) (see figure 9.4). This behavior indicates computational problems: either the gradient explodes (in which case solutions like gradient clipping, a lower learning rate, or different weight initialization techniques may help) or mathematical problems emerge (e.g., division by zero, the logarithm of zero or negative numbers, or NaNs in data), which usually indicates an error in implementation or a lack of data preprocessing.

figure
Figure 9.4 A model was converging until something went wrong.
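Both remedies are easy to sketch outside of any specific framework; the function names below are our own, and a real training loop would use its framework’s built-in equivalents (e.g., native gradient clipping):

```python
import numpy as np

def guard_loss(loss):
    """Fail fast instead of silently training on NaN/inf losses."""
    if not np.isfinite(loss):
        raise FloatingPointError(f"non-finite loss encountered: {loss}")
    return loss

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm,
    a standard remedy for exploding gradients."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Checking the loss every step costs almost nothing and turns a silent NaN explosion into an immediate, debuggable failure.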

Pattern 3

Pattern 3 shows that the loss decreases, but the metric is contradictory (see figure 9.5). If the model continues improving based on the loss but the metric is stuck, it may signal that the chosen metric is inadequate for the problem or poorly implemented. Typically, it happens in classification and other related tasks where we use metrics that include a certain threshold.

figure
Figure 9.5 Loss decreases while the metric stays constantly low.

Pattern 4

Pattern 4 shows a converging training curve with unexpected loss values. While the shape of the training curve appears promising, the observed values are perplexing. To identify such anomalies in advance, it is advisable to do a sanity check by running a simple unit test on a single batch that asserts that the loss falls within the expected range. Often, the reason behind such an issue lies in scaling transformations (e.g., normalization of an image or a mask in segmentation).
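For instance, a K-class classifier at random initialization should be close to guessing uniformly, so its cross-entropy on a single batch should be near ln K. A sketch of such a sanity check (the helper names are ours):

```python
import math
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def test_initial_loss_in_expected_range(num_classes=10, batch_size=32):
    """At initialization a classifier should be close to guessing
    uniformly, so its loss should be near ln(num_classes)."""
    rng = np.random.default_rng(0)
    # Stand-in for an untrained model: uniform predicted probabilities.
    probs = np.full((batch_size, num_classes), 1.0 / num_classes)
    labels = rng.integers(0, num_classes, size=batch_size)
    loss = cross_entropy(probs, labels)
    expected = math.log(num_classes)
    assert abs(loss - expected) < 0.1, f"loss {loss:.3f}, expected ~{expected:.3f}"
    return loss
```

If the real model’s first-batch loss is far from this value, a scaling or preprocessing bug is a likely suspect.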

Pattern 5

figure
Figure 9.6 Loss is decreasing for training but not validation, reflecting potential overfit.

Pattern 5 shows that the training loss decreases while the validation loss increases (see figure 9.6). This is a classic textbook example of overfitting due to high variance. In these cases, you should restrict the model’s capacity, either by reducing its complexity directly or by increasing regularization.

9.1.4 Model-wise learning curve

After we ensure the model converges and the training loss reaches the plateau with no drastic overfitting or underfitting, we can wrap up our learning-curve analysis and move on. This is especially relevant at the stage of initial deployment.

However, if we face overfitting/underfitting issues or there is enough time to experiment with the optimal model size, that is when the second type of learning curve comes into play (see figure 9.7):

  1. First, we pick a hyperparameter that represents a varying model complexity. Again, it may be the tree depth in gradient boosting, a regularization strength, a number of features, or a number of layers in a deep neural network.
  2. We define a grid for this hyperparameter (e.g., 2, 3, 4, …, 16 for the tree depth; 10⁻², 10⁻¹, 1, 10, 10², 10³ for the regularization term).
  3. We train each model until convergence and capture the final loss/metric values.
figure
Figure 9.7 Finding the optimal model complexity based on learning curves

Now we map these values on the vertical axis and corresponding hyperparameter values on the horizontal axis. This learning curve helps us easily see what range of model complexity (determined by this hyperparameter) is optimal for the given data.
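The whole procedure can be sketched end to end with polynomial degree standing in for model complexity (a toy substitute for tree depth or network size; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: quadratic signal plus noise.
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, size=200)
x_tr, y_tr = x[:15], y[:15]        # small train set makes overfitting visible
x_va, y_va = x[15:], y[15:]

degrees = list(range(1, 13))       # the "model complexity" grid
train_err, val_err = [], []
for d in degrees:
    coef = np.polyfit(x_tr, y_tr, d)   # "train until convergence" (closed form)
    train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    val_err.append(np.mean((np.polyval(coef, x_va) - y_va) ** 2))

# Training error only falls with capacity; validation error is U-shaped.
best_degree = degrees[int(np.argmin(val_err))]
```

Plotting `train_err` and `val_err` against `degrees` yields exactly the model-wise learning curve of figure 9.7; the validation minimum marks the optimal complexity range.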

9.1.5 Sample-wise learning curve

Finally, let’s vary the dataset size. We discussed this technique in detail in chapter 6, section 6.4. In short, we keep the validation set unchanged and probe different numbers of samples in the training set: 100, 1,000, 10,000, and so on. As in the model-wise learning curve, we train the model until it reaches convergence and plot training and validation learning curves.

If we extrapolate the validation metric, we can estimate how much new data we need to increase the metric by 1%; conversely, if we expect to gather N more samples of data, we can forecast what metric gain they’ll give.

Besides this extrapolation, the sample-wise learning curve also serves the purpose of revealing overfitting and underfitting. Specifically, what insights do we get by analyzing training and validation curves?

A sample-wise learning curve indicates whether the current bottleneck in the system is the amount of data or not. Understanding the metric dependency on dataset size guides our next steps in improving the system, which could include a combination of gathering more data and investing effort in feature engineering and model hyperparameter tuning.
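A minimal sketch of the sample-wise procedure on synthetic data (closed-form linear regression stands in for the real model):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy linear problem: y = Xw + noise (noise variance = 0.25).
n_total, n_features = 5000, 20
X = rng.normal(size=(n_total, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + rng.normal(0, 0.5, size=n_total)

# Fixed validation set; growing training subsets from the remaining pool.
X_va, y_va = X[:1000], y[:1000]
X_pool, y_pool = X[1000:], y[1000:]

sizes = [50, 100, 200, 500, 1000, 2000, 4000]
val_mse = []
for n in sizes:
    w_hat, *_ = np.linalg.lstsq(X_pool[:n], y_pool[:n], rcond=None)
    val_mse.append(np.mean((X_va @ w_hat - y_va) ** 2))

# With more data, validation MSE approaches the irreducible noise level (~0.25).
```

Plotting `val_mse` against `sizes` gives the sample-wise learning curve; once it flattens, the amount of data is no longer the bottleneck.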

9.1.6 Double descent

The bias–variance tradeoff runs like clockwork for classical ML models and deep neural networks of moderate size. However, for modern overparametrized deep neural networks, things get more perplexing.

There is a phenomenon called double descent, where the test error first gets better, then worse, and then better again. Surprisingly, researchers found different regimes of double descent corresponding to all three learning curves: epoch-wise, model-wise, and sample-wise.

It is still an open question what the mechanism behind the double descent is, why it occurs, and whether it means that overfitting for large deep neural networks is not an issue. The common hypothesis behind the double descent is the following:

For further details, we recommend reading “Deep Double Descent” by OpenAI ().

The modern scaling laws of large neural networks redefine our modeling strategies. The double-descent phenomenon may surprise those who are only familiar with the classical bias–variance tradeoff, and it is crucial to consider it when selecting a model for the system (especially when working with large convolutional networks and transformers), as well as when determining training and debugging procedures.

We only mention this here to highlight that most of the heuristics described previously are not solid laws set in stone. As with many things in ML design, they reveal signals—often useful ones—but may not be an exact fit for a particular problem.

9.2 Residual analysis

Certainly, coming up with new ideas is important, but even more important, to understand the results.
— Ilya Sutskever

Once we’ve ensured that an ML model has converged and is not plagued by underfitting or overfitting, the next step in model debugging is to perform residual analysis. This involves studying individual predictions made by the model compared to their corresponding true labels. Residual analysis involves calculating the differences between the predicted and actual values, known as residuals (see figure 9.8):

residual_i = y_pred_i – y_true_i

First, what exactly are residuals? In the narrow sense, residuals are simply the differences between predicted and true values in regression. In a wider sense, residuals can be any sample-wise differences or errors between model predictions and ground truth.

figure
Figure 9.8 Basic case: single regressor x. Each vertical line represents a residual error (e) or, simply, residual. Lines above the regressor have negative signs (the actual value is higher than the predicted value), and lines below the regressor have positive signs.

Thus, to align residuals with a specific loss, you may prefer using a single term of the loss sum as a pseudo-residual instead of raw differences: a squared error for mean squared error, a squared logarithmic error for root mean squared logarithmic error, or the negative log-probability of the true class for LogLoss.
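As a sketch, per-sample pseudo-residuals for a few common losses can be computed as follows (the function names are ours; the binary LogLoss term is the standard per-sample negative log-likelihood):

```python
import numpy as np

def residuals_mse(y_pred, y_true):
    """Per-sample squared errors: the terms of the MSE sum."""
    return (y_pred - y_true) ** 2

def residuals_rmsle(y_pred, y_true):
    """Per-sample squared log errors: the terms under RMSLE's root."""
    return (np.log1p(y_pred) - np.log1p(y_true)) ** 2

def residuals_logloss(p_pred, y_true):
    """Per-sample negative log-likelihood terms of binary LogLoss."""
    p_pred = np.clip(p_pred, 1e-15, 1 - 1e-15)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
```

Averaging any of these arrays recovers the corresponding loss, so sorting samples by pseudo-residual directly ranks their contribution to it.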

Often, residuals are associated only with regression and classification tasks. But what about residuals outside of regression and classification tasks? Looking more broadly, we will find equivalent tools in almost any ML-related task.

For instance, in the search engine context, true labels are often mappings from a search query to the top N most relevant documents or products. To calculate residuals in this context, we can measure the difference between the rank predicted by the model and the true rank of each item in the top N list (see figure 9.9).

figure
Figure 9.9 Example of residuals for the image segmentation problem (image source: )

In image segmentation, we can compute the differences between predicted and ground truth masks for each image, which yields 2D residuals that highlight which parts of an object are not covered by the mask or are covered incorrectly.

9.2.1 Goals of residual analysis

Residual analysis helps identify patterns in the errors made by the model so that we can detect clear directions for improving the system. Whereas the overall error of a model is usually represented by a single number, such as a loss or metric, the residual analysis does the opposite. It examines the raw differences between predictions and true labels, providing a more fine-grained diagnostic of the model’s performance.

Along with that, there are other main purposes of residual analysis:

These questions are the essence of the residual analysis, and finding answers to them closes the feedback loop of offline evaluation. The earlier we start capturing hard samples (and loss curves) for conducted experiments, the better. It is a good practice to collect 10 to 20 objects with the largest residuals after training as attendant artifacts. It is difficult to overestimate the value of thinking through such steps in a training pipeline in a design document.

In the project’s later phases, we transform it into a part of automatic reports for every trained model, along with model drift monitoring and data quality reports. Let’s say the metric increased in most cases but dropped a bit in a crucial segment. Depending on our policy, we may either reject this version or simply pay extra attention to this change in forthcoming iterations.

9.2.2 Model assumptions

Whichever model we train, we have prior assumptions about its predictions, biases, and residual distribution. The assumption check helps us ensure that we picked the right model, collected enough data, and engineered the right features. If the assumptions reveal an unexpected pattern, it may prompt the exploration of alternative solutions.

Again, from the design perspective, we need to figure out in advance what we assume to be true about the model’s predictions or, specifically, residuals and express it via corresponding unit tests. It will prevent unexpected model behavior after the next deployment. In the following chapter, we will dive into a more holistic overview of tests and their role in the training pipeline. Let’s explore two different examples to see how assumptions can be applied.

Example 1. Linear regression assumptions

Suppose we solve the demand forecasting problem using a simple linear regression. What are the key assumptions we make here?

After fitting our model, we check whether these assumptions hold true. Potential problems include

To check regression assumptions, we’ll examine the distribution of residuals. For this purpose, we plot residuals in four different ways and build so-called diagnostic plots (see figures 9.10 and 9.11):

figure
Figure 9.10 Four diagnostic plots for linear regression’s residual analysis in both cases (Case 1: Assumptions are met)
figure
Figure 9.11 Four diagnostic plots for linear regression’s residual analysis in both cases (Case 2: Assumptions are not met)
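The quantities behind such diagnostic plots can be computed without any plotting library. Below is a sketch for the residuals-vs-fitted data and a crude normal Q-Q check; the sampling-based normal quantiles are a shortcut to avoid implementing an inverse normal CDF in pure NumPy:

```python
import numpy as np

def diagnostics(y_pred, y_true):
    """Inputs for two classic diagnostic plots: residuals vs. fitted
    values, and a normal Q-Q comparison of standardized residuals."""
    resid = y_pred - y_true
    std_resid = (resid - resid.mean()) / resid.std()

    # Q-Q data: sorted standardized residuals vs. theoretical normal quantiles,
    # approximated here by quantiles of a large seeded normal sample.
    n = len(resid)
    probs = (np.arange(1, n + 1) - 0.5) / n
    rng = np.random.default_rng(0)
    theo = np.quantile(rng.normal(size=100_000), probs)
    qq_corr = np.corrcoef(np.sort(std_resid), theo)[0, 1]
    return resid, std_resid, qq_corr
```

A `qq_corr` close to 1 supports the normality assumption; a visibly lower value signals skewness or heavy tails worth plotting in full.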

Sometimes it is beneficial to force the model to follow our assumptions more strictly by incorporating our prior knowledge directly into the model.

Example 2. Attention plot

Imagine you’re the product owner of a banking application, and your next big update is to add a voice assistant. After looking into your “inner circle” of specialists, you hire Stacy, a world-class master in text-to-speech (TTS) systems. After several weeks of work, Stacy builds a first version of a voice synthesis system.

In the realm of TTS tasks, there exists a fundamental assumption: the order of characters in a text should progress linearly over time in the corresponding audio segments. When we read a text, it’s natural to assume that the text’s position aligns closely with the audio we hear. This stands in contrast to other sequence-to-sequence tasks, like machine translation, where an attention module is necessary to resolve word alignment between languages with different syntax or token ordering, such as English and Chinese.

To assess the validity of this assumption, Stacy employs an attention plot—a visual representation that depicts the activation map between audio frames (x-axis) and characters (y-axis). By observing this plot at regular intervals during training, Stacy aims to evaluate how closely it resembles a nearly diagonal matrix.

To force the attention matrix to exhibit a near-diagonal pattern, Stacy employs a technique known as guided attention. Whenever the attention matrix deviates significantly from the diagonal, it is penalized using an auxiliary loss. This heuristic not only accelerates the training process but also steers the model toward meaningful solutions that align with the underlying assumption from the start (see figure 9.12).

figure
Figure 9.12 Attention plot evolution through training without (left) and with (right) guided attention loss

The deviation of the attention plot from the diagonal matrix is nothing but residuals. Residuals of moderate size are appropriate: people speak some characters more quickly than others; therefore, the plot will not be a perfectly straight line. However, large residuals reveal the specific sounds or character combinations that the model can’t learn well.

Armed with this knowledge, Stacy could shape a forthcoming data-gathering strategy to address the model’s difficulties and improve its performance.

To learn more about the architecture of a typical TTS network, including details on the construction of the attention module in this case, we recommend studying the paper “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention” ().

9.2.3 Residual distribution

If none of the model’s assumptions are met, it is a sign to adjust the training pipeline, collect more data, engineer new features, or explore alternative models and losses. But how do we determine the necessary improvement steps? The guess-and-check approach may seem tempting, but we don’t recommend it for exploring the solution space.

Let’s, for example, violate the normality assumption for the same linear regression (figures 9.13–9.15).

figure
Figure 9.13 In Case 1, we observe a normal residual distribution when linearity assumptions are met.
figure
Figure 9.14 Case 2 displays nonnormal residual distribution when linearity assumptions are not met due to the log-normal distribution of the target (there is a clear skewness in the distribution).
figure
Figure 9.15 In Case 3, there’s a nonnormal residual distribution for linear regression when linearity assumptions are not met due to the nonmonotonic dependence of the target from the regressors.

In this example, we are lucky and instantly see what’s wrong with the model:

9.2.4 Fairness of residuals

In ML, fairness is an indicator of inequality among data. How do individual samples contribute to a loss or metric? By what cost do we increase the metric? Does the new model add inequality among residuals or reduce it? Basically, fairness is another term for defining the skewness of the residual distribution.

Not every metric change is meant to be equal. Some improvements are distributed uniformly among all samples, while others add significant growth in one stratum and provoke a decrease in the rest. The concept of “fairness” in residual analysis pushes us toward a more holistic model evaluation procedure, far beyond estimating single-valued metrics.

To get a better understanding of what fairness is, consider figure 9.16.

figure
Figure 9.16 Fair vs. unfair residual distribution

In this simplified example, we have two models. Both decrease the mean absolute error (MAE) by 20%. However, we prefer to deploy the first model because it reduces absolute residuals uniformly among all 10 samples. In contrast, the second model drastically improves metrics on one half and reduces them on the other half. In this case, we add inequality to the residual distribution, so we call this distribution unfair.

One way to assess fairness quantitatively, instead of relying purely on visualizations, is to use the Gini index from economics (see figure 9.17). To compute it, the residuals are sorted by their absolute values, and the cumulative proportion of the total absolute error is plotted against the cumulative proportion of the number of residuals, forming a Lorenz curve.

figure
Figure 9.17 Graphical representation of the Gini coefficient: the graph shows that the Gini coefficient equals the area marked A divided by the sum of the areas marked A and B—that is, Gini = A/(A + B). It is also equal to 2A and to 1 − 2B because A + B = 0.5 (since the axes scale from 0 to 1).

For total fairness (Gini = 0.0), all residuals contribute equally to the total error. For total inequity (Gini = 1.0), a single residual is stealing the covers. These are two extremes that you will rarely face in real life; the common values are always somewhere in between.
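A sketch of the computation on absolute residuals, using the standard discrete Gini formula:

```python
import numpy as np

def gini(residuals):
    """Gini coefficient of absolute residuals: 0 means every sample
    contributes equally to the total error; values near 1 mean a few
    samples dominate it."""
    x = np.sort(np.abs(np.asarray(residuals, dtype=float)))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    # Equivalent to A / (A + B) for the Lorenz curve of sorted |residuals|.
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)
```

Tracking this number across model versions makes "did we add inequality to the residuals?" a one-line check rather than a visual judgment call.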

There are two main reasons to care about fairness. First, we want the model to have high performance among all users, items, or other entities it will be deployed on, not just a fraction of them. This also includes reducing overall inequality in residuals step by step in each iteration of the system.

Second, we want to not only improve the overall error numbers but also avoid degrading individual residuals. If we increase overall search engine quality by 5% while reducing prediction quality by 20% for some segment of users, we damage those users’ experience. The gains we get by improving the average quality may easily be neutralized by the increased churn of these users.

In the long term, we chase close-to-equal growth among all strata. There are exceptions to every rule, and fairness is not that critical for every single project and every metric. We should pay attention to the error cost produced by residual distribution tails. This should define our tradeoff between average and sample-wise improvements.

9.2.5 Underprediction and overprediction

In regression tasks, we often split residuals by sign—positive residuals indicate overprediction (predicted values are greater than true values), while negative residuals indicate underprediction.

Depending on the problem we are set to solve, one or another bias of the model is preferred. For instance, if we are building a demand forecasting system, missed profit is a less desirable outcome than moderate overstocks. On the other hand, if we predict a client’s creditworthiness for a bank, we are better off underestimating it than overestimating it.

Therefore, the cost of an error is often asymmetric, and it should tell us residuals of which sign and size we should pay attention to the most.
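One common way to encode such asymmetric costs is the pinball (quantile) loss; the sketch below uses q = 0.8 purely as an illustration of punishing underprediction harder than overprediction:

```python
import numpy as np

def pinball_loss(y_pred, y_true, q=0.8):
    """Charges q per unit of underprediction and (1 - q) per unit of
    overprediction; q > 0.5 punishes underprediction more, e.g., when
    missed sales hurt more than moderate overstock."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))
```

Training or evaluating against this loss instead of symmetric MAE bakes the asymmetric error cost directly into the model’s bias.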

9.2.6 Elasticity curves

One of the demand-specific error analysis tools is the elasticity plot. It is not a universal tool applicable for any ML system, but because it is crucial for pricing-related applications, it is worth our attention. Although we discussed most of the curve-shaped ways of analysis earlier, this example belongs here as it can be seen as a special case of residual analysis.

One of the core model assumptions behind demand forecasting is price–demand dependency—the higher the price, the lower the demand, and vice versa. This is true for almost all kinds of products (except some special cases like Veblen and Giffen goods, if you recall Microeconomics 101).

An elasticity plot is a special case for a more generic concept called a partial dependence plot, in which we vary some features and analyze how the predicted outcome changes. It is used for model interpretability (which we will cover in chapter 11).

There are two ways to plot the elasticity curve for our model:

As we mentioned before, in a perfect case, the plot demonstrates an inverse dependency: the higher the price, the lower the sales. However, if this plot demonstrates the opposite, is noisy (partly or fully nonmonotonic), or has any other controversial patterns, it will signal one of the following:

The “better” the plot (the more monotonic in the negative direction), the more we can rely on the model’s predictions for these SKUs. If the elasticity is “bad,” it signals that the forecast is not reliable for these “hard” SKUs and should be further investigated rather than deployed (see figures 9.18 and 9.19).

figure
Figure 9.18 Theoretical example of elastic demand
figure
Figure 9.19 Theoretical example of inelastic demand

It is also helpful to plot elasticity not only for predicted sales but for actual sales as well, to see whether this particular SKU reveals distinct elasticity in the first place. If it doesn’t, we should not expect it for the predicted demand.

If you feel like you need more context on the price elasticity concept, we recommend the article “Forecasting with Price Elasticity of Demand” ().
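An elasticity sweep can be sketched as a partial-dependence-style loop over the price feature, holding all other features fixed; `demand_model` below is a made-up stand-in for a trained model, not a real one:

```python
import numpy as np

def elasticity_curve(predict_fn, features, price_grid):
    """Re-predict demand for each candidate price while all other
    features stay fixed (a partial-dependence-style sweep)."""
    preds = []
    for price in price_grid:
        f = dict(features, price=price)
        preds.append(predict_fn(f))
    return np.array(preds)

def is_monotonic_decreasing(values, tol=1e-9):
    """Sanity check: does predicted demand fall as price rises?"""
    return bool(np.all(np.diff(values) <= tol))

# Hypothetical stand-in for a trained demand model, for illustration only.
def demand_model(f):
    return 100.0 * np.exp(-0.05 * f["price"]) * (1.0 + 0.1 * f["is_weekend"])
```

SKUs whose curve fails the monotonicity check are exactly the “hard” ones whose forecasts deserve investigation before deployment.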

A friend of ours recently told his own campfire story about using an elasticity curves application. He had been working on a pricing problem, and their solution was effectively a glorified elasticity plot. They built a gradient-boosting model that predicted the number of sold items using various features, including price-based ones and estimated sales for different possible prices. Their first model revealed a surprising pattern: instead of having a smooth look, the plot had a visible “ladder” of steps. After deeper analysis, they realized that the origin of the steps was related to the features they used; continuous variables were split into buckets with low cardinality (e.g., only 256 buckets for all possible prices across all the items on the marketplace), and it limited the model’s sensitivity. After the number of buckets was increased, the model was able to capture more detailed patterns, and the elasticity curve became smooth, improving the overall system performance.

9.3 Finding commonalities in residuals

Now that we’ve examined residual distribution as a whole, it’s time to investigate the patterns and trends in residual subgroups. To do that, we approach the problem from both ends (see figure 9.20):

figure
Figure 9.20 Two approaches to finding commonalities in residuals: picking subsets by residuals rank or by sample attributes

Grouping residuals by their values includes best-case and worst-case analysis; it also covers overprediction and underprediction problems. Grouping residuals by their properties produces group analysis and corner case analysis.

9.3.1 Worst/best-case analysis

The goal of worst/best-case analysis is to define typical cases where the model works fine and where we should avoid making decisions based on this model. What do residuals have in common in these extreme cases?

9.3.2 Adversarial validation

If manual worst-case analysis doesn’t provide new insights, a “machine learning model for analyzing machine learning models” could help. In chapter 7, we discussed a concept called adversarial validation. It was derived from ML competitions and is used to check whether the distributions of two datasets differ: often we concatenate train and test datasets, labeling their samples 0 and 1.

Adversarial validation can be easily transferred to the rails of residual analysis: we set 0 for “good” samples of data and 1 for “bad” samples (e.g., taking the top-N% biggest residuals in the second case). We should try different thresholds for our particular set.

The rest of the algorithm is similar: we fit a simple classifier (e.g., logistic regression) on these labels and calculate the receiver operating characteristic area under the curve (ROC AUC). If two classes are separable (area under the curve is significantly greater than 0.5), then we analyze the model’s weights, and this gives us a hint of which exact features best distinguish our “worst” cases from others.
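A self-contained sketch of the whole procedure, with plain-NumPy logistic regression and rank-based AUC standing in for library implementations:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; returns weights."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def adversarial_validation(X, residuals, top_frac=0.1):
    """Label the top-N% largest |residuals| as 1, the rest as 0,
    fit a classifier, and report AUC plus feature weights."""
    thresh = np.quantile(np.abs(residuals), 1 - top_frac)
    y = (np.abs(residuals) >= thresh).astype(float)
    w = fit_logreg(X, y)
    scores = np.hstack([X, np.ones((len(X), 1))]) @ w
    return roc_auc(scores, y), w[:-1]  # drop bias from reported weights
```

An AUC well above 0.5 means the “worst” samples are distinguishable, and the largest weights point at the features that distinguish them.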

Sometimes we find no easily defined patterns during worst-case analysis. This is fine; it means we have already picked the low-hanging fruit in the model improvement space.

9.3.3 Variety of group analysis

Group analysis enables the identification of distinct patterns and trends within the residuals of various groups, segments, classes, cohorts, or clusters. For instance, in both binary and multiclass classification scenarios, one effective approach involves splitting the residuals by classes (i.e., by target variable), allowing for separate analysis of the residuals in each group.

Many typical applications processing tabular data, such as fraud detection systems, rely on grouping samples based on specific characteristics, such as geography or traffic source. By analyzing the residuals within each segment, it becomes possible to uncover common biases present in the model’s predictions. These insights can then guide further model refinement by incorporating more relevant features or adjusting the weights of existing features.

When dealing with the data in the form of text, images, or videos, there may be no distinct cohorts or groups by default. In such cases, an alternative method involves manually classifying a set of N residuals and assigning labels to each issue encountered (e.g., identifying images that are too dark or blurry or flagging texts with specific wording or style). This process allows for the discovery of specific problematic clusters where the model underperforms. Consequently, it provides guidance on what type of data should be collected next to improve the system.

9.3.4 Corner-case analysis

Corner-case analysis aims to test the model in rare circumstances. Typically, we would like to have a benchmark, a fixed set of already-captured corner cases, to quickly examine the behavior of each new model.

Here are some ideas for what we can check during corner-case analysis:

While the best/worst-case analyses ask which data our model digests excellently or badly, the corner-case analysis and cohort analysis ask about the model’s performance on predefined subsets of data.

9.4 Design document: Error analysis

Because we’re convinced that error analysis should be among the essential elements of ML system design, we will include this phase in both our design documents.

9.4.1 Error analysis for Supermegaretail

We start off with Supermegaretail, where we will suggest the approach that will help the company achieve its main goal—to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding out-of-stock situations.

9.4.2 Error analysis for PhotoStock Inc.

Now we get back to PhotoStock Inc., which requires a modern search tool able to find the most relevant shots based on customers’ text queries while providing excellent performance and displaying the most relevant images in stock.

Summary
