Book: Machine Learning System Design

5 Loss functions and metrics

This chapter covers

  • Selecting proper metrics and losses for your machine learning system
  • Defining and utilizing proxy metrics
  • Applying the hierarchy of metrics

In the previous chapter, we first touched on the topic of creating a design document for your machine learning (ML) system. We figured out why a design document is subject to constant edits and why all the changes you implement in it are not only inevitable but also necessary.

Unfortunately, an ML system can’t directly solve a problem, but it can try to approximate it by optimizing a specific task. To do that efficiently, it must be adjusted, appropriately guided, and monitored.

To direct an ML system’s effort, we use its algorithm’s loss function to reward or punish it for reducing or increasing specific errors. However, the loss function is used to train the model and usually must be differentiable, which narrows the choice of available loss functions. Thus, to assess the model’s performance, we use metrics; and while every loss function can be used as a metric (a good example is root mean squared error [RMSE], which is quite often used as a metric, although we are not sure that is the best decision), not every metric can be used as a loss function.

In this chapter, we will discuss how to pick the best-fitting metrics and loss functions, focusing on how to do proper research and provide motivation for choice during the design process.

5.1 Losses

The loss function, also known as the objective or cost function, effectively defines how a model learns about the world and the connections between dependent and independent variables, what it pays the most attention to, what it tries to avoid, and what it considers acceptable. Thus, the choice of a loss function can drastically affect your model’s overall performance, even if everything else—features, target, model architecture, dataset size—remains unchanged. Switching to a different loss function can completely reshape your whole system.

Picking the right loss function (i.e., choosing the way a model learns from its mistakes) is one of the most crucial decisions in designing an ML system. Recall the evergreen anecdote: we can be pretty confident in optimizing for the mean while computing the average salary of bar visitors, right until Bill Gates walks in.

Unfortunately, not every function can be used as a loss function. In general, a loss function must feature two properties:

While these two points are relevant for any loss, it is important to select a loss function that will best match your particular case and will be closest to the final goal of your system.

This is where advanced loss functions come into play, providing tempting ways of improving your model. Unlike manipulations with features or the model itself, they don’t usually affect the runtime aspect, meaning that all the code changes are only related to training pipelines, and isolating changes to a small part of a system is always a good property of design. But more often than not, we have witnessed ML engineers (especially recent graduates) sticking to a particular loss function just because they got used to applying it to similar problems. A notorious example is the regression problem with the mean squared error (MSE) or mean absolute error (MAE) loss function as the default choice and, many times, the only choice by many practitioners.

At the same time, while choosing a proper loss function (or a set of them) is a decision that may greatly improve your model’s performance, it is still not a silver bullet. We have worked with a few ML engineers (often with respectable academic backgrounds and PhDs) who tried to solve all the problems they had with just one elegant loss function. This approach is on the opposite end of the spectrum from paying no attention to the loss function at all, but it is still far from ideal. A good ML system designer keeps many tools in mind, not overfitting for one. Overall, the heuristic is the following: the more research-heavy your system is, the more likely it is that you need to invest time in finding or designing a nontrivial loss function.

A couple of years ago, Valerii worked with an intern on building a model to predict the exchange volume of cryptocurrencies. As always, he asked the intern to prepare a design document before doing anything, and this was an insightful exercise. The intern thoughtlessly skipped the loss function chapter, listing some metrics he would use to assess the system performance without any reasoning behind them.

Why is this not acceptable? Let’s review a simplified situation in which our knowledge of loss functions for regression problems is narrowed down to the two most widely used ones: MSE and MAE.

Imagine that we have a vector of target values Y = [100, 100, 100, 100, 100, 100, 100, 100, 100, 1000] and a vector of independent variables X being equal for all samples.

If we train a model using MSE as a loss function, it will output a vector of predictions:

Y_hat = [190, 190, 190, 190, 190, 190, 190, 190, 190, 190]

If we train a model using MAE as a loss function, it will output a vector of predictions:

Y_hat = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]

When we calculate MSE and MAE for the model trained with the MSE loss function, we get the following numbers: MSE = 72,900, MAE = 162, with the mean of residuals equal to 0 and the median of residuals equal to –90 (figure 5.1).

figure
Figure 5.1 Residuals after optimizing the mean

When we calculate MSE and MAE for a model with the MAE loss function, the result will be MSE = 81,000, MAE = 90, with the mean of residuals equal to 90 and the median of residuals equal to 0 (figure 5.2).

figure
Figure 5.2 Residuals after optimizing the median

No wonder the model optimized for MSE yields better MSE; and because minimizing MSE drives predictions toward the mean, its mean of residuals is better. On the other hand, the model optimized for MAE delivers better MAE; and because minimizing MAE drives predictions toward the median, its median of residuals is better. But what does this mean for us? Which loss function is better? That depends on our application.
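The numbers above are easy to reproduce. The sketch below uses constant predictors (reasonable here, since X is identical across samples) and relies on the fact that the constant minimizing MSE is the mean of the targets, while the constant minimizing MAE is the median:

```python
import numpy as np

y = np.array([100] * 9 + [1000])

# A constant predictor minimizing MSE is the mean; for MAE, the median
pred_mse = np.full_like(y, y.mean(), dtype=float)       # 190.0
pred_mae = np.full_like(y, np.median(y), dtype=float)   # 100.0

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

print(mse(y, pred_mse), mae(y, pred_mse))  # 72900.0 162.0
print(mse(y, pred_mae), mae(y, pred_mae))  # 81000.0 90.0
```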

Let’s say we are optimizing a navigation system for aircraft, and an error larger than 850 means that a plane will miss the landing field and crash. In this case, optimizing for MAE is not an ideal decision. Sure, 9 times out of 10 we get a perfect result, and only 1 time out of 10 a vehicle is destroyed, but this is not acceptable by any means. We have to avoid outliers at all costs and penalize them heavily, thus using MSE or even some higher-degree modification.

But suppose we are optimizing the amount of liquidity a cryptocurrency exchange needs for every trading day. Liquidity refers to a cryptocurrency’s capacity to be converted into cash or other cryptocurrencies without losing value, and it is essential for all cryptocurrency exchanges. High liquidity signifies a dynamic and stable market, allowing participants to trade quickly at reasonable prices. Excessive liquidity, however, means that allocated resources are not used. In this case, reserving more cash than required 9 times out of 10 is far from desired. We can review it from a different angle: the model optimized for MSE overallocated 810 units and underallocated 810 units, while the model optimized for MAE was on the spot 9 times out of 10 and underallocated 900 units, which seems like a better way (if underallocation is less than 9 times worse than overallocation) to convey to the model what we need.
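If underallocation and overallocation carry asymmetric costs, we are not limited to MSE and MAE: a quantile (pinball) loss, a standard tool for exactly this situation, encodes the asymmetry directly. A minimal sketch:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    # Quantile (pinball) loss: q close to 1 punishes underprediction harder,
    # q close to 0 punishes overprediction harder; q = 0.5 is MAE / 2
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

actual = np.array([100.0])
under = pinball_loss(actual, np.array([90.0]), q=0.9)   # short by 10 units
over = pinball_loss(actual, np.array([110.0]), q=0.9)   # 10 units too many
# At q = 0.9, underallocating by 10 costs 9x more than overallocating by 10
```

Training with q above 0.5 nudges a regression model toward overallocation, which is exactly the behavior the aircraft example calls for; q below 0.5 would suit the liquidity example.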

It’s easy to see that even though we used MSE and MAE to train the models, we applied different criteria to assess them. For the aircraft navigation system, we counted the number of times the difference between the actual and predicted value was greater than 850. For liquidity optimization, it was the number of times we were on the spot, or a weighted sum of under- and overallocation. This illustrates that training the model to optimize a specific loss function and assessing the model’s performance can be two different tasks, which we will cover in section 5.2 on metrics. Before we proceed, we’d like to share some insights on the nuances of determining losses for deep learning models.

5.1.1 Loss tricks for deep learning models

In deep-learning-based systems, especially those processing text, image, or audio data, loss selection is even more crucial.

A properly chosen loss function can help with many problems related to model training, especially for a sophisticated model and/or data domain. For example, cross-entropy loss is a classical solution for the classification problem. One of its problems is class imbalance: if one class is heavily overrepresented, a model optimized with cross-entropy loss may face something called mode collapse, a situation in which it outputs a constant (the popular class) for any input. This problem has been attacked in many ways (e.g., data undersampling/oversampling, custom class weights), but all of them require significant manual tuning and are not fully reliable. Researchers have tried to design losses that address it directly; the most notable result is probably by Lin et al. (“Focal Loss for Dense Object Detection”), and this loss now takes its honorable place among the tools helping to solve the data imbalance problem.

Focal loss (see figure 5.3) is a dynamically scaled cross-entropy loss where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples.

figure
Figure 5.3 The suggested focal loss function focuses more on misclassified examples while reducing the relative loss for well-classified examples (source: Lin et al.).

Originally, this loss was introduced for the object detection problem specific to computer vision, and later, the approach expanded to many other domains, including those unrelated to images, like audio or natural language processing. The most distant application of focal loss we have found was introduced in the paper “Can Natural Language Processing Help Differentiate Inflammatory Intestinal Diseases in China?” (Tong et al.), which confirms how ideas spread across domains.
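As a sketch of the idea (not the authors’ reference implementation), a binary version of focal loss fits in a few lines; with gamma = 0 the scaling factor disappears and the function reduces to plain cross-entropy, while larger gamma suppresses the contribution of well-classified examples:

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0):
    # p: predicted probability of class 1; y: true labels in {0, 1}
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    # (1 - p_t)^gamma decays to zero as confidence in the correct class grows
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

p = np.array([0.95, 0.60])  # one easy and one hard example, both of class 1
y = np.array([1, 1])
focal = binary_focal_loss(p, y, gamma=2.0)     # easy example nearly ignored
plain_ce = binary_focal_loss(p, y, gamma=0.0)  # reduces to cross-entropy
```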

In some cases, a reasonable solution will be to combine multiple losses for a single model. The need for such an approach may arise with complex problems, often multimodal and often associated with multiple concurrent datasets. We will not provide many details on using combined loss functions here as it is research-heavy, but we would like to give some examples:

  1. “Authentic Volumetric Avatars from a Phone Scan” (Cao et al.). The authors combined three families of losses (segmentation, reconstruction, perceptual). Generative computer vision models often call for combined losses.
  2. “Highly Accurate Protein Structure Prediction with AlphaFold” (Jumper et al.). The famous AlphaFold 2 model predicts the 3D shapes of proteins from their genetic sequences with impressive accuracy. That’s a huge thing for the biotech world, and it uses multiple auxiliary losses under the hood. For example, it includes a masked language modeling objective, likely inspired by the loss used in BERT-like architectures, a popular family of natural language processing models.
  3. “GrokNet: Unified Computer Vision Model Trunk and Embeddings for Commerce” (Bell et al.). This is a jewel among the combined loss examples we can recall. The authors aimed to build a single model to rule multiple problems, so they used 7 product datasets and 83 losses (80 categorical and 3 embedding)!

In general, multiple losses are usually used either to help models’ convergence or to solve multiple adjustment problems with a single model.
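In its simplest form, combining losses is a weighted sum of task-specific terms backpropagated as one scalar. The task names and weights below are hypothetical tuning knobs, and finding good ones is a research problem of its own:

```python
def combined_loss(losses, weights):
    # Weighted sum of per-task loss values computed on the same batch
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-task values for one training step (cf. the avatar paper)
batch_losses = {"segmentation": 0.70, "reconstruction": 1.20, "perceptual": 0.40}
task_weights = {"segmentation": 1.0, "reconstruction": 0.5, "perceptual": 0.1}
total = combined_loss(batch_losses, task_weights)  # the scalar to optimize
```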

While loss functions guide the training process, helping to tune accuracy, improve efficiency, and minimize errors for your system, metrics are used to evaluate its performance within a certain set of parameters.

5.2 Metrics

The loss function we optimize and the metric we use to assess our model’s performance can be very different from each other. Recall that the end goal of the demand forecast system for Supermegaretail in chapter 4 was to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding an out-of-stock situation. If we try to visualize the pipeline, it might look like figure 5.4.

We know that a proper loss function is essential, but what about metrics? Can’t we pick some standard metrics, assess a variety of models, choose the best, deploy it, and estimate potential success through A/B tests?

figure
Figure 5.4 A general-purpose pipeline for a demand forecast system that perfectly fits the Supermegaretail case

Unfortunately, no. Choosing the right set of metrics deserves just as careful an elaboration as selecting loss functions. Moreover, while the set of popular losses is finite, there is always an opportunity to tailor a custom metric to a specific business domain. Choosing the wrong metric, in turn, can cause misguided optimization: we set our model to train for irrelevant values, which eventually leads to poor performance in real-world scenarios. As a result, we have to roll back several steps in model development, wasting significant time and resources. But even choosing the right metric for your ML system will not guarantee the project’s success.

On the surface, a framework for picking the right metric is very straightforward: choose the one that is closest to the final goal. However, as the next campfire story will show, it might be very tricky to do. You can try either finding that metric yourself or using some outside help. The following are some options we recommend considering:

If you don’t have the luxury of having the things mentioned here, you can do the following:

In the next campfire story, we will review the canonical binary classification problem.

Some cases, however, force you to improvise in order to find a metric that elicits the required behavior from your system.

One important factor in the success of your ML system will always be its consistency. To achieve this, there is a separate category of metrics, which we cover in the following section.

5.2.1 Consistency metrics

In applied ML, a model that has a consistent output when presented with slightly perturbed inputs is often desired. This property, known in different subfields as consistency, robustness, stability, or smoothness, can be formally defined as the requirement that the model be invariant under certain transformations, such that the difference between the model’s output on the original input and the model’s output on the perturbed input tends toward zero. In other words, we can express this property mathematically as

|f(x + eps) – f(x)| → 0 as eps → 0

where f represents the model, x represents the original input, and eps represents the perturbation applied to the input. Consistency metrics are not commonly discussed in academic ML but are an important consideration in practical applications where small changes to the input can have significant effects on the model’s output from the product perspective.

Perturbations can be different. For example, for a solid computer vision model, a minor change of lighting usually should not change model outputs, or a sentiment analysis model should not be sensitive to changing words with synonyms. We will talk about such perturbations and invariants in more detail later, when discussing ML system testing in chapter 10.
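Such a check can be sketched in a few lines, with a toy model standing in for f; the perturbation type and scale are assumptions you must pick per domain:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Toy stand-in for f: any callable mapping inputs to scores would do
    return np.tanh(x).sum(axis=-1)

def consistency_gap(f, x, eps):
    # max |f(x + eps) - f(x)| over the batch; should be small for small eps
    return float(np.abs(f(x + eps) - f(x)).max())

x = rng.normal(size=(8, 16))
small = consistency_gap(model, x, 1e-6 * rng.normal(size=x.shape))
large = consistency_gap(model, x, 1.0 * rng.normal(size=x.shape))
# A robust model keeps `small` near zero; `large` may legitimately be big
```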

There’s another similar property: when the model is retrained (e.g., with the addition of new data or even with other seeds), we expect it to produce the same or close outputs, given that inputs remain unchanged. For an antifraud system, it is not acceptable if the same user is considered fraudulent today, legitimate tomorrow, and a fraudster again next week:

|f_retrained(x) – f(x)| → 0, where f_retrained is a new release of the model trained on updated data

When the model outputs are different over time, the release of a new model (which should be a routine procedure for most ML systems) may affect the downstream system or end users of the system, disturbing their common usage scenarios. People rarely like unexpected changes in their tools and environment.

Such properties can be as important as default features we expect from a model (such as accurate predictions) because they shape expectations. As we discussed in earlier chapters, if a model can’t be trusted, its utility is reduced. Thus, we need specific metrics to measure this kind of behavior.

Luckily, we formulated these properties strictly enough, so the biggest open question left is to estimate a proper type of noise or perturbation for the preceding formulas: what are the invariants, and how are the conditions expected to change over time?

With these estimations in place, you can attach your regular metrics to estimate consistency. For example, for the search engine case (PhotoStock Inc.), we don’t want a document to change its rank for some query between releases of the system, so the consistency metric could be the variance of ranks for a (query, document) pair over time, computed over corpora of documents and queries. Obviously, the lower the variance, the better for the system. Still, you can’t forget about ill-posed situations: say, a dummy constant model tends to provide the lowest variance, but that’s not the consistency ML engineers usually hunt for.
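A sketch of this metric on hypothetical data (both the pairs and the rank histories are made up for illustration):

```python
import numpy as np

# Ranks of (query, document) pairs observed over three model releases
ranks_over_releases = {
    ("sunset beach", "img_001"): [1, 1, 2],   # stable document
    ("sunset beach", "img_042"): [5, 9, 4],   # jumpy document
}

def rank_consistency(ranks):
    # Mean per-pair rank variance: 0 means every pair kept its rank exactly
    return float(np.mean([np.var(r) for r in ranks.values()]))

score = rank_consistency(ranks_over_releases)  # lower is more consistent
```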

Consistency is often an important property of an ML system (see figure 5.5). If it’s the case for your system, consider adding a metric reflecting how your system responds to the changes to input data, training data, or training procedure tweaks.

figure
Figure 5.5 New model releases are fairly consistent when estimating the probability (P) of the user (U) being fraudulent (F).

Eventually, you will be able to form a single metrics system based on a clear hierarchy of offline and online metrics.

5.2.2 Offline and online metrics, proxy metrics, and hierarchy of metrics

Setting and improving appropriate metrics is an important step in building an efficient ML system. But even that is not our end goal, as we have to go one level deeper into the rabbit hole. When we had a plan to reduce spam and fraudulent behavior, the goal was not to have the highest recall at a given specificity. It was to improve the user experience by lowering the number of spam messages and making it safer by reducing the risk of fraudulent behavior.

In the Supermegaretail case, the goal was to reduce losses due to out-of-stock and overstock situations, which can be expressed in cash equivalent, but not in mean absolute error (MAE), mean squared error (MSE), weighted mean absolute percentage error (wMAPE), weighted absolute percentage error (WAPE), or any other metric.

In other words, the metric we used to assess the model during the training/testing/validation stages and the final metrics are rarely the same (see table 5.1).

The previously discussed set is also called offline metrics because we can apply and calculate them without deploying the model into production. In contrast, some metrics, usually our goal metrics, can be calculated only after implementing the system and using its output in the business. And although sometimes offline and online metrics might coincide, we still have to assess them differently. The most common way to evaluate online metrics (change/improvement) is through A/B testing.

We use offline metrics for a simple reason: we can use them before deploying the system. This method is quick and reproducible, and it doesn’t require an expensive model deployment process. Offline metrics must have one quality: they must be good predictors of online metrics. In other words, an increase or decrease in an offline metric has to be strongly correlated with, or proportional to, an increase or decrease in the online metrics. Offline metrics thus play the role of proxy metrics for online metrics.
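One practical way to check this "good predictor" quality is to correlate metric changes across past experiments; all the deltas below are made up for illustration:

```python
import numpy as np

# Per-experiment metric changes from past A/B tests (hypothetical numbers)
offline_delta = np.array([0.010, -0.004, 0.021, 0.002, -0.013])  # offline metric
online_delta = np.array([0.8, -0.1, 1.9, 0.3, -0.9])             # online metric, %
proxy_quality = float(np.corrcoef(offline_delta, online_delta)[0, 1])
# Close to 1.0 means the offline metric is a trustworthy proxy
```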

Table 5.1 Examples of offline and online metrics

  Offline metric: Recall at given specificity for spam message classification
  Online metric: Number of user complaints about spam messages

  Offline metric: Quantiles of 1.5, 25, 50, 75, 95, and 99
  Online metric: Value of expired items, total sales

  Offline metric: Mean reciprocal rank, normalized discounted cumulative gain
  Online metric: Click-through rate on the search engine result page

If we can find offline metrics that are strongly correlated with our online metrics, so that improving one improves the other, we can apply the same logic to online metrics themselves and find simpler online proxies for the goal metric. Let’s review this with an example.

Imagine that we are building a recommender system for an eCommerce website. Our final goal is to increase gross merchandise value (GMV; this is a metric that measures the total value of sales over a given period). Unfortunately, as mentioned already, this is not something we can measure until we deploy our system into production and run A/B tests. We believe that increasing the number of items purchased will increase GMV. To achieve that, we want to increase the conversion rate by providing users with an offer that has a higher chance of being purchased (assuming this will increase the overall number of purchased items).

On average, 3% of offers end up being clicked, and 3% of those lead to a purchase: 3% times 3% means that if we show 10,000 offers, only 9 will lead to a purchase. This has two adverse, interconnected consequences:

For example, for A/B tests with a 9/10,000 ratio of successes to attempts, we would need roughly 10 times more data than for a 90/10,000 ratio to detect the same relative effect (the required number of samples grows quadratically as the minimum detectable effect shrinks).
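The quadratic dependency can be illustrated with a rough two-proportion sample-size formula (80% power, 5% two-sided significance, z-values hardcoded; a sketch, not a full power analysis):

```python
import math

def samples_per_group(p_base, relative_mde):
    # Rough two-proportion estimate: n = 2 * (z_a + z_b)^2 * p(1-p) / delta^2
    delta = p_base * relative_mde
    z = 1.96 + 0.84  # z_{alpha/2} + z_{beta} for 5% significance, 80% power
    return math.ceil(2 * z**2 * p_base * (1 - p_base) / delta**2)

# Halving the detectable effect roughly quadruples the required sample size
n_10pct = samples_per_group(0.0009, 0.10)  # detect a 10% relative uplift
n_5pct = samples_per_group(0.0009, 0.05)   # detect a 5% relative uplift
```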

To mitigate that, we can use a proxy metric, click-through rate (CTR), with the following context in mind:

Using CTR instead of the conversion rate (CR) helps us iterate faster and with higher sensitivity, both offline (estimating metrics and loss is easier with more data for the class of interest) and online (at least partly through A/B testing).

We can represent this in the following relation:

CTR → CR → number of purchases → GMV

We can further generalize this by building a hierarchy of metrics:

  1. The global, company-wide metric is revenue.
  2. Global revenue (GMV) is composed of the revenue from different products, including the product we are responsible for.
  3. Our product revenue is affected by
    • Average purchase price
    • Purchase frequency
    • Number of users (they are interconnected and have mutual influence, thus dotted lines)
  4. Purchase frequency is affected by CR.
  5. The conversion rate is affected by CTR.
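To make the chain above concrete, here is a toy funnel linking the proxy metric (CTR) to the goal metric (GMV); the average price is hypothetical, and we assume downstream rates stay fixed when CTR moves, which is exactly the assumption an A/B test must verify:

```python
# Hypothetical funnel: all rates and the average price are illustrative
offers = 10_000
ctr = 0.03        # click-through rate (proxy metric)
cr = 0.03         # click-to-purchase conversion rate
avg_price = 25.0  # hypothetical average purchase price

purchases = offers * ctr * cr  # about 9 purchases per 10,000 offers
gmv = purchases * avg_price

gmv_uplifted = offers * (ctr * 1.10) * cr * avg_price  # +10% CTR, same funnel
# Under these assumptions, a 10% CTR uplift translates into a 10% GMV uplift
```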

A hierarchy of metrics (see figure 5.6) facilitates finding proper proxy metrics. Even though creating it lies outside the scope of designing an ML system, it will be handy to have one in place and refer to it during the design process. Using a common ground helps prove the choice and reduces the risk of failure.

figure
Figure 5.6 Hierarchy of metrics

A hierarchy of metrics is especially important when the system gets mature enough so that some metrics can be contradictory. A friend of ours once told us a short anecdote about building a recommendation system: a variant that demonstrated higher engagement by internal users (they preferred new recommendations over previous versions) appeared to be way less profitable on a wider audience.

The concepts of a hierarchy of metrics and proxy metrics are connected to the multicomponent losses we discussed earlier. For example, when building this recommender engine for Supermegaretail, we can tailor a specific loss function that considers multiple levels of user activity (clicks, purchases, total number of purchased items) and balances our interest between metrics.

5.3 Design document: Adding losses and metrics

Starting in chapter 4, we began to introduce design documents for two fictional cases: Supermegaretail and PhotoStock Inc. Here we continue to elaborate on the development of ML solutions for each case, covering the selection of loss functions and metrics. We start with Supermegaretail, followed by PhotoStock Inc.

5.3.1 Metrics and loss functions for Supermegaretail

Let’s refresh our memory on the Supermegaretail case. There, we were to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding an out-of-stock situation with a specific service-level agreement (SLA) to be specified further.

5.3.2 Metrics and loss functions for PhotoStock Inc.

Next up is the PhotoStock Inc. design document, where a whole different set of losses and metrics should be applied based on the nature of the business case and the problem to be solved. In the case of PhotoStock Inc., we were hired to build a modern search tool that can find the most relevant shots based on customers’ text queries while providing excellent performance and displaying the most relevant images in stock.

5.3.3 Wrap up

The examples from these two design documents show how important it is to choose the right metrics and loss functions. Just like any other key element in building an ML system, metrics and loss functions should coincide with the goals of your project. And if you feel there’s more time needed to define the appropriate parameters, please find a few days in your schedule to do it so you don’t have to roll back a few miles in a month or more.

The next chapter covers data gathering, datasets, the difference between data and metadata, and how to achieve a healthy data pipeline.

Summary
