5 Loss functions and metrics

This chapter covers

Selecting proper metrics and losses for your machine learning system
Defining and utilizing proxy metrics
Applying the hierarchy of metrics

In the previous chapter, we first touched on the topic of creating a design document for your machine learning (ML) system. We figured out why a design document is subject to constant edits and why all the changes you implement in it are not only inevitable but also necessary.

Unfortunately, an ML system can’t directly solve a problem, but it can try to approximate it by optimizing a specific task. To do that efficiently, it must be adjusted, appropriately guided, and monitored.

To direct an ML system’s effort, we use its algorithm’s loss function to reward or punish if for reducing or increasing specific errors. However, the loss function is used to train the model and usually must be differentiable, meaning that there is a narrowed choice of available loss functions. Thus, to assess the model’s performance, we use metrics; and while every loss function can be used as a metric (a good example would be root mean squared error [RMSE], which is quite often used as a metric, although we are not sure that is the best decision), not every metric can be used as a loss function.

In this chapter, we will discuss how to pick the best-fitting metrics and loss functions, focusing on how to do proper research and provide motivation for choice during the design process.

5.1 Losses

The loss function, also known as the objective or cost function, effectively defines how a model learns about the world and the connections between dependent and independent variables, what it pays the most attention to, what it tries to avoid, and what it considers acceptable. Thus, the choice of a loss function can drastically affect your model’s overall performance, even if everything else—features, target, model architecture, dataset size—remains unchanged. Switching to a different loss function can completely reshape your whole system.

Picking the right loss function (i.e., choosing the way a model learns from its mistakes) is one of the most crucial decisions in designing an ML system. Recalling an evergreen anecdote, we can be pretty confident in optimizing for the mean while counting the average salary of bar visitors until Bill Gates walks in ().

Unfortunately, not every function can be used as a loss function. In general, a loss function feature two properties:

It is globally continuous (changes in predictions lead to changes in losses).
It is differentiable (its gradient can be calculated for optimization algorithms based on the gradient descent). There is one exclusion: in exotic cases, gradient-free optimization methods are applicable, although practitioners prefer to avoid them as gradient-based methods typically converge much better.

While these two points are relevant for any loss, it is important to select a loss function that will best match your particular case and will be closest to the final goal of your system.

This is where advanced loss functions come into play, providing tempting ways of improving your model. Unlike manipulations with features or the model itself, they don’t usually affect the runtime aspect, meaning that all the code changes are only related to training pipelines, and isolating changes to a small part of a system is always a good property of design. But more often than not, we have witnessed ML engineers (especially recent graduates) sticking to a particular loss function just because they got used to applying it to similar problems. A notorious example is the regression problem with the mean squared error (MSE) or mean absolute error (MAE) loss function as the default choice and, many times, the only choice by many practitioners.

At the same time, while choosing a proper loss function (or a set of them) is a decision that may greatly improve your model’s performance, it is still not a silver bullet. We have worked with a few ML engineers (often with respectable academic backgrounds and PhDs) who tried to solve all the problems they had with just one elegant loss function. This approach is on the opposite end of the spectrum from paying no attention to the loss function at all, but it is still far from ideal. A good ML system designer keeps many tools in mind, not overfitting for one. Overall, the heuristic is the following: the more research-heavy your system is, the more likely it is that you need to invest time in finding or designing a nontrivial loss function.

A couple of years ago, Valerii worked with an intern on building a model to predict the exchange volume of cryptocurrencies. As always, he asked the intern to prepare a design document before doing anything, and this was an insightful exercise. The intern thoughtlessly skipped the loss function chapter, listing some metrics he would use to assess the system performance without any reasoning behind them.

Why is this not acceptable? By using an example, we can review a simplified situation with a knowledge of loss functions for regression problems being narrowed down to the two most widely used loss functions: MSE and MAE.

Imagine that we have a vector of target values Y = [100, 100, 100, 100, 100, 100, 100, 100, 100, 1000] and a vector of independent variables X being equal for all samples.

If we train a model using MSE as a loss function, it will output a vector of predictions:

Y_hat = [190, 190, 190, 190, 190, 190, 190, 190, 190, 190]

If we train a model using MAE as a loss function, it will output a vector of predictions:

Y_hat = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]

NOTE Please note that this is a thought experiment to highlight the idea and make it easier to comprehend. If we needed to, we could create synthetic data to reproduce the whole process—features, targets, and models—but for the sake of simplicity, we will use only the preceding numbers.

When we calculate MSE and MAE for a model with the RMSE loss function, it will result in the following numbers: MSE = 72,900, MAE = 162, with the mean of residuals equal to 0 and the median of residuals equal to –90 (figure 5.1).

Figure 5.1 Residuals after optimizing the mean

When we calculate MSE and MAE for a model with the MAE loss function, the result will be MSE = 81,000, MAE = 90, with the mean of residuals equal to 90 and the median of residuals equal to 0 (figure 5.2).

Figure 5.2 Residuals after optimizing the median

No wonder the model optimized for MSE yields better MSE, and as MSE tries to minimize the mean, the mean residuals are better. On the other hand, the model optimized for MAE delivers better MAE, and as MAE tries to optimize the median, the median residuals are better. But what does it mean for us? Which loss function is better? That depends on our application.

Let’s say we are optimizing a navigation system for aircraft, and an error larger than 850 means that a plane will go off a landing field and crash. In this case, optimizing for MAE is not an ideal decision. Of course, we can say 9 out of 10 times that we have a perfect result, and only 1 out of 10 times a vehicle is destroyed, but this is not acceptable by any means. We have to avoid outliers at all costs or penalize them, thus using MSE or even some higher-degree modifications.

But suppose we are optimizing the amount of liquidity for a cryptocurrency exchange we need for every trading day. Liquidity refers to a cryptocurrency’s capacity to be converted into cash or other cryptocurrencies without losing value, and it is essential for all cryptocurrency exchanges. High liquidity signifies a dynamic and stable market, allowing participants to trade quickly at reasonable prices. Excessive liquidity, however, means that allocated resources are not used. In this case, reserving more cash than required 9 times out of 10 is far from desired. We can review it from a different angle: the model optimized for MSE overallocated 810 units and underallocated 810 units, while the model optimized for MAE was on the spot 9 times out of 10 and underallocated 900 units, which seems like a better decision (if underallocation is less than 9 times worse than over allocation) to convey to the model what we need.

It’s easy to see that even though we used MSE and MAE to train the models, we applied different criteria to assess them. For the aircraft navigation system, we counted the number of times when the difference between the actual and predicted value was greater than 850. For liquidity optimization, it was the number of times we were on the spot or under an overallocation weighted sum. This illustrates that training the model to optimize specific loss functions and assess this model’s performance can represent two different tasks, which we will cover in section 5.2 on metrics. Before we proceed, we’d like to share some insights on the nuances and aspects of determining losses for deep learning models.

5.1.1 Loss tricks for deep learning models

In deep-learning-based systems, especially those processing text, image, or audio data, loss selection is even more crucial.

A properly chosen loss function can help with many problems related to model training, especially a sophisticated model and/or data domain. For example, a cross-entropy loss is a classical solution for the classification problem. One of the problems with it is related to class imbalance. If one class is heavily overrepresented, the model optimized by the entropy loss may face something called mode collapse—a situation when it outputs a constant (popular class) for any input. These problems have been solved in many ways (e.g., data undersampling/oversampling, custom weights for classes, etc.), but all of them required significant manual tuning and were not reliable. The problem was approached by researchers who tried to design a loss addressing it; the most notable result is probably by Lin et al. (“Focal Loss for Dense Object Detection,” ), and this loss is now taking its honorable place among tools helping to solve the data imbalance problem.

Focal loss (see figure 5.3) is a dynamically scaled cross-entropy loss where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples (more information can be found at ).

Figure 5.3 The suggested focal loss function focuses more on misclassified examples while reducing the relative loss for well-classified examples (source: Lin et al.).

Originally, this loss was introduced for the object detection problem specific to computer vision, and later, the approach expanded to many other domains, including those unrelated to images, like audio or natural language processing. The most distant application of the focal loss we have found has been introduced in the paper “Can Natural Language Processing Help Differentiate Inflammatory Intestinal Diseases in China?” (Tong et al.; ), which confirms how ideas spread across domains.

In some cases, a reasonable solution will be to combine multiple losses for a single model. The need for such an approach may arise with complex problems, often multimodal and often associated with multiple concurrent datasets. We will not provide many details on using combined loss functions here as it is research-heavy, but we would like to give some examples:

“Authentic Volumetric Avatars from a Phone Scan” (Cao et al.; ). The authors combined three families of losses (segmentation, reconstruction, perceptual). Generative computer vision models are often subject to considering combined losses.
“Highly Accurate Protein Structure Prediction with AlphaFold” (Jumper et al.; ). The famous AlphaFold 2 model predicts 3D shapes of proteins from their genetic sequence with impressive accuracy. That’s a huge thing for the biotech world, and it uses multiple auxiliary losses under the hood. For example, a masked language modeling objective, the one that is likely to be inspired by a loss function used in BERT-like architectures, is a popular family of natural language processing models.
“GrokNet: Unified Computer Vision Model Trunk and Embeddings for Commerce” (Bell et al.; ). This is a jewel among combined loss examples we can recall. The authors aim to build a single model to rule multiple problems, so they used 7 goods datasets and 83 (80 categorical and 3 embedding) losses!

In general, multiple losses are usually used either to help models’ convergence or to solve multiple adjustment problems with a single model.

While loss functions help set up and fine-tune accuracy and efficiency and minimize errors for your system while training, metrics are used to evaluate its performance within a certain set of parameters.

5.2 Metrics

The loss function we optimize and the metric we use to assess our model’s performance can be very different from each other. Recall that the end goal of the demand forecast system for Supermegaretail in chapter 4 was to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding an out-of-stock situation. If we try to visualize the pipeline, it might look like figure 5.4.

We know that a proper loss function is essential, but what about metrics? Can’t we pick some standard metrics, assess a variety of models, choose the best, deploy it, and estimate potential success through A/B tests?

Figure 5.4 A general-purpose pipeline for a demand forecast system that perfectly fits the Supermegaretail case

Unfortunately, no. Choosing the right set of metrics has to follow just as carefully an elaboration as selecting loss functions. Even more, while the set of popular losses is finite, there is always an opportunity to tailor a custom metric for a specific business domain. Choosing the wrong metric, in its turn, can cause misguided optimization when we set our model to train for irrelevant values, which eventually leads to poor performance in real-world scenarios. As a result, we have to roll back several steps in model development, resulting in a significant waste of time and resources. But even choosing the right metric for your ML system will not guarantee the project’s success.

Campfire story from Valerii

Some time ago, I was developing an ML system for a bank that regularly encountered the problem of nonpaying debtors. The system we were preparing had two main goals:

Reduce the number of delinquent payments
Make customers more responsive

As a metric, we chose the conversion rate of clients from nonpayers to payers.

The first thing we did was to implement a system of promised payments that worked as follows. Let’s say Mr. Smith gets a call from the bank: “Mr. Smith, you haven’t paid your loan on time. Can we expect you to pay the required amount within three days?” “Oh, of course, I will, I will,” says Mr. Smith. The people at the bank hang up and check the “promised to pay” box. But then Mr. Smith would break the promise and not pay anything.

The conversion rate by the time we started our work was 0.5, which means cases like that were occurring half the time. It’s not that bad but definitely not brilliant.

Given the attitude of people to such calls from banks and their desire to hang up as soon as possible, broken promises are a very common case. But the fact is, it’s a stick with two ends. On the one hand, the client won’t find it pleasant to talk to the bank, especially if they did not initiate the conversation. But the bank also isn’t interested in futile communication, having to overspend on call centers and employees.

As a solution, we built a system to predict the probability of clients agreeing to make their payment and fulfilling it. And we replaced human calls with text messages. This spared us from having to call our customers and talk them into making promises. The system was also supposed to predict customer behavior.

At the validation stage, the system showed a conversion rate of 0.9—almost twice as high as manual work! Two weeks later and in combat conditions, however, the conversion plummeted to 0.35, and we had only a week until making a report to our vice president.

Something had obviously gone wrong, and we needed to figure out what it was. We examined how this metric worked before, and it was pretty simple: if the client had promised to pay the debt on a certain day of the month but did not do it within 3 days, they were marked as debtors. Why was it 3 days? The answer is that the gap between an actual operation and getting information about this operation in the bank’s database was 3 days.

Let’s say you are supposed to make your next loan payment by EOD March 1. At the end of the day, March 1, you go to the bank after work and pay the required amount. On March 2, a system checks the database and sees that the payment has not been made (no wonder, as the information will not reach it until March 4). “Looks like we have a delinquent,” the system thinks and initiates a text message because, according to the data collected by the system, you have a high probability (90%!) of paying the required amount after receiving the message. Later on March 2, you get a text message from the bank asking you to pay the loan. “They must have got something wrong. I’ll let them know I’ve already paid,” you think and start filling out the form in the reply message. The problem is that the form does not allow you to enter a payment due date earlier than the current date. You can only specify that you will pay on March 2 or later. But you already paid on March 1. What do you do? You indicate that you paid on March 2 and submit the form. Three days later, the system checks the list of nonpayers, opens your profile, and sees that you promised to pay on March 2 but haven’t done that within 3 days from this date.

When we reconfigured the system, the conversion rate almost reached the initial value, getting as high as 0.8, but the interim problems we faced show how reaching your metrics can be hindered by flaws in the overall system behavior.

On the surface, a framework for picking the right metric is very straightforward: choose the one that is closest to the final goal. However, as the next campfire story will show, it might be very tricky to do. You can try either finding that metric yourself or using some outside help. The following are some options we recommend considering:

If you’re lucky enough to have a hierarchy of metrics, which we will cover later on in this chapter, use it to navigate to the metric you need.
Some companies have a dedicated department working on metrics; if that is the case, use their help.
If neither of these two options is the case, you might use product managers and data scientists to develop the best metric.
If the problem you’re set to solve is similar to a problem solved before and the solution proved to be solid and efficient, it is natural to transfer metrics from one project to another with certain modifications, if necessary.
If you have an A/B testing team, they also usually have enough knowledge to select or create a metric.

If you don’t have the luxury of having the things mentioned here, you can do the following:

Refer to the goals section in the design and align with it (it is essential to refresh what the end goals are, not how you remembered them). Knowing your goals will help you understand which metrics will help you achieve those goals or at least help you discard obviously inappropriate metrics.
Try to decompose the end goal by writing a map similar to the hierarchy of metrics (see section 5.2.2). It will probably take more than one stage to achieve, but this kind of exercise will help you break down your big goal into several smaller components, each with its own metric. Having many small parts on hand will help assemble the greater whole.
Find the best metrics describing the success of each stage.
If something is hard to measure directly, replace it with proxy metrics (see section 5.2.2). Proxy metrics will allow you to gather necessary and very important information before your system goes into release.
With this map, pick the metric that either represents the most critical stage or summarizes the map in the best possible way.

In the next campfire story, we will review the canonical binary classification problem.

Campfire story from Valerii

Recently, I had a conversation with a friend of mine regarding the evaluation of fraud models. Fraud models usually try to solve binary classification tasks where 0 is nonfraud and 1 is fraud.

No metric is ideal, and it always depends on the final goal. However, when we speak about fraud models, we usually want to maintain a ratio of fraud to legit transactions of some level. If we had 10 times more transactions, it would be okay to have 10 times more fraud, but not 20 or 30 times more. In other words, we want to have a probabilistic model.

Also, fraud usually belongs to the class imbalance problem, and that balance is not stable through time. One day the ratio can be 1:100 (outburst of fraudulent transactions), the next day, 1:1000 (an ordinary day), and the day after, 1:10,000 (fraudsters took a vacation).

The most popular set of metrics for this family of models is precision and recall, which may not be the best choice.

The problem with precision is that its calculations take both classes into account:

Imagine that we have a model that has a probability of 95% to predict that fraud is fraud (true positive [TP]) and 5% to predict that nonfraud is fraud (false positive [FP]).

Let’s review three scenarios where P is the number of positive samples and N is the number of negative samples:

P = 10,000, N = 10,000,

P = 100,000, N = 10,000,

P = 1000, N = 10,000,

As you can see, the class balance affected the metric significantly even when nothing else changed.

Now let’s take a look at recall (Recall = TP/(TP+FN) = TP/P = True Positive Rate [TPR]) and examine the same three scenarios:

P = 10,000, N = 10,000,

P = 100,000, N = 10,000,

P = 1000, N = 10,000,

In this case, the class balance didn’t affect the metric at all.

There is also a metric called specificity that can replace precision: inline figure

The same three examples show the following:

P = 10,000, N = 10,000,

P = 100,000, N = 10,000,

P = 1000, N = 10,000,

Recall and specificity do not change because of class imbalance, as these metrics are class-balance insensitive.

Initially, my friend created a notebook () to prove me wrong. The following code demonstrates his train of thought:

import numpy as np    def gen_labels_preds(fraud, genuine, fraud_predicted, correct_fraud_predicted):    labels = np.concatenate([np.repeat(True, fraud), np.repeat(False, genuine)])    preds = np.concatenate([      np.repeat(True, correct_fraud_predicted), # TP      np.repeat(False, fraud - correct_fraud_predicted), # FP      np.repeat(True, fraud_predicted - correct_fraud_predicted), # FN      np.repeat(False, genuine - (fraud_predicted - correct_fraud_predicted)) # TN      ])      return labels, preds    def calculate_metrics(labels, preds):    TP = (preds & labels).sum()    FP = (preds & ~labels).sum()    TN = (~preds & ~labels).sum()    FN = (~preds & labels).sum()      recall = TP / (TP + FN)    precision = TP / (TP + FP)    FPR = FP / (FP + TN)      return recall, precision, FPR

He devised two models with the following metrics:

A has 20 false positives, and 80% of the fraud is caught.
B has 920 false positives, and 80% of the fraud is caught.

Then he tried his two models in three scenarios with different numbers of transactions and fraud cases. In scenario 1, the number of transactions was 100,000. Overall, there were 100 fraud cases, so the class balance was 1:1,000:

fraud = 100 # high imbalance  genuine = 100000  model_A_FP = 20  model_B_FP = 920      # Model A  a_total_fraud_predicted = model_A_FP + fraud*0.8  a_correct_fraud_predicted = fraud*0.8    a_labels, a_preds = gen_labels_preds(fraud, genuine,  a_total_fraud_predicted, a_correct_fraud_predicted)  a_recall, a_precision, a_FPR = calculate_metrics(a_labels, a_preds)    # Model B    b_total_fraud_predicted = model_B_FP + fraud*0.8   # Flags many more transactions  b_correct_fraud_predicted = fraud*0.8  b_labels, b_preds = gen_labels_preds(fraud, genuine,b_total_fraud_predicted, b_correct_fraud_predicted)  b_recall, b_precision, b_FPR = calculate_metrics(b_labels, b_preds)    print("Model A Performance Metrics:")  print('TPR:', a_recall)  print("Precision:", a_precision)  print("FPR:", a_FPR)    print("\nModel B Performance Metrics:")  print('TPR:', b_recall)  print("Precision:", b_precision)  print("FPR:", b_FPR)    Model A Performance Metrics:  TPR: 0.8  Precision: 0.8  FPR: 0.0002    Model B Performance Metrics:  TPR: 0.8  Precision: 0.08  FPR: 0.0092

In scenario 2, he used the same metrics as in scenario 1. The number of transactions was 100,000. Overall, there were 10 fraud cases, so the class balance was 1:10,000:

fraud = 10 # high imbalance  genuine = 100000    # Model A  a_total_fraud_predicted = model_A_FP + fraud*0.8  a_correct_fraud_predicted = fraud*0.8    a_labels, a_preds = gen_labels_preds(fraud, genuine, a_total_fraud_predicted, a_correct_fraud_predicted)  a_recall, a_precision, a_FPR = calculate_metrics(a_labels, a_preds)    # Model B  b_total_fraud_predicted = model_B_FP + fraud*0.8 # Flags many more transactions  b_correct_fraud_predicted = fraud*0.8    b_labels, b_preds = gen_labels_preds(fraud, genuine, b_total_fraud_predicted, b_correct_fraud_predicted)  b_recall, b_precision, b_FPR = calculate_metrics(b_labels, b_preds)  print("Model A Performance Metrics:")  print('TPR:', a_recall)  print("Precision:", a_precision)  print("FPR:", a_FPR)    print("\nModel B Performance Metrics:")  print('TPR:', b_recall)  print("Precision:", b_precision)  print("FPR:", b_FPR)    Model A Performance Metrics:  TPR: 0.8  Precision: 0.2857142857142857  FPR: 0.0002    Model B Performance Metrics:  TPR: 0.8  Precision: 0.008620689655172414  FPR: 0.0092

In scenario 3, he again used the same metrics and 100,000 transactions. Overall, there were 1,000 fraud cases, so the class balance was 1:100:

fraud = 1000 # high imbalance  genuine = 100000    # Model A  a_total_fraud_predicted = model_A_FP + fraud*0.8  a_correct_fraud_predicted = fraud*0.8    a_labels, a_preds = gen_labels_preds(fraud, genuine, a_total_fraud_predicted, a_correct_fraud_predicted)  a_recall, a_precision, a_FPR = calculate_metrics(a_labels, a_preds)    # Model B  b_total_fraud_predicted = model_B_FP + fraud*0.8 # Flags many more transactions  b_correct_fraud_predicted = fraud*0.8    b_labels, b_preds = gen_labels_preds(fraud, genuine, b_total_fraud_predicted, b_correct_fraud_predicted)  b_recall, b_precision, b_FPR = calculate_metrics(b_labels, b_preds)    print("Model A Performance Metrics:")  print('TPR:', a_recall)  print("Precision:", a_precision)  print("FPR:", a_FPR)  print("\nModel B Performance Metrics:")  print('TPR:', b_recall)  print("Precision:", b_precision)  print("FPR:", b_FPR)  Model A Performance Metrics:  TPR: 0.8  Precision: 0.975609756097561  FPR: 0.0002    Model B Performance Metrics:  TPR: 0.8  Precision: 0.46511627906976744  FPR: 0.0092

Model A is better according to both the receiver operating characteristic area under the curve (ROC AUC) and precision-recall AUC (PR AUC) metrics. Model B is a bad model but still gets a very good FPR (0.0092), even though, if it were put into production, the predictions would be rubbish (920 out of 1,000 fraud predictions would be incorrect). Precision allows us to see this. It’s just 0.08 for model B, so we’d never even think about putting it close to production.

What is the fallacy here?

First, model B has an FPR of 0.0092, which is 46 times higher than model A, with its FPR of 0.0002. There is no good or bad FPR. It depends on your volume, and even a slight difference might turn out to be huge. For example, 0.99 has a 10 times higher case ratio than 0.999 (1:100 vs. 1:1000).

But even in the notebook example, while precision is only 10 times worse, the FPR of model B is 46 times worse; it’s hard to call this a very good FPR.

As you can see from the previous calculations and the notebook, precision shows a very different number when there is a shift in class balance, even when the model’s performance stays the same. In contrast, both TPR and FPR remain unchanged.

How do we combine this information and apply it to pick proper metrics?

In one of the companies we worked for, we had a goal to reduce spam and fraudulent behavior with more than 100,000,000,000 events per day. We set specificity to be at least 0.999999 (Specificity = TNR = 1 – FPR [in other words, we were okay to have one false positive per 1 million events]) and maximized recall (TPR) at that specificity rate. This proved to be more beneficial than using a standard recall-precision pair, given the volatile nature of underlying data.

Some cases, however, force you to improvise in order to find the metric that will be able to obtain a required behavior pattern from your system.

Campfire story from Arseny

I worked for a manufacturing optimization company and needed to improve a defect in its detection system, but in the midst of the process, another problem emerged: the metrics were not sensitive enough. The datasets required for running the planned scenario were too small—only 10 to 20 defective samples per customer product. And we couldn’t get any more data because there were simply no more existing defective units. The defect ratio was just too low, thanks to the high engineering quality.

Besides the dataset size, our customers weren’t interested in intermediate results (e.g., how calibrated the defect probability of our model was). Their judgment was very straightforward. For the sake of simplicity, let me frame it like this:

There are 10 defective units and N regular units.
An ideal scenario is to have 0 errors.
1 false positive or 1 false negative is good enough.
Otherwise, the system is unusable.

Most of my attempts to improve the system as is were fruitless, until at some point I decided to design a custom continuous metric that utilized the internal metrics and had reasonable thresholds. The metric appeared very discrete:

“0” would mean “perfect system.”
“1” would be “good enough.”
“2” would stand for “garbage.”

With this metric in place, I was able to start improving the system gradually, step by step, while being confident that I was moving in the right direction.

After a series of minor improvements, the cumulative effect transformed the system from “garbage” to “good enough” and from “good enough” to “perfect” for multiple customers.

One important factor in the success of your ML system will always be its consistency. To achieve this, there is a separate category of metrics, which we cover in the following section.

5.2.1 Consistency metrics

In applied ML, a model that has a consistent output when presented with slightly perturbed inputs is often desired. This property, known in different subfields as consistency, robustness, stability, or smoothness, can be formally defined as the requirement that the model be invariant under certain transformations, such that the difference between the model’s output on the original input and the model’s output on the perturbed input tends toward zero. In other words, we can express this property mathematically as

where f represents the model, x represents the original input, and eps represents the perturbation applied to the input. Consistency metrics are not commonly discussed in academic ML but are an important consideration in practical applications where small changes to the input can have significant effects on the model’s output from the product perspective.

Perturbations can be different. For example, for a solid computer vision model, a minor change of lighting usually should not change model outputs, or a sentiment analysis model should not be sensitive to changing words with synonyms. We will talk about such perturbations and invariants in more detail later, when discussing ML system testing in chapter 10.

There’s another similar property: when the model is retrained (e.g., with the addition of new data or even with other seeds), we expect it to produce the same or close outputs, given that inputs remain unchanged. For an antifraud system, it is not acceptable if the same user is considered fraudulent today, legitimate tomorrow, and a fraudster again next week:

When the model outputs are different over time, the release of a new model (which should be a routine procedure for most ML systems) may affect the downstream system or end users of the system, disturbing their common usage scenarios. People rarely like unexpected changes in their tools and environment.

Such properties can be as important as default features we expect from a model (such as accurate predictions) because they shape expectations. As we discussed in earlier chapters, if a model can’t be trusted, its utility is reduced. Thus, we need specific metrics to measure this kind of behavior.

Luckily, we formulated these properties strictly enough, so the biggest open question left is to estimate a proper type of noise or perturbation for the preceding formulas: what are the invariants, and how are the conditions expected to change over time?

With these estimations in place, you can attach your regular metrics to estimate consistency. For example, for the search engine example (Photostock Inc.), we don’t want a document to change its rank for some query between releases of your system, and so the consistency metric could be a variance of ranks for the pair (query, document) over some time over corpora of documents and queries. Obviously, the less the variance is, the better it is for the system. Still, you can’t forget about ill-posed situations—say, a dummy constant model tends to provide the lowest variance, but that’s not the consistency ML engineers usually hunt for.

Consistency is often an important property of an ML system (see figure 5.5). If it’s the case for your system, consider adding a metric reflecting how your system responds to the changes to input data, training data, or training procedure tweaks.

Figure 5.5 New model releases are fairly consistent when estimating the probability (P) of the user (U) being fraudulent (F).

Eventually, you will be able to form a single metrics system based on a clear hierarchy of offline and online metrics.

5.2.2 Offline and online metrics, proxy metrics, and hierarchy of metrics

Setting and improving appropriate metrics is an important step in building an efficient ML system. But even that is not our end goal, as we have to go one level deeper into the rabbit hole. When we had a plan to reduce spam and fraudulent behavior, the goal was not to have the highest recall at a given specificity. It was to improve the user experience by lowering the number of spam messages and making it safer by reducing the risk of fraudulent behavior.

In the Supermegaretail case, the goal was to reduce losses due to out-of-stock and overstock situations, which can be expressed in cash equivalent, but not mean absolute error (MAE), mean square error (MSE), weighted mean absolute percentage error (wMAPE), weighted absolute percentage error (WAPE), or any other metric.

In other words, the metric we used to assess the model during the training/testing/validation stages and the final metrics are rarely the same (see table 5.1).

The previously discussed set is also called offline metrics because we can apply and calculate them without deploying the model into production. In contrast, some metrics, usually our goal metrics, can be calculated only after implementing the system and using its output in the business. And although sometimes offline and online metrics might coincide, we still have to assess them differently. The most common way to evaluate online metrics (change/improvement) is through A/B testing.

We use offline metrics for a simple reason: we can use them before deploying the system. This method is quick and reproducible, and it doesn’t require an expensive model deployment process. Offline metrics must have one quality: they must be a good predictor of online metrics. In other words, an increase or decrease in offline metrics has to be either strongly correlated or proportional to an increase/decrease in online metrics. Offline metrics play the role of proxy metrics for online metrics and can be used as efficient predictors of online metrics.

Table 5.1 Examples of offline and online metrics

Offline metrics	Online metrics
Recall at given specificity for spam message classification	Number of user complaints about spam messages
Quantiles of 1.5, 25, 50, 75, 95, and 99	Value of expired items, total sales
Mean reciprocal rank, normalized discounted cumulative gain	Click-through rate on search engine result page

But if we can find offline metrics that are strongly correlated with our online metrics with the improvement being transitive, we can do the same for offline metrics. Let’s use an example to review this.

Imagine that we are building a recommender system for an eCommerce website. Our final goal is to increase gross merchandise value (GMV; this is a metric that measures the total value of sales over a given period). Unfortunately, as mentioned already, this is not something we can measure until we deploy our system into production and run A/B tests. We believe that increasing the number of items purchased will increase GMV. To achieve that, we want to increase the conversion rate by providing users with an offer that has a higher chance of being purchased (assuming this will increase the overall number of purchased items).

On average, 3% of offers end up being clicked, and 3% of those lead to a purchase: 3% times 3% means that if we show 10,000 offers, only 9 will lead to a purchase. This has two adverse, interconnected consequences:

Low amount of class 1 data (purchase), huge class imbalance
Increased A/B test duration

For example, for A/B tests with a 9/10,000 ratio of success to attempts, we would need 100 times more data than for the 90/10,000 ratio (quadratic dependency between a minimum detectable effect and a number of samples; please see the following for an example).

To mitigate that, we can use a proxy metric, click-through rate (CTR), with the following context in mind:

No purchase can be made without a click. We can expect a positive correlation between the CTRs and conversion rates (CRs) and even calculate it.
There are 33.3 times more clicks than purchases, meaning that we will have 33.3 times more training data for class 1 of the system, and A/B tests will become 1,111 (33.(3)^2) times faster. (To be precise, we can expect that variance will change as well, as , so with , var = 0.000899 and with , var = 0.0291, meaning that overall we will increase the speed of convergence by times.)

Using CTR instead of CR helps us iterate faster and with higher sensitivity, both offline (estimating metrics and loss is easier with more data for the class of interest) and online (at least partly through A/B testing).

We can represent this in the following relation:

CTR → CR → (overall number of purchased items) → GMV

We can further generalize this by building a hierarchy of metrics:

The global, company-wide metric is revenue.
Global revenue (GMV) is composed of the revenue from different products, including the product we are responsible for.
Our product revenue is affected by
- Average purchase price
- Purchase frequency
- Number of users (they are interconnected and have mutual influence, thus dotted lines)
Purchase frequency is affected by CR.
The conversion rate is affected by CTR.

A hierarchy of metrics (see figure 5.6) facilitates finding proper proxy metrics. Even though creating it lies outside the scope of designing an ML system, it will be handy to have one in place and refer to it during the design process. Using a common ground helps prove the choice and reduces the risk of failure.

Figure 5.6 Hierarchy of metrics

A hierarchy of metrics is especially important when the system gets mature enough so that some metrics can be contradictory. A friend of ours once told us a short anecdote about building a recommendation system: a variant that demonstrated higher engagement by internal users (they preferred new recommendations over previous versions) appeared to be way less profitable on a wider audience.

A hierarchy of metrics and proxy metrics concepts are connected to the multicomponent losses we discussed earlier. For example, when building this recommender engine for Supermegaretail, we can tailor a specific loss function that will consider multiple levels of user activity (clicks, purchases, total amount of purchased items) and balance our interest between metrics.

Campfire story from Arseny

Once I worked on a brand-new product feature based on computer vision. The proposed solution was broken down into components, with each component and subcomponent carefully annotated with metrics. Due to the innovative nature of the feature, the metrics were custom—mostly ratios between various possible outcomes. We designed the metrics hierarchy in collaboration with a product executive. After several experiments aimed at moving the needle for one of the metrics, I developed a gut feeling that it was imbalanced. To test this, I ran an adversarial experiment by replacing the model predictions with random noise generated with specific parameters. Surprisingly, the random model scored perfectly! The metric was originally designed to favor recall over precision, but such an extreme imbalance was clearly not desirable, so we had to redesign it as soon as possible.

5.3 Design document: Adding losses and metrics

Starting in chapter 4, we began to introduce design documents for two fictional cases: Supermegaretail and PhotoStock Inc. Here we continue to elaborate on the development of ML solutions for each case and cover the selection of loss functions and losses. We start with Supermegaretail followed by PhotoStock Inc.

5.3.1 Metrics and loss functions for Supermegaretail

Let’s refresh our memory on the Supermegaretail case. There, we were to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding an out-of-stock situation with a specific service-level agreement (SLA) to be specified further.

Design document: Supermegaretail

II. Metrics and losses

i. Metrics

Before picking up a metric on our own, it makes sense to do some preliminary research. Fortunately, there are many papers related to this problem, but the one that stands out is “Evaluating Predictive Count Data Distributions in Retail Sales Forecasting” by Stephen Kolassa ().

Let’s recall the project goal, which is to reduce the gap between delivered and sold items, making it as narrow as possible while avoiding an out-of-stock situation with a specific SLA to be specified further. To do that, we plan to forecast the demand for a specific item in a specific store during a particular period using an ML system.

In this case, this paper’s abstract looks like an almost perfect fit:

Massive increases in computing power and new database architectures allow data to be stored and processed at increasingly finer granularities, yielding count data time series with lower and lower counts. These series can no longer be dealt with using approximative methods appropriate for continuous probability distributions. In addition, it is not sufficient to calculate point forecasts alone: we need to forecast the entire (discrete) predictive distributions, particularly for supply chain forecasting and inventory control, but also for other planning processes.

(Count data is an integer-valued time series. It is essential for the supply chain forecasting we are facing, where most products are sold in units.) With that in mind, we can briefly review this paper (within the following lettered list) and pick the metrics that are most appropriate for our end goal.

A. Measures based on absolute errors

MAE optimizes the median; the weighted mean absolute percentage error (wMAPE) is MAE divided by the mean of the out-of-sample realizations, and the mean absolute scaled error is obtained by dividing the MAE by the in-sample MAE of the random walk forecast.

Optimizing for the median does not differ much from optimizing for the mean in a symmetric predictive distribution. However, the predictive distributions appropriate for low-volume count data are usually far from symmetric, and this distinction makes a difference in such cases and yields biased forecasts.

B. Percentage errors

The mean absolute percentage error (MAPE) is undefined if any future realization is zero, so it is singularly unsuitable for count data.

The symmetric MAPE is a “symmetrized” version of the MAPE, which is defined if the point forecasts and actuals are not both zero at all future time points. However, in any period with a zero actual, its contribution is 2, regardless of the point forecast, making it unsuitable for count data.

C. Measures based on squared errors

Minimizing the squared error leads naturally to an unbiased point forecast. However, the MSE is unsuitable for intermittent-demand items because it is sensitive to very high forecast errors. The same argument stands for nonintermittent count data.

D. Relative errors

Prominent variations are the median relative absolute error and the geometric mean relative absolute error.

In the specific context of forecasting count data, these suffer from two main weaknesses:

Relative errors commonly compare absolute errors. As such, they are subject to the same criticism as MAE-based errors, as detailed earlier.
On a period-by-period basis, simple benchmarks such as the naive random walk may forecast without errors, and thus, this period’s relative error would be undefined because of a division by zero.

E. Rate-based errors

Kourentzes (2014) recently suggested two new error measures for the intermittent demand: MSR and MAR, which aim to assess whether an intermittent demand point forecast captures the average demand correctly over an increasing period of time. This is an interesting suggestion, but one property of these measures is that they implicitly weigh the short-term future more heavily than the mid- to long-term future. One could argue that this is exactly what we want to do while forecasting, but even then, a case could be made that such weighting should be explicit—by using an appropriate weighting scheme when averaging over future time periods.

F. Scaled errors

Petropoulos and Kourentzes (2015) suggest a scaled version of the MSE, the sMSE, which is the mean over squared errors that have been scaled by the squared average actuals over the forecast horizon. The sMSE is well-defined unless all actuals are zero, is minimized by the expectation of f, and, due to the scaling, can be compared between different time series. In addition (again because of the scaling), it is not quite as sensitive to high-forecast errors as the MSE. Specifically, it is more robust to dramatic underforecasts, although it is still sensitive to large overforecasts.

G. Functionals and loss functions

An alternative way of looking at forecasts concentrates on point forecasts that are functionals of the predictive distribution. One could argue that a retailer aims at a certain level of service (say 95%) and that therefore they are only interested in the corresponding quantile of the predictive distribution. This would then be elicited with appropriate loss functions or scoring rules. This approach is closely related to the idea of considering forecasts as part of a stock control system. From this perspective, quantile forecasts are used as inputs to standard stock control strategies, and the quality of the forecasts is assessed by valuing the total stock position over time and weighing it against out-of-stocks.

Though the authors did not see this as the best solution and proposed an alternative, the last paragraph of the paper is quite promising. Not only does it make sense from a business perspective to predict different quantiles to uphold SLA, but it is desirable from the point of view of having the loss function equal to the metric. Thus, quantile metrics for quantiles of 1.5, 25, 50, 75, 95, and 99 look like a proper choice. Moreover, suppose we need to pay more attention to a specific SKU, item group, or cluster. In that case, quantile metrics support the calculation of object/group weights (for example, item price).

i.ii. Metrics to pick

Quantile metrics for quantiles of 1.5, 25, 50, 75, 95, and 99 both as is and with weights equal to SKU price and an additional penalty for underforecasting or overforecasting if deemed necessary are calculated as point estimates with 95% confidence intervals (using bootstrap or cross-validation). In addition, we can further transform this metric, representing it not as an absolute value but as an absolute percentage error at a given quantile. All consideration from the Petropoulos and Kourentzes article regarding percentage errors have to be taken into account. Ultimately, a set of experiments will help to decide a final form. We will probably have both, as it makes sense to check both absolute values in money/pcs and percentage error.

Online metrics of interest during A/B test are

Revenue—expected to increase
Level of stock—expected to decrease or maintain the same
Margin—expected to increase

Alpha—coefficient used in quantile-based losses
W—Weights
I—Indicator function
A—Model output
T—Label

ii. Loss functions

With metrics equal to our loss functions, it is straightforward to pick the latter. We will train six models using a quantile loss of 1.5, 25, 50, 75, 95, and 99, resulting in six different models, providing us with various guarantees for the corresponding quantile of the predictive distribution.

As a second line of experimentation, we will additionally review the Tweedie loss function. Tweedie distributions are a family of probability distributions, including the purely continuous normal, gamma, and inverse Gaussian distributions; the purely discrete scaled Poisson distribution; and the class of compound Poisson–gamma distributions that have positive mass at zero but are otherwise continuous. These qualities make it an attractive candidate for our Count data.

5.3.2 Metrics and loss functions for PhotoStock Inc.

Next up is the PhotoStock Inc. design document, where a whole different set of losses and metrics should be applied based on the nature of the business case and the problem to be solved. In the case of PhotoStock Inc., we were hired to build a modern search tool that can find the most relevant shots based on customers’ text queries while providing excellent performance and displaying the most relevant images in stock.

Design document: PhotoStock Inc.

II. Metrics and loss functions

i. Metrics

When choosing metrics for a new PhotoStock search engine, we should keep in mind the expected behavior of the system, which includes the following:

Users click on links in search results, with higher results getting more clicks. This behavior can be reflected in the CTR metric, which evaluates how many users click on search results.
Users purchase images they find via search. This behavior can be reflected in the CR metric, which evaluates how many clicks lead to purchase.
Users see diverse suggestions on the search engine result page (SERP). There are no ready-to-go solutions here because we don’t have a solid definition of diversity. Let’s discuss it later with the UX team. As a baseline, we can use the number of different categories of images represented on SERP as a measure of diversity. In the future, we should research other companies’ experiences—Airbnb’s paper “Learning to Rank Diversely at Airbnb” ().
Search results look reasonable from the human perspective. This behavior can be reflected in the metric of human evaluation, which displays how many users think that search results are reasonable.

CTR and CR are online metrics, which means that they can only be measured when the system is live. Diversity is an unsupervised offline metric, which means that it doesn’t require any additional data and can be measured on a regular basis at no cost. Human evaluation, on the other hand, is a supervised offline metric, which means that it requires additional data (human evaluation) and thus takes time and effort to collect.

To introduce offline proxy metrics for CTR and CR, we can use classic metrics for ranking problems, such as mean reciprocal rank (MRR) and normalized discounted cumulative gain (NDCG). MRR is a metric that calculates the average of the reciprocal ranks for a given set of results, which is a measure of the mean of the inverse of the rank for the first relevant result. NDCG is a metric that calculates the average of discounted cumulative gains (DCGs) for a given set of results, which is a measure of the sum of relevance scores taken from the first N results divided by the ideal DCG. In its turn, DCG is the sum of relevance scores among the first N results in the order of decreasing relevance.

Both MRR and NDCG require a list of relevant results for each query to calculate the metrics. We can use the same list of relevant results for both MRR and NDCG, but we need to create this list using crowdsourcing to ensure that it is representative of the results that users are likely to see. While MRR may be appropriate as an offline metric for CTR, it may not be a good proxy for CR because a crowdsourced list of relevant results is not representative of the real purchase data. Therefore, to accurately measure CR, we should consider using real purchase data. However, for the first version of the system, we may only be able to monitor CR online using A/B tests and a gradual rollout.

To summarize, here’s how we can divide metrics:

Fast offline metrics: MRR, NDCG, diversity
Slow offline metrics: human evaluation
Online metrics: CTR, CR

ii. Losses

To use loss functions for training a search engine, it is important to consider available data and desired outcomes. In this case, the three main aspects we would like to optimize for are clicks, purchases, and diversity.

For the clicks and purchases aspect, we can use binary cross-entropy loss as a measure of success. However, it’s important to note that the data for clicks and purchases may be imbalanced, meaning that there may be more examples of one class than the other. In such cases, it may be beneficial to use a loss function that is more robust to class imbalance, such as focal loss or other loss functions designed for this purpose.

Focal loss is a loss function that was introduced in the paper “Focal Loss for Dense Object Detection” (). It is a generalization of a binary cross-entropy loss commonly used in classification tasks. The key difference between a focal loss and a binary cross-entropy loss is that focal loss down-weights easy examples, which are those examples that are classified correctly with high confidence. This is useful in cases where the data is imbalanced, as it helps the model to focus on the hard examples, which are typically more important for improving the overall performance of the model, so it seems relevant for the PhotoStock search engine.

As for the diversity aspect, we can add a term to the loss function that penalizes the similarity in results. One potential way to do this is to use the entropy of the category distribution of the results as a measure of diversity. However, this approach may not always be feasible, so the diversity loss should be considered optional.

Overall, the final loss function can be written as

Here, alpha, beta, and gamma are represented as hyperparameters that control the relative importance of the three components. These hyperparameters can be tuned to find the optimal balance between the three aspects.

5.3.3 Wrap up

The examples from these two design documents show how important it is to choose the right metrics and loss functions. Just like any other key element in building an ML system, metrics and loss functions should coincide with the goals of your project. And if you feel there’s more time needed to define the appropriate parameters, please find a few days in your schedule to do it so you don’t have to roll back a few miles in a month or more.

The next chapter covers data gathering, datasets, the difference between data and metadata, and how to achieve a healthy data pipeline.

Summary

Don’t fall into the temptation of using time-tested loss functions just because they worked on your previous project(s).
A loss function must be globally continuous and differentiable.
Loss selection is an important step, but it is even more crucial with deep learning-based systems.
Consider applying consistency metrics when small changes to the inputs can have significant effects on the output of your model from the product perspective.
Offline metrics can be applied before putting your project into production and play the role of proxy metrics for online metrics.
Make sure to have the hierarchy of metrics at hand, as it will be useful while working on the design of your system.