In the preceding chapters, we discussed the building blocks that form the backbone of a machine learning (ML) system, starting with data collection and model selection, continuing with metrics, losses, and validation split, and ending with a comprehensive error analysis. With all these elements firmly established in the pipeline, it’s time to circle back to the initial purpose of the system and think about proper reporting of the achieved results.
Reporting consists of evaluating our ML system’s performance based on its final goal and sharing the results with teammates and stakeholders. In chapter 5, we introduced two types of metrics: online and offline metrics, which generate two types of evaluation: offline testing and online testing. While offline testing is relatively straightforward, online testing implies running experiments in real-world scenarios. Typically, the most effective approach involves a series of A/B tests, a crucial procedure in developing an efficient, properly working model, which we cover in section 12.2. This helps capture metrics that either directly match or are highly correlated with our business goals.
Measuring results in offline testing (sometimes called backtesting) is an attempt to assess what effect we can expect to catch during the online testing stage, either directly or through proxy metrics. Online testing, however, is often a trickier story. For example, recommender systems rely heavily on feedback loops, so the training data depends on what we predicted at the previous timestamps. Moreover, we don’t know exactly what would have happened if we had shown item X or Y to user A instead of item Z, which appeared in historical data at timestamp T.
Finally, when the experiment is done, we are ready to report the effect of the model on the business metrics we are interested in. What do we report? How do we present the result? What conclusion should we make about the further steps of system elaboration? In this chapter, we answer questions that might arise during the process.
Before implementing an ML system, we should fully understand what goal we aim to achieve—that’s why our design document should begin with the goal description. Likewise, before running an experiment to check how a change affects our system (in this chapter, we mostly narrow our focus to A/B tests), we design it and outline a hypothesis covering how we expect the given change to affect metrics. This is where offline testing comes into play, as using offline evaluation as a proxy for online evaluation is a valid and beneficial approach. Its objective is to swiftly determine whether the new (modified) solution is better than the existing one and, if so, to quantify by what margin.
As we already mentioned, metrics are interconnected within a hierarchical structure. There are proxy metrics for other proxy metrics for metrics of our interest (please see chapter 5). Consequently, there are plenty of ways to conduct the offline evaluation.
The first layer of evaluation usually involves a basic estimation of ML metrics, which we introduced in chapter 7. We assume that alongside offline metrics (common metrics like root mean square error, mean absolute percentage error, or normalized discounted cumulative gain), we also improve business metrics (average revenue per user or click-through rate [CTR]).
Prediction evaluation is the fastest way to check the model’s quality and iterate through different versions, but it is also usually the farthest from business metrics (the relationship between prediction quality and online metrics is often far from ideal).
The offline evaluation procedure is trustworthy and valuable for assessing quality. However, there is rarely a direct connection between offline and online metrics, and learning how an increase in offline metrics A, B, and C affects online metrics X, Y, and Z is a regression problem by itself. To bridge the gap between offline and online metrics and make offline testing more robust, we could gather a history of online tests and online metrics and related offline metrics to calculate the correlation between offline and online results.
In some scenarios, we can transition from model predictions to real-world business metrics. To illustrate this, let’s consider the example of forecasting airline ticket sales. Often, the fundamental model performance metrics we use for forecasting are the weighted average percentage error (WAPE) and bias:
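For reference, the standard definitions are (with $y_t$ the actual and $\hat{y}_t$ the forecasted value for period $t$):

```latex
\mathrm{WAPE} = \frac{\sum_t |y_t - \hat{y}_t|}{\sum_t y_t},
\qquad
\mathrm{bias} = \frac{\sum_t (\hat{y}_t - y_t)}{\sum_t y_t}
```

WAPE penalizes errors in both directions, while bias tells us whether the model systematically under- or overpredicts.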
Suppose we have two models predicting ticket sales for a specific flight at a certain hour, one with a bias of −10% and the other with +10%. For illustration, a +10% bias means we forecast 110 tickets sold for a given period of time while the actual number of tickets sold was 100. Also, let’s assume we have the actual ticket prices for each day.
Our goal is to avoid overbooking (missed revenue due to offering too many discounted tickets) and minimize unsold seats (lost revenue). Assume the number of tickets sold reflects the passengers’ actual demand and that we adjust prices at the beginning of each day based on our forecast. We are okay with this rough estimation for the sake of illustration.
The actual number of tickets sold during the 4 days is [120, 90, 110, 80]. The predicted ticket sales are [108, 81, 99, 72] for model A (10% below actual) and [132, 99, 121, 88] for model B (10% above actual).
The model biases are (360 − 400) / 400 = −10% for model A and (440 − 400) / 400 = +10% for model B.
The model WAPEs are identical: 40 / 400 = 10% for both models.
Both models have the same WAPE and bias, but model A has a negative bias of 10% (and tends to underpredict), while model B has a positive bias of 10% (and tends to overpredict).
The ticket prices for each day are [$200, $220, $230, $210].
The total revenue (we choose the minimum of sold and forecasted tickets for each day) is approximately $77,310 for model A and $85,900 for model B.
As you can see, model B generates roughly $8,600 more revenue for a single flight within just 4 days (while the flight’s total cost remains unchanged). When multiplied by hundreds of flights and days, this difference will result in a substantial revenue gain.
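The arithmetic behind this example can be checked with a short script (the ±10% forecasts below follow the bias setup described above):

```python
actual = [120, 90, 110, 80]      # tickets actually sold per day
prices = [200, 220, 230, 210]    # ticket price per day, $

pred_a = [round(y * 0.9) for y in actual]   # model A underpredicts by 10%
pred_b = [round(y * 1.1) for y in actual]   # model B overpredicts by 10%

def wape(pred, actual):
    return sum(abs(p - y) for p, y in zip(pred, actual)) / sum(actual)

def bias(pred, actual):
    return (sum(pred) - sum(actual)) / sum(actual)

def revenue(pred, actual, prices):
    # We can only sell as many tickets as we released based on the forecast,
    # and no more than the actual demand.
    return sum(min(p, y) * price for p, y, price in zip(pred, actual, prices))

print(wape(pred_a, actual), bias(pred_a, actual))   # 0.1 -0.1
print(wape(pred_b, actual), bias(pred_b, actual))   # 0.1 0.1
print(revenue(pred_b, actual, prices) - revenue(pred_a, actual, prices))  # 8590
```

The exact difference is $8,590, which rounds to the roughly $8,600 quoted above.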
Certainly, we’ve oversimplified the real-world context where the model makes forecasts and affects decision-making (e.g., how many tickets to release for sale, how to adjust prices dynamically). Here, the whole transition is done by multiplying by price and, optionally, subtracting flight costs.
Nonetheless, this example illustrates how we can transition from model performance metrics, which may not fully capture the practical effect of predictions, to business indicators like revenue. These business metrics can be reported to the team before running A/B tests.
When the stakes are high, it is reasonable to invest in a more sophisticated, computationally expensive, but more robust offline testing procedure. Building a simulator of the online environment for the problem at hand and running an algorithm against this simulator can become such an investment.
A good example of this kind of environment would be time-series forecasting, which is relatively simple to simulate thanks to the availability of labels, auxiliary information (e.g., historical costs or turnover of goods in stock), and the lack of ambiguity in the outcome. That is usually not the case for online recommendation systems, such as news, ads, or product recommendations, where we deal with partly labeled data: we know only the impressions of content items (views, clicks, and conversions) that we showed to a particular user, while the items we did not show remain without user feedback. There is always a chance that, had we produced an alternative order of impressions, everything might have been very different.
Let’s review how such an environment could be created in the case of a recommender system.
Sometimes online recommendation systems are solved via the multi-armed bandit and contextual bandit approaches (usually, this subset of recommender systems is called real-time bidding, but in a nutshell, its goal remains the same: to provide the best pick from a variety of options to maximize a specific reward). There is an agent, which is our algorithm, that “interacts” with an environment. For example, it consumes information about the user context (their recent history) and chooses which ad banner to show to maximize CTR, conversion rate (CVR), or effective cost per mille (eCPM). This is a typical reinforcement learning problem with a strong feedback loop influence: previously shown ads change future user behavior and give us information about user preferences.
The so-called “replay” algorithm is a simple (but not the most intuitive) way to build a simulation for real-time recommendations, derived from the paper “Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms” by Lihong Li et al. (). Here we summarize the procedure due to its practicality:
1. Gather a log of historical events, where each row contains the user context, the item that was actually shown, and the observed feedback (click, conversion, etc.).
2. Process the logged events one by one, in chronological order.
3. For each row, feed its context to the evaluated model.
4. Ask the model to recommend one item, just as it would in the real environment.
5. If the recommended item matches the item that was actually shown, append this row (with its observed reward) to a virtual dataset; otherwise, skip the row.
There is also an optimized version for steps 4 and 5: instead of one item (as in the real environment), we recommend three or five items for each row, and if the recommended item matches one of them, we append this row to the virtual dataset. This allows the model to learn more effectively while reducing stochasticity.
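A minimal sketch of the replay procedure might look like this (function and field names are our own, not from the paper):

```python
def replay_evaluate(logged_events, policy, top_k=1):
    """Replay a log of (context, shown item, reward) events against a policy.

    Only events where the policy's pick matches the logged item enter the
    virtual dataset; their rewards then estimate the metric of interest
    (unbiased under the paper's assumption of random logged exposure).
    """
    virtual_rewards = []
    for event in logged_events:
        recommended = policy(event["context"])[:top_k]
        if event["item"] in recommended:   # match -> keep the row
            virtual_rewards.append(event["reward"])
    if not virtual_rewards:
        return None, 0
    return sum(virtual_rewards) / len(virtual_rewards), len(virtual_rewards)

# Toy log: which banner was shown in which context, and whether it was clicked.
log = [
    {"context": "sports", "item": "banner_1", "reward": 1},
    {"context": "sports", "item": "banner_2", "reward": 0},
    {"context": "music",  "item": "banner_1", "reward": 0},
    {"context": "music",  "item": "banner_3", "reward": 1},
]

# A trivial policy that always recommends banner_1: it matches two logged
# events, so the estimated CTR is the mean of their rewards.
ctr, n_matched = replay_evaluate(log, lambda ctx: ["banner_1"])
print(ctr, n_matched)  # 0.5 2
```

Setting `top_k` to 3 or 5 implements the optimized version: more rows match, so the virtual dataset grows and the estimate becomes less noisy.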
A visual example is divided into several pieces in figures 12.1 to 12.5 that demonstrate the work of an unbiased estimator, with figure 12.1 displaying the real history of events and figures 12.2 to 12.5 showcasing simulated events based on historical data.
For the simulated events here, we go through the real history of events and only keep those events in which the output of the new model (chosen landing pages) equals the output of the old model. In figure 12.2, the simulated user behavior implies five transitions from banners to landing pages, with two successful purchases and $6 of predicted profit per click.
Featuring only four goods viewed by users, figure 12.3 suggests the lowest possible profit ($10) and profit per click ($2.5).
Figure 12.4 provides another example of simulated user behavior with a decent conversion rate (50%) but one of the lowest profit-per-click values (only $3) due to the low cost of purchases.
Finally, a simulated environment in figure 12.5 suggests the best possible CVR (60%) and profit per click ($7), while the overall profit is only $3 less than that of the stream of real events (see figure 12.1).
We eventually end up with a virtual dataset produced by the interaction of our model with a simulated data flow. Based on it, we can calculate all the listed metrics: CTR, CVR, profit per click, and total profit.
While not ideal, running different models through this procedure makes it possible to evaluate them from the product-metrics point of view, understand how these estimates of online metrics relate to prediction accuracy, and then share this information with a wider audience.
The final option worth mentioning for measuring results before deploying the system to production is validation by assessors, or human evaluation. If simulation is computationally expensive and slow, human evaluation can be even more so due to limited bandwidth, longer delays, human-induced ambiguity, and higher costs. However, in cases where a human-in-the-loop approach is applicable, it is usually the most precise and trustworthy way to test the model, second only to direct online testing on real data.
For instance, after building a new version of a search engine pipeline, we can either ask a group of experts to estimate the relevance of search results from 1 to 5 for some benchmark queries or choose which output is more relevant, comparing the old and new versions of search results. The second (comparative) formulation tends to produce a more robust evaluation as it excludes the subjectiveness of score estimates. In the case of generative large language models, human evaluation and hybrid approaches (auxiliary “critic” model, trained on human feedback) are the most popular options to measure the quality of generations.
While human evaluation has low throughput and high costs that may impede frequent use, its capacity to yield highly precise and trustworthy assessments (compared to fast but limited automated methods) cannot be downplayed.
Heavily used in various fields, including marketing, advertising, product design, and many others, A/B testing is a gold standard for causal inference that helps make data-driven decisions rather than relying solely on intuition. You might have a valid question: we have been discussing how to measure and report results, and now we have suddenly jumped on the causal inference boat; how so? In a sense, measuring results is a part of causal inference, as we want to be sure those results are caused by the factor we have influenced (in our case, an ML system). Of course, we can deploy our system in production and measure all metrics of interest as is, but we rarely do so, mainly for the following reasons: metrics fluctuate over time due to seasonality and other external factors, so a simple before/after comparison is confounded; rolling an untested change out to the entire audience is risky if the change turns out to be harmful; and without a control group, there is no baseline against which to attribute the observed difference to our change.
During A/B experiments, we split entities (in most cases, users) into two groups: a control group (A), which continues to interact with the current version of the system, and a treatment group (B), which is exposed to the change.
This allows us to isolate the effect of a change on key metrics while controlling external factors that may affect the results (see figure 12.6).
Every experiment starts with a hypothesis:
If we do [action A], it will help us achieve/solve [goal/problem P], which will be reflected in the form of an [X%] expected uplift in the [metrics M] based on [research/estimation R].
Let’s break down the variables mentioned in the hypothesis: action A is the change we introduce (a new model, feature, or design); goal/problem P is the business goal we pursue or the problem we solve; metrics M are the measurable indicators the change is supposed to move; X% is the expected effect size; and research/estimation R is the prior evidence (offline evaluation, simulation, or earlier experiments) that justifies this expectation.
Before running an A/B test, there are at least three hyperparameters we must define: the significance level α (the acceptable type I error rate), the statistical power 1 − β (where β is the type II error rate), and the minimum detectable change (MDC) we want to be able to capture.
Along with the variation in the data, this affects the number of samples (and therefore the time required to run the test, taking some seasonality into account as well) needed to run the test; all these factors combined result in the following equation:

n = 2(Zα + Zβ)² s² / MDC²
Where n is the required sample size per group, Zα is the Z-score corresponding to the desired significance level (for the type I error rate), Zβ is the Z-score corresponding to the desired power (for the type II error rate), s² is the estimated variance of the metric, and MDC is the minimum detectable change. As we mentioned before, the most frequent sample unit is the user, but it could also be a session, a transaction, a shop, and so on.
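In code, the sample-size equation can be implemented with the standard library alone (a sketch; we take Zα two-sided here):

```python
import math
from statistics import NormalDist

def required_sample_size(alpha, power, variance, mdc):
    """Required sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance (type I)
    z_beta = NormalDist().inv_cdf(power)            # power (1 - type II)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mdc ** 2)

# Detect a 0.1 shift of a unit-variance metric at alpha=0.05, power=0.8:
print(required_sample_size(0.05, 0.8, 1.0, 0.1))  # 1570
```

Note how quickly n grows as the MDC shrinks: halving the detectable change quadruples the required sample size.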
After formulating the hypothesis and our expectations, we choose a splitting strategy determined mainly by the nature of the system we deploy and how it will affect the existing processes.
Next, we decide on the statistical criteria and conduct simulation tests, both A/A (where we compare the same system against itself to check type I error rates) and A/B (where we simulate a desired effect through the addition of the noise of a desired magnitude), as a sanity check on historical data. This helps us confirm whether we meet the predefined type I and type II error levels. A comprehensive description of that is beyond the scope of this book, but if you are interested in more details, a great starting point would be Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing () by Ron Kohavi et al.
Finally, we have everything we need to launch our A/B test. Remember that the design of the experiment is fixed beforehand and cannot be changed on the go. Next we briefly outline the basic stages of A/B testing.
The question is, how do we split the data? The usual answer is “by user.” For instance, we have a control group of users to interact with the existing search engine and an experimental group to use the new version. However, things get more complex when we can’t split by users (e.g., there isn’t a consistent user ID), or splitting by users is entirely irrelevant to our service.
To zoom in, let’s examine hypothetical pricing systems applied in different domains: dynamic pricing for an offline retail chain and trip pricing for a scooter-sharing service.
If splitting data by individual users isn’t feasible, we can use higher-level entities as atomic units. For offline retail, we could use entire stores; for scooters, trips could be divided based on parking slots, neighborhoods, cities, or even regions. However, this strategy might shift data distribution and create unequal sample sizes or lack of representation, which should be considered when selecting an appropriate statistical test.
When splitting by a nontypical key, aiming for group (bucket) similarity is essential. For instance, if you’re compelled to divide geographically, the chosen areas should be similar and possess comparable economic indicators (they should match each other).
Finally, there is a more advanced splitting strategy called switchback testing. This technique divides data into region-time buckets and randomly and continuously switches between different models (see figure 12.7). It ensures that each region will be in both the control group and the testing group for an approximately equal amount of time. For further details, see “Design and Analysis of Switchback Experiments” by Iavor Bojinov et al. ().
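A deterministic assignment of region-time buckets to variants can be sketched with a salted hash (the names and the salt below are our own illustration, not a prescribed scheme):

```python
import hashlib

def switchback_variant(region, window_start, salt="experiment-42"):
    """Assign a (region, time window) bucket to variant A or B.

    Hashing makes the assignment look random but stay reproducible, so every
    event inside the same bucket is served by the same model, and changing
    the salt reshuffles buckets for the next experiment.
    """
    key = f"{salt}:{region}:{window_start}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return "B" if digest % 2 else "A"

# The same bucket always maps to the same variant...
assert switchback_variant("berlin", "2024-06-01T10:00") == \
       switchback_variant("berlin", "2024-06-01T10:00")

# ...while over many time windows each region spends time in both groups.
variants = {switchback_variant("berlin", f"window-{i}") for i in range(200)}
print(variants)
```

In production, the time windows need to be long enough that carryover effects from the previous model have faded before metrics are attributed to the current one.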
When designing an A/B test, selecting metrics is a crucial step that can be a deciding factor in the test’s success or failure. We prefer trustworthy, sensitive, interpretable metrics with a low feedback delay for online testing.
Every experiment usually has three kinds of metrics: key metrics, control metrics, and auxiliary metrics.
All three groups play an important role in the design and execution of an A/B testing experiment. Key metrics determine the success of the experiment, control metrics ensure the validity of the experiment, and auxiliary metrics provide additional information to support the results and inform future experiments.
The third essential ingredient in any A/B experiment is a statistical test that makes the final decision of whether a significant effect is captured.
In short, statistical criteria are used to quantify the probability of obtaining specific results under specific assumptions. For example, the probability of obtaining a t-statistic equal to or greater than 3 will generally be less than 0.01 (given that there is no difference between the groups and the t-test’s assumptions are met).
The most common statistical test used in A/B testing is the t-test we just mentioned. It has many modifications. For example, Welch’s t-test is relatively easy to interpret and is similar to Student’s t-test, but it can be used even when the variances of the two groups are not equal. The statistic for Welch’s t-test is calculated as follows:

t = (mean₁ − mean₂) / sqrt(s₁²/n₁ + s₂²/n₂)

where meanᵢ, sᵢ², and nᵢ are the sample mean, sample variance, and size of group i.
The null hypothesis (the default assumption being tested) for Welch’s t-test implies that there is no difference between the means of the two groups; under it, the captured difference between the two groups appeared by accident and was caused by noise. The alternative hypothesis is that the captured difference is not caused by noise and is statistically significant.
By statistical significance, we mean that the probability of capturing a statistic value as extreme as (or more extreme than) the observed one under the assumption that the null hypothesis is true (the p-value) is less than the significance level α (effectively, the type I error rate; typically, α = 0.05). The p-value, along with the significance level, is a way to standardize the results of any statistical test and map them to a single reference point rather than defining critical values for each test individually.
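A bare-bones implementation of the Welch statistic and its degrees of freedom follows (in practice you would reach for `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns the p-value):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a), variance(b)   # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
print(t, df)  # -9.0 8.0
```

A t-statistic of −9 with 8 degrees of freedom lies far in the tail, so the corresponding p-value would be well below any conventional significance level.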
While there are many different statistical tests available, the choice of a particular test depends on the context of the experiment and the nature of the data. That said, providing a detailed guide on choosing the most appropriate test is beyond the scope of this book. For a more detailed discussion on selecting the right statistical test for A/B testing, we once again recommend Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing () by Ron Kohavi et al.
The t-test is widely used because it is a robust test that can handle a variety of situations, it is easy to implement, and it provides results that are easy to interpret. However, the choice of a specific test should always be made considering the type of data you are dealing with, the nature of your experiment, and the assumptions each test requires.
We can run a simulation to check whether we have succeeded in designing an experiment. We replicate the whole pipeline multiple times on a different set of samples and periods.
A simulated A/A test involves randomly sampling two groups with no induced difference and applying the chosen test statistic. We repeat these actions many times (say, 1,000 or 10,000 times) and expect the p-value to follow a uniform distribution (if everything is correct). For instance, in roughly 5% of simulations, the test will reject the null hypothesis for α = 0.05. The exact percentage of cases in which the test claims a difference between the two (identical) groups is the simulated type I error rate.
A simulated A/B test does a similar thing, but now we add an uplift of a specific size (usually MDE of interest) to the second group B (we assume that groups A and B are the same size as they will be during the real A/B test). Again, we apply our test statistic and run it 1,000 to 10,000 times. After that, we reexamine the distribution of resulting p-values and count how many times we rejected the null hypothesis (“the p-value passed the threshold”). The percentage of cases when the chosen test rejects the null hypothesis is an estimation of sensitivity (1 – β), which is the probability that the test catches the difference if there is one. If we subtract sensitivity from 1, we get a type II error rate (the probability of ignoring the difference between A and B when there is one). If calculated type I and II error rates fit predefined levels, then the statistical test is picked properly, and the sample size is estimated correctly (see figure 12.8).
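Both simulations can be sketched in one function (a z-test on synthetic unit-variance data, purely for illustration; a real pipeline would resample from historical data and use the actual test statistic):

```python
import random
from statistics import NormalDist

def rejection_rate(n, effect, sims=2000, alpha=0.05, seed=7):
    """Share of simulations in which a two-sided z-test rejects H0.

    effect=0 simulates an A/A test (the result estimates the type I error
    rate); effect>0 simulates an A/B test (the result estimates power).
    """
    rng = random.Random(seed)
    se = (2 / n) ** 0.5   # standard error for two unit-variance groups
    rejections = 0
    for _ in range(sims):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
        mean_b = sum(rng.gauss(effect, 1) for _ in range(n)) / n
        z = (mean_b - mean_a) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        rejections += p < alpha
    return rejections / sims

type_1 = rejection_rate(n=100, effect=0.0)   # A/A: expect about 0.05
power = rejection_rate(n=100, effect=0.4)    # A/B: expect about 0.8
print(type_1, power)
```

If the simulated type I error rate is far from α, or the simulated power is far below 1 − β, the experiment design (test choice or sample size) needs revisiting before launch.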
There are situations where running classic A/B tests is not feasible or desirable—either due to legal restrictions (certain industries, such as healthcare and finance, have strict regulations that prohibit conducting experiments on customers), logistical limitations (it may be difficult to randomly divide the customer base into two groups for testing), or other reasons.
There are different ways to tackle this problem, but since this is a narrow topic that is out of the scope of this book, we will not cover it in detail. For your convenience, however, we list them here so you can use them as keywords for a more in-depth look:
Monitoring an ongoing experiment is important for ensuring it runs smoothly and produces reliable results. If something goes wrong during the experiment, it is critical to promptly identify and address the problem to avoid a growing negative effect. Sometimes it is possible to exclude specific users, items, or segments from an experiment and move on. However, this option is not recommended if it was not considered in the initial design. In other cases, we prefer to terminate an experiment entirely and start investigating factors that have caused the failure.
In section 12.2.1, we mentioned control and auxiliary metrics that we should monitor during the test. They can hint if something goes wrong before the key metrics experience a significant drop. For instance, it is important to track user feedback—if you notice a significant drop in user engagement or a spike in user churn, it is a clear indicator of malfunction. In this case, it may indicate that the experimental group is not responding well to the pushed changes. This information can easily lead to an early termination.
Also, when conducting an A/B test, we should consider fairness and biases among different segments of users, which can affect target and auxiliary metrics. This analysis is similar to what we covered in depth in chapter 9. For further reading, we recommend “Fairness through Experimentation: Inequality in A/B Testing as an Approach to Responsible Design” by Guillaume Saint-Jacques et al. ().
The most valuable measurement to track during an A/B experiment is the uplift or the relative difference in the key metric between group A and group B. The longer the experiment runs, the greater the sensitivity of the test and the smaller the effect (either positive or negative) we can detect.
Figure 12.9 shows a funnel representing the range of effects that we cannot detect with statistical significance. If the uplift moves outside this funnel, the effect is significant. Please note that without a specific design, you cannot peek into the test as many times as you wish until you see the desired results. For further reading on the subject, we recommend “Peeking Problem—The Fatal Mistake in A/B Testing and Experimentation” by Oleg Ya () and “Choosing Sequential Testing Framework—Comparisons and Discussions” by Mårten Schultzberg et al. ().
This funnel can be calculated based on the MDC. Here, we solve the same equation we saw earlier, this time for different time periods, but we cannot make a decision until the period we calculated in advance has passed. You could ask: what’s the point of using it then? The answer is to have an emergency stop criterion to abort the experiment if results fall outside the funnel on the negative side, the least expected, desirable, and probable outcome. With a proper sequential testing framework (e.g., mSPRT) mentioned earlier, we can make a decision as soon as results move outside the funnel or the experiment time has ended.
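Inverting the sample-size equation gives the boundary of this funnel: the smallest change detectable once n samples per group have accumulated (same notation and assumptions as before):

```python
from statistics import NormalDist

def minimum_detectable_change(n, variance=1.0, alpha=0.05, power=0.8):
    """Smallest detectable change given n samples per group (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * (2 * variance / n) ** 0.5

# The funnel narrows as the experiment accumulates samples:
for n in [100, 400, 1570]:
    print(n, round(minimum_detectable_change(n), 3))
```

Plotting this value against elapsed time (via the expected sample accumulation rate) reproduces the funnel shape of figure 12.9.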
It might be tempting to prematurely stop or extend an A/B test based on interim results. However, this can lead to erroneous conclusions due to changes in statistical power and the risk of false positives or negatives if we have not incorporated sequential testing. So when should we stop an experiment if we are not doing sequential testing?
In general, an experiment should run according to the predetermined design unless there’s a catastrophic drawdown in key metrics (a significant negative effect) that would result in a meaningful financial loss had the experiment continued.
On the other hand, if you see positive effects in key metrics, the best practice is to stick with the initial plan and let the experiment run its full course. It allows you to confirm that these positive effects are sustained over time and are not a result of short-term fluctuations.
What about those borderline cases where some metrics are positive, some are negative, and others hover around the MDE? It’s crucial during the experimental design phase to decide how you will interpret these mixed outcomes and set priorities among the different key metrics.
If you can’t wait to see the final results, consider using methods like sequential testing, which allows for repeated significance testing at different stages of the experiment. These methods come with their own assumptions and considerations, so make sure you understand them fully before proceeding.
Deciding whether to stop or continue an experiment is a complex task that requires careful consideration of the metrics, potential outcomes, and their effects. Document these decisions in your experimental protocol, including criteria for possible early termination.
After finishing the experiment, it is time to start our analysis. Here we need to calculate the required metrics, double-check monitoring, and provide confidence intervals for every measurement.
Not every experiment is successful. In our experience, having 20% of A/B tests show a statistically significant difference is already a good result (recall that with a false positive rate of 5%, the net margin of true wins is around 15%). And remember: a statistically significant result doesn’t mean the effect is positive or large.
Suppose we decide that the A/B test is successful and the effect is significant. In that case, we have two options: run one more iteration of the experiment on a larger proportion of traffic to validate the effect, or switch to the new system completely and roll it out to 100% of users.
We have provided the core information of the report (of course, the reported metrics should be understandable to the audience, preferably expressed in earned or saved money, longer user sessions, etc.). In addition, it is good to dive deeper into the change in metrics, analyze how it affected different user or item segments, and outline the further steps (how overall metrics might change if we roll out to 100%). Table 12.1 shows examples of what fields can be included in the reporting table.
| Metric | Group A | Group B | MDE | Lift | p-value | Conclusion |
|---|---|---|---|---|---|---|
| CVR | 75.2% | 79.8% | 5% | 6.12% | 0.0472 | +4.2–6.9% (significant) |
| AOV | $232.2 | $242.8 | 11% | 4.57% | 0.3704 | No significant effect |
| … | … | … | … | … | … | … |
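The lift column in table 12.1 is just the relative difference between groups; as a sanity check, these two lines reproduce the table’s lift values:

```python
def lift(a, b):
    """Relative uplift of group B over group A."""
    return (b - a) / a

print(f"CVR lift: {lift(0.752, 0.798):.2%}")   # 6.12%
print(f"AOV lift: {lift(232.2, 242.8):.2%}")   # 4.57%
```

The p-value and confidence interval columns additionally require the group sizes, which is why sample sizes are usually reported alongside the table.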
Once uplift is reported, the positive scenario may include one more experiment on a larger proportion of traffic, a complete switch to the new system, or a full-scale rollout followed by a pause and a reversed A/B test. A series of successful A/B experiments provides a solid data-driven argument for deciding whether to give the green light to further steps.
Writing a debrief document is valuable for transparent communication of A/B experiment results, as well as for improving the quality of future experiments. It should be created during or immediately after the experiment to summarize key findings, including captured insights, detected problems, and recommendations. Sharing this document with the team ensures everyone is on the same page and enables continuous learning and improvement of the system.
If the experiment is successful, the debrief document can include suggestions for similar experiments in other products. In case of failure, it is important to discuss what should be done differently in future tests and what mistakes should be avoided, as well as to develop new control metrics that would catch similar failures earlier in future experiments.
Because reporting cannot be designed at the preproduction stage, when the design document is prepared, this section would normally be skipped by default. However, we’ve included it for demonstration purposes.
Measuring and reporting for Supermegaretail as part of the design document have their own peculiarities. A crucial one is that, since we are predicting the future, we can only assess the quality of the system once the future arrives. At the end of the day, we want to understand how much profit we will be able to generate, but we can only know this after the fact; until then, the only way to evaluate the system is through metrics that are not directly related to profit.
We’ve included this phase as a template for the PhotoStock Inc. case.