In the preceding chapters, we discussed the building blocks that form the backbone of a machine learning (ML) system, starting with data collection and model selection, continuing with metrics, losses, and validation split, and ending with a comprehensive error analysis. With all these elements firmly established in the pipeline, it’s time to circle back to the initial purpose of the system and think about proper reporting of the achieved results.
Reporting consists of evaluating our ML system’s performance based on its final goal and sharing the results with teammates and stakeholders. In chapter 5, we introduced two types of metrics: online and offline metrics, which generate two types of evaluation: offline testing and online testing. While offline testing is relatively straightforward, online testing implies running experiments in real-world scenarios. Typically, the most effective approach involves a series of A/B tests, a crucial procedure in developing an efficient, properly working model, which we cover in section 12.2. This helps capture metrics that either directly match or are highly correlated with our business goals.
Measuring results in offline testing (sometimes called backtesting) is an attempt to assess what effect we can expect to catch during the online testing stage, either directly or through proxy metrics. Online testing, however, is often a trickier story. For example, recommender systems rely heavily on feedback loops, so the training data depends on what we predicted at the previous timestamps. Moreover, we don’t know exactly what would have happened if we had shown item X or Y to user A instead of item Z, which appeared in historical data at timestamp T.
Finally, when the experiment is done, we are ready to report the effect of the model on the business metrics we are interested in. What do we report? How do we present the result? What conclusion should we make about the further steps of system elaboration? In this chapter, we answer questions that might arise during the process.
Before implementing an ML system, we should fully understand what goal we aim to achieve—that’s why our design document should begin with the goal description. Likewise, before running an experiment to check how a change affects our system (in this chapter, we mostly narrow our focus to A/B tests), we design it and outline a hypothesis covering how we expect the given change to affect metrics. This is where offline testing comes into play, as using offline evaluation as a proxy for online evaluation is a valid and beneficial approach. Its objective is to swiftly determine whether the new (modified) solution is better than the existing one and, if so, to quantify by what margin.
As we already mentioned, metrics are interconnected within a hierarchical structure. There are proxy metrics for other proxy metrics for metrics of our interest (please see chapter 5). Consequently, there are plenty of ways to conduct the offline evaluation.
The first layer of evaluation usually involves a basic estimation of ML metrics, which we introduced in chapter 7. We assume that alongside offline metrics (common metrics like root mean square error, mean absolute percentage error, or normalized discounted cumulative gain), we also improve business metrics (average revenue per user or click-through rate [CTR]).
Prediction evaluation is the fastest way to check the model’s quality and iterate through different versions, but it is also usually the farthest from business metrics (the relationship between prediction quality and online metrics is often far from ideal).
The offline evaluation procedure is trustworthy and valuable for assessing quality. However, there is rarely a direct connection between offline and online metrics, and learning how an increase in offline metrics A, B, and C affects online metrics X, Y, and Z is a regression problem by itself. To bridge the gap between offline and online metrics and make offline testing more robust, we could gather a history of online tests and online metrics and related offline metrics to calculate the correlation between offline and online results.
In some scenarios, we can transition from model predictions to real-world business metrics. To illustrate this, let’s consider the example of forecasting airline ticket sales. Often, the fundamental model performance metrics we use for forecasting are the weighted average percentage error (WAPE) and bias:
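For reference, the standard definitions are (with $y_t$ the actual and $\hat{y}_t$ the forecasted value for period $t$):

```latex
\mathrm{WAPE} = \frac{\sum_t |y_t - \hat{y}_t|}{\sum_t y_t},
\qquad
\mathrm{bias} = \frac{\sum_t (\hat{y}_t - y_t)}{\sum_t y_t}
```

WAPE penalizes errors in both directions, while bias tells us whether the model systematically under- or overpredicts.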
Suppose we have two models predicting ticket sales for a specific flight at a certain hour, one with a bias of −10% and the other with +10%. For illustration, a +10% bias means we forecast 110 tickets sold for a given period of time while the actual number of tickets sold was 100. Also, let’s assume we have the actual ticket prices for each day.
Our goal is to avoid overbooking (missed revenue due to offering too many discounted tickets) and minimize unsold seats (lost revenue). Assume the number of tickets sold reflects the passengers’ actual demand and that we adjust prices at the beginning of each day based on our forecast. We are okay with this rough estimation for the sake of illustration.
The actual number of tickets sold during the 4 days is [120, 90, 110, 80]. The predicted ticket sales are [108, 81, 99, 72] for model A (10% below actual) and [132, 99, 121, 88] for model B (10% above actual).
The model biases are (360 − 400) / 400 = −10% for model A and (440 − 400) / 400 = +10% for model B.
The model WAPEs are identical: 40 / 400 = 10% for both models.
Both models have the same WAPE and bias, but model A has a negative bias of 10% (and tends to underpredict), while model B has a positive bias of 10% (and tends to overpredict).
The ticket prices for each day are [$200, $220, $230, $210].
The total revenue (we choose the minimum of sold and forecasted tickets for each day) is approximately $77,310 for model A and $85,900 for model B.
As you can see, model B generates roughly $8,600 more revenue for a single flight within just 4 days (while the flight’s total cost remains unchanged). When multiplied by hundreds of flights and days, this difference will result in a substantial revenue gain.
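The arithmetic behind this example can be checked with a short script (the ±10% forecasts below follow the bias setup described above):

```python
actual = [120, 90, 110, 80]      # tickets actually sold per day
prices = [200, 220, 230, 210]    # ticket price per day, $

pred_a = [round(y * 0.9) for y in actual]   # model A underpredicts by 10%
pred_b = [round(y * 1.1) for y in actual]   # model B overpredicts by 10%

def wape(pred, actual):
    return sum(abs(p - y) for p, y in zip(pred, actual)) / sum(actual)

def bias(pred, actual):
    return (sum(pred) - sum(actual)) / sum(actual)

def revenue(pred, actual, prices):
    # We can only sell as many tickets as we released based on the forecast,
    # and no more than the actual demand.
    return sum(min(p, y) * price for p, y, price in zip(pred, actual, prices))

print(wape(pred_a, actual), bias(pred_a, actual))   # 0.1 -0.1
print(wape(pred_b, actual), bias(pred_b, actual))   # 0.1 0.1
print(revenue(pred_b, actual, prices) - revenue(pred_a, actual, prices))  # 8590
```

The exact difference is $8,590, which rounds to the roughly $8,600 quoted above.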
Certainly, we’ve oversimplified the real-world context where the model makes forecasts and affects decision-making (e.g., how many tickets to release for sale, how to adjust prices dynamically). Here, the whole transition is done by multiplying by price and, optionally, subtracting flight costs.
Nonetheless, this example illustrates how we can transition from model performance metrics, which may not fully capture the practical effect of predictions, to business indicators like revenue. These business metrics can be reported to the team before running A/B tests.
When the stakes are high, it is reasonable to invest in a more sophisticated, computationally expensive, but more robust offline testing procedure. Building a simulator of the online environment for the problem at hand and running an algorithm against this simulator can become such an investment.
A good example of this kind of environment would be time-series forecasting, which is relatively simple to simulate thanks to the availability of labels, auxiliary information (e.g., historical costs or turnover of goods in stock), and the lack of ambiguity in the outcome. That is usually not the case for online recommendation systems, such as news, ads, or product recommendations, where we deal with partly labeled data: we know only the impressions of content items (views, clicks, and conversions) that we showed to a particular user, while the items we did not show remain without user feedback. There is always a chance that, had we produced an alternative order of impressions, everything might have been very different.
Let’s review how such an environment could be created in the case of a recommender system.
Sometimes online recommendation systems are solved via the multi-armed bandit and contextual bandit approaches (usually, this subset of recommender systems is called real-time bidding, but in a nutshell, its goal remains the same: to provide the best pick from a variety of options to maximize a specific reward). There is an agent, which is our algorithm, that “interacts” with an environment. For example, it consumes information about the user context (their recent history) and chooses which ad banner to show to maximize CTR, conversion rate (CVR), or effective cost per mille (eCPM). This is a typical reinforcement learning problem with a strong feedback loop influence: previously shown ads change future user behavior and give us information about user preferences.
The so-called “replay” algorithm is a simple (but not the most intuitive) way to build a simulation for real-time recommendations, derived from the paper “Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms” by Lihong Li et al. (). Here we summarize the procedure due to its practicality:
1. Gather a log of historical events, where each row contains the user context, the item that was actually shown, and the observed feedback (click, conversion, etc.).
2. Process the logged events one by one, in chronological order.
3. For each row, feed its context to the evaluated model.
4. Ask the model to recommend one item, just as it would in the real environment.
5. If the recommended item matches the item that was actually shown, append this row (with its observed reward) to a virtual dataset; otherwise, skip the row.
There is also an optimized version for steps 4 and 5: instead of one item (as in the real environment), we recommend three or five items for each row, and if the recommended item matches one of them, we append this row to the virtual dataset. This allows the model to learn more effectively while reducing stochasticity.
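A minimal sketch of the replay procedure might look like this (function and field names are our own, not from the paper):

```python
def replay_evaluate(logged_events, policy, top_k=1):
    """Replay a log of (context, shown item, reward) events against a policy.

    Only events where the policy's pick matches the logged item enter the
    virtual dataset; their rewards then estimate the metric of interest
    (unbiased under the paper's assumption of random logged exposure).
    """
    virtual_rewards = []
    for event in logged_events:
        recommended = policy(event["context"])[:top_k]
        if event["item"] in recommended:   # match -> keep the row
            virtual_rewards.append(event["reward"])
    if not virtual_rewards:
        return None, 0
    return sum(virtual_rewards) / len(virtual_rewards), len(virtual_rewards)

# Toy log: which banner was shown in which context, and whether it was clicked.
log = [
    {"context": "sports", "item": "banner_1", "reward": 1},
    {"context": "sports", "item": "banner_2", "reward": 0},
    {"context": "music",  "item": "banner_1", "reward": 0},
    {"context": "music",  "item": "banner_3", "reward": 1},
]

# A trivial policy that always recommends banner_1: it matches two logged
# events, so the estimated CTR is the mean of their rewards.
ctr, n_matched = replay_evaluate(log, lambda ctx: ["banner_1"])
print(ctr, n_matched)  # 0.5 2
```

Setting `top_k` to 3 or 5 implements the optimized version: more rows match, so the virtual dataset grows and the estimate becomes less noisy.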
A visual example is divided into several pieces in figures 12.1 to 12.5 that demonstrate the work of an unbiased estimator, with figure 12.1 displaying the real history of events and figures 12.2 to 12.5 showcasing simulated events based on historical data.
For the simulated events here, we go through the real history of events and only keep those events in which the output of the new model (chosen landing pages) equals the output of the old model. In figure 12.2, the simulated user behavior implies five transitions from banners to landing pages, with two successful purchases and $6 of predicted profit per click.
Featuring only four goods viewed by users, figure 12.3 suggests the lowest possible profit ($10) and profit per click ($2.5).
Figure 12.4 provides another example of simulated user behavior with a decent conversion rate (50%) but one of the lowest profit-per-click values (only $3) due to the low cost of purchases.
Finally, a simulated environment in figure 12.5 suggests the best possible CVR (60%) and profit per click ($7), while the overall profit is only $3 less than that of the stream of real events (see figure 12.1).
We eventually end up with a virtual dataset produced by the interaction of our model with a simulated data flow. Based on it, we can calculate all the listed metrics: CTR, CVR, profit per click, and total profit.
While not ideal, running different models through this procedure makes it possible to evaluate them from the product-metrics point of view, understand how these estimates of online metrics relate to prediction accuracy, and then share this information with a wider audience.
The final option worth mentioning for measuring results before deploying the system to production is validation by assessors, or human evaluation. If simulation is computationally expensive and slow, human evaluation can be even more so due to limited bandwidth, longer delays, human-induced ambiguity, and higher costs. However, in cases where a human-in-the-loop approach is applicable, it is usually the most precise and trustworthy way to test the model, second only to direct online testing on real data.
For instance, after building a new version of a search engine pipeline, we can either ask a group of experts to estimate the relevance of search results from 1 to 5 for some benchmark queries or choose which output is more relevant, comparing the old and new versions of search results. The second (comparative) formulation tends to produce a more robust evaluation as it excludes the subjectiveness of score estimates. In the case of generative large language models, human evaluation and hybrid approaches (auxiliary “critic” model, trained on human feedback) are the most popular options to measure the quality of generations.
While human evaluation has low throughput and high costs that may impede frequent use, its capacity to yield highly precise and trustworthy assessments (compared to fast but limited automated methods) cannot be downplayed.
Heavily used in various fields, including marketing, advertising, product design, and many others, A/B testing is a gold standard for causal inference that helps make data-driven decisions rather than relying solely on intuition. You might have a valid question: we have been discussing how to measure and report results, and now we have suddenly jumped on the causal inference boat; how so? In a sense, measuring results is a part of causal inference, as we want to be sure those results are caused by the factor we have influenced (in our case, an ML system). Of course, we can deploy our system in production and measure all metrics of interest as is, but we rarely do so, mainly for the following reasons: metrics fluctuate over time due to seasonality and other external factors, so a simple before/after comparison is confounded; rolling an untested change out to the entire audience is risky if the change turns out to be harmful; and without a control group, there is no baseline against which to attribute the observed difference to our change.
During A/B experiments, we split entities (in most cases, users) into two groups: a control group (A), which continues to interact with the current version of the system, and a treatment group (B), which is exposed to the change.
This allows us to isolate the effect of a change on key metrics while controlling external factors that may affect the results (see figure 12.6).
Every experiment starts with a hypothesis:
If we do [action A], it will help us achieve/solve [goal/problem P], which will be reflected in the form of an [X%] expected uplift in the [metrics M] based on [research/estimation R].
Let’s break down the variables mentioned in the hypothesis: action A is the change we introduce (a new model, feature, or design); goal/problem P is the business goal we pursue or the problem we solve; metrics M are the measurable indicators the change is supposed to move; X% is the expected effect size; and research/estimation R is the prior evidence (offline evaluation, simulation, or earlier experiments) that justifies this expectation.
Before running an A/B test, there are at least three hyperparameters we must define: the significance level α (the acceptable type I error rate), the statistical power 1 − β (where β is the type II error rate), and the minimum detectable change (MDC) we want to be able to capture.
Along with the variation in the data, this affects the number of samples (and therefore the time required to run the test, taking some seasonality into account as well) needed to run the test; all these factors combined result in the following equation:

n = 2(Zα + Zβ)² s² / MDC²
Where n is the required sample size per group, Zα is the Z-score corresponding to the desired significance level (for the type I error rate), Zβ is the Z-score corresponding to the desired power (for the type II error rate), s² is the estimated variance of the metric, and MDC is the minimum detectable change. As we mentioned before, the most frequent sample unit is the user, but it could also be a session, a transaction, a shop, and so on.
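In code, the sample-size equation can be implemented with the standard library alone (a sketch; we take Zα two-sided here):

```python
import math
from statistics import NormalDist

def required_sample_size(alpha, power, variance, mdc):
    """Required sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance (type I)
    z_beta = NormalDist().inv_cdf(power)            # power (1 - type II)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mdc ** 2)

# Detect a 0.1 shift of a unit-variance metric at alpha=0.05, power=0.8:
print(required_sample_size(0.05, 0.8, 1.0, 0.1))  # 1570
```

Note how quickly n grows as the MDC shrinks: halving the detectable change quadruples the required sample size.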
After formulating the hypothesis and our expectations, we choose a splitting strategy determined mainly by the nature of the system we deploy and how it will affect the existing processes.
Next, we decide on the statistical criteria and conduct simulation tests, both A/A (where we compare the same system against itself to check type I error rates) and A/B (where we simulate a desired effect through the addition of the noise of a desired magnitude), as a sanity check on historical data. This helps us confirm whether we meet the predefined type I and type II error levels. A comprehensive description of that is beyond the scope of this book, but if you are interested in more details, a great starting point would be Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing () by Ron Kohavi et al.
Finally, we have everything we need to launch our A/B test. Remember that the design of the experiment is fixed beforehand and cannot be changed on the go. Next we briefly outline the basic stages of A/B testing.
The question is, how do we split the data? The usual answer is “by user.” For instance, we have a control group of users to interact with the existing search engine and an experimental group to use the new version. However, things get more complex when we can’t split by users (e.g., there isn’t a consistent user ID), or splitting by users is entirely irrelevant to our service.
To zoom in, let’s examine hypothetical pricing systems applied in different domains: dynamic pricing for an offline retail chain and trip pricing for a scooter-sharing service.
If splitting data by individual users isn’t feasible, we can use higher-level entities as atomic units. For offline retail, we could use entire stores; for scooters, trips could be divided based on parking slots, neighborhoods, cities, or even regions. However, this strategy might shift data distribution and create unequal sample sizes or lack of representation, which should be considered when selecting an appropriate statistical test.
When splitting by a nontypical key, aiming for group (bucket) similarity is essential. For instance, if you’re compelled to divide geographically, the chosen areas should be similar and possess comparable economic indicators (they should match each other).
Finally, there is a more advanced splitting strategy called switchback testing. This technique divides data into region-time buckets and randomly and continuously switches between different models (see figure 12.7). It ensures that each region will be in both the control group and the testing group for an approximately equal amount of time. For further details, see “Design and Analysis of Switchback Experiments” by Iavor Bojinov et al. ().
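A deterministic assignment of region-time buckets to variants can be sketched with a salted hash (the names and the salt below are our own illustration, not a prescribed scheme):

```python
import hashlib

def switchback_variant(region, window_start, salt="experiment-42"):
    """Assign a (region, time window) bucket to variant A or B.

    Hashing makes the assignment look random but stay reproducible, so every
    event inside the same bucket is served by the same model, and changing
    the salt reshuffles buckets for the next experiment.
    """
    key = f"{salt}:{region}:{window_start}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return "B" if digest % 2 else "A"

# The same bucket always maps to the same variant...
assert switchback_variant("berlin", "2024-06-01T10:00") == \
       switchback_variant("berlin", "2024-06-01T10:00")

# ...while over many time windows each region spends time in both groups.
variants = {switchback_variant("berlin", f"window-{i}") for i in range(200)}
print(variants)
```

In production, the time windows need to be long enough that carryover effects from the previous model have faded before metrics are attributed to the current one.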
When designing an A/B test, selecting metrics is a crucial step that can be a deciding factor in the test’s success or failure. We prefer trustworthy, sensitive, interpretable metrics with a low feedback delay for online testing.
Every experiment usually has three kinds of metrics: key metrics, control metrics, and auxiliary metrics.
All three groups play an important role in the design and execution of an A/B testing experiment. Key metrics determine the success of the experiment, control metrics ensure the validity of the experiment, and auxiliary metrics provide additional information to support the results and inform future experiments.
The third essential ingredient in any A/B experiment is a statistical test that makes the final decision of whether a significant effect is captured.
In short, statistical criteria are used to quantify the probability of obtaining specific results under specific assumptions. For example, the probability of obtaining a t-statistic equal to or greater than 3 will generally be less than 0.01 (given that there is no difference between the groups and the t-test’s assumptions are met).
The most common statistical test used in A/B testing is the t-test we just mentioned. It has many modifications. For example, Welch’s t-test is relatively easy to interpret and is similar to Student’s t-test, but it can be used even when the variances of the two groups are not equal. The statistic for Welch’s t-test is calculated as follows:

t = (mean₁ − mean₂) / sqrt(s₁²/n₁ + s₂²/n₂)

where meanᵢ, sᵢ², and nᵢ are the sample mean, sample variance, and size of group i.
The null hypothesis (the default assumption being tested) for Welch’s t-test implies that there is no difference between the means of the two groups; under it, the captured difference between the two groups appeared by accident and was caused by noise. The alternative hypothesis is that the captured difference is not caused by noise and is statistically significant.
By statistical significance, we mean that the probability of capturing a statistic value as extreme as (or more extreme than) the observed one under the assumption that the null hypothesis is true (the p-value) is less than the significance level α (effectively, the type I error rate; typically, α = 0.05). The p-value, along with the significance level, is a way to standardize the results of any statistical test and map them to a single reference point rather than defining critical values for each test individually.
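A bare-bones implementation of the Welch statistic and its degrees of freedom follows (in practice you would reach for `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns the p-value):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a), variance(b)   # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
print(t, df)  # -9.0 8.0
```

A t-statistic of −9 with 8 degrees of freedom lies far in the tail, so the corresponding p-value would be well below any conventional significance level.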
While there are many different statistical tests available, the choice of a particular test depends on the context of the experiment and the nature of the data. That said, providing a detailed guide on choosing the most appropriate test is beyond the scope of this book. For a more detailed discussion on selecting the right statistical test for A/B testing, we once again recommend Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing () by Ron Kohavi et al.
The t-test is widely used because it is a robust test that can handle a variety of situations, it is easy to implement, and it provides results that are easy to interpret. However, the choice of a specific test should always be made considering the type of data you are dealing with, the nature of your experiment, and the assumptions each test requires.
We can run a simulation to check whether we have succeeded in designing an experiment. We replicate the whole pipeline multiple times on a different set of samples and periods.
A simulated A/A test involves randomly sampling two groups with no induced difference and applying the chosen test statistic. We repeat these actions many times (say, 1,000 or 10,000 times) and expect the p-value to follow a uniform distribution (if everything is correct). For instance, in roughly 5% of simulations, the test will reject the null hypothesis for α = 0.05. The exact percentage of cases in which the test claims a difference between the two (identical) groups is the simulated type I error rate.
A simulated A/B test does a similar thing, but now we add an uplift of a specific size (usually MDE of interest) to the second group B (we assume that groups A and B are the same size as they will be during the real A/B test). Again, we apply our test statistic and run it 1,000 to 10,000 times. After that, we reexamine the distribution of resulting p-values and count how many times we rejected the null hypothesis (“the p-value passed the threshold”). The percentage of cases when the chosen test rejects the null hypothesis is an estimation of sensitivity (1 – β), which is the probability that the test catches the difference if there is one. If we subtract sensitivity from 1, we get a type II error rate (the probability of ignoring the difference between A and B when there is one). If calculated type I and II error rates fit predefined levels, then the statistical test is picked properly, and the sample size is estimated correctly (see figure 12.8).
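Both simulations can be sketched in one function (a z-test on synthetic unit-variance data, purely for illustration; a real pipeline would resample from historical data and use the actual test statistic):

```python
import random
from statistics import NormalDist

def rejection_rate(n, effect, sims=2000, alpha=0.05, seed=7):
    """Share of simulations in which a two-sided z-test rejects H0.

    effect=0 simulates an A/A test (the result estimates the type I error
    rate); effect>0 simulates an A/B test (the result estimates power).
    """
    rng = random.Random(seed)
    se = (2 / n) ** 0.5   # standard error for two unit-variance groups
    rejections = 0
    for _ in range(sims):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
        mean_b = sum(rng.gauss(effect, 1) for _ in range(n)) / n
        z = (mean_b - mean_a) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        rejections += p < alpha
    return rejections / sims

type_1 = rejection_rate(n=100, effect=0.0)   # A/A: expect about 0.05
power = rejection_rate(n=100, effect=0.4)    # A/B: expect about 0.8
print(type_1, power)
```

If the simulated type I error rate is far from α, or the simulated power is far below 1 − β, the experiment design (test choice or sample size) needs revisiting before launch.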
There are situations where running classic A/B tests is not feasible or desirable—either due to legal restrictions (certain industries, such as healthcare and finance, have strict regulations that prohibit conducting experiments on customers), logistical limitations (it may be difficult to randomly divide the customer base into two groups for testing), or other reasons.
There are different ways to tackle this problem, but since this is a narrow topic that is out of the scope of this book, we will not cover it in detail. For your convenience, however, we list them here so you can use them as keywords for a more in-depth look:
Monitoring an ongoing experiment is important for ensuring it runs smoothly and produces reliable results. If something goes wrong during the experiment, it is critical to promptly identify and address the problem to avoid a growing negative effect. Sometimes it is possible to exclude specific users, items, or segments from an experiment and move on. However, this option is not recommended if it was not considered in the initial design. In other cases, we prefer to terminate an experiment entirely and start investigating factors that have caused the failure.
In section 12.2.1, we mentioned control and auxiliary metrics that we should monitor during the test. They can hint if something goes wrong before the key metrics experience a significant drop. For instance, it is important to track user feedback—if you notice a significant drop in user engagement or a spike in user churn, it is a clear indicator of malfunction. In this case, it may indicate that the experimental group is not responding well to the pushed changes. This information can easily lead to an early termination.
Also, when conducting an A/B test, we should consider fairness and biases among different segments of users, which can affect target and auxiliary metrics. This analysis is similar to what we covered in depth in chapter 9. For further reading, we recommend “Fairness through Experimentation: Inequality in A/B Testing as an Approach to Responsible Design” by Guillaume Saint-Jacques et al. ().
The most valuable measurement to track during an A/B experiment is the uplift or the relative difference in the key metric between group A and group B. The longer the experiment runs, the greater the sensitivity of the test and the smaller the effect (either positive or negative) we can detect.
Figure 12.9 shows a funnel representing the range of effects that we cannot detect with statistical significance. If the uplift moves outside this funnel, the effect is significant. Please note that without a specific design, you cannot peek into the test as many times as you wish until you see the desired results. For further reading on the subject, we recommend “Peeking Problem—The Fatal Mistake in A/B Testing and Experimentation” by Oleg Ya () and “Choosing Sequential Testing Framework—Comparisons and Discussions” by Mårten Schultzberg et al. ().
This funnel can be calculated based on the MDC. Here, we solve the same equation we saw earlier, this time for different time periods, but we cannot make a decision until the period we calculated in advance has passed. You could ask: what’s the point of using it then? The answer is to have an emergency stop criterion to abort the experiment if results fall outside the funnel on the negative side, the least expected, desirable, and probable outcome. With a proper sequential testing framework (e.g., mSPRT) mentioned earlier, we can make a decision as soon as results move outside the funnel or the experiment time has ended.
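Inverting the sample-size equation gives the boundary of this funnel: the smallest change detectable once n samples per group have accumulated (same notation and assumptions as before):

```python
from statistics import NormalDist

def minimum_detectable_change(n, variance=1.0, alpha=0.05, power=0.8):
    """Smallest detectable change given n samples per group (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * (2 * variance / n) ** 0.5

# The funnel narrows as the experiment accumulates samples:
for n in [100, 400, 1570]:
    print(n, round(minimum_detectable_change(n), 3))
```

Plotting this value against elapsed time (via the expected sample accumulation rate) reproduces the funnel shape of figure 12.9.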
It might be tempting to prematurely stop or extend an A/B test based on interim results. However, this can lead to erroneous conclusions due to changes in statistical power and the risk of false positives or negatives if we have not incorporated sequential testing. So when should we stop an experiment if we are not doing sequential testing?
In general, an experiment should run according to the predetermined design unless there’s a catastrophic drawdown in key metrics (a significant negative effect) that would result in a meaningful financial loss had the experiment continued.
On the other hand, if you see positive effects in key metrics, the best practice is to stick with the initial plan and let the experiment run its full course. It allows you to confirm that these positive effects are sustained over time and are not a result of short-term fluctuations.
What about those borderline cases where some metrics are positive, some are negative, and others hover around the MDE? It’s crucial during the experimental design phase to decide how you will interpret these mixed outcomes and set priorities among the different key metrics.
If you can’t wait to see the final results, consider using methods like sequential testing, which allows for repeated significance testing at different stages of the experiment. These methods come with their own assumptions and considerations, so make sure you understand them fully before proceeding.
Deciding whether to stop or continue an experiment is a complex task that requires careful consideration of the metrics, potential outcomes, and their effects. Document these decisions in your experimental protocol, including criteria for possible early termination.
After finishing the experiment, it is time to start our analysis. Here we need to calculate the required metrics, double-check monitoring, and provide confidence intervals for every measurement.
Not every experiment is successful. In our experience, having 20% of A/B tests show a statistically significant difference is already a good result (recall that with a false positive rate of 5%, the net margin of true wins is around 15%). And remember: a statistically significant result doesn’t mean the effect is positive or large.
Suppose we decide that the A/B test is successful and the effect is significant. In that case, we have two options: run one more iteration of the experiment on a larger proportion of traffic to validate the effect, or switch to the new system completely and roll it out to 100% of users.
We have provided the core information of the report (of course, the reported metrics should be understandable to the audience, preferably expressed in earned or saved money, longer user sessions, etc.). In addition, it is good to dive deeper into the change in metrics, analyze how it affected different user or item segments, and outline the further steps (how overall metrics might change if we roll out to 100%). Table 12.1 shows examples of what fields can be included in the reporting table.
| Metric | Group A | Group B | MDE | Lift | p-value | Conclusion |
|---|---|---|---|---|---|---|
| CVR | 75.2% | 79.8% | 5% | 6.12% | 0.0472 | +4.2–6.9% (significant) |
| AOV | $232.2 | $242.8 | 11% | 4.57% | 0.3704 | No significant effect |
| … | … | … | … | … | … | … |
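The lift column in table 12.1 is just the relative difference between groups; as a sanity check, these two lines reproduce the table’s lift values:

```python
def lift(a, b):
    """Relative uplift of group B over group A."""
    return (b - a) / a

print(f"CVR lift: {lift(0.752, 0.798):.2%}")   # 6.12%
print(f"AOV lift: {lift(232.2, 242.8):.2%}")   # 4.57%
```

The p-value and confidence interval columns additionally require the group sizes, which is why sample sizes are usually reported alongside the table.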
Once uplift is reported, the positive scenario may include one more experiment on a larger proportion of traffic, a complete switch to the new system, or a full-scale rollout followed by a pause and a reversed A/B test. A series of successful A/B experiments provides a solid data-driven argument for deciding whether to give the green light to further steps.
Writing a debrief document is valuable for transparent communication of A/B experiment results, as well as for improving the quality of future experiments. It should be created during or immediately after the experiment to summarize key findings, including captured insights, detected problems, and recommendations. Sharing this document with the team ensures everyone is on the same page and enables continuous learning and improvement of the system.
If the experiment is successful, the debrief document can include suggestions for similar experiments in other products. In case of failure, it is important to discuss what should be done differently in future tests and what mistakes should be avoided, as well as to develop new control metrics that would catch similar failures earlier in future experiments.
Because reporting cannot be designed at the preproduction stage, when the design document is prepared, this section would normally be skipped by default. However, we’ve included it for demonstration purposes.
Measuring and reporting for Supermegaretail as part of the design document have their own peculiarities. A crucial one is that, since we are predicting the future, we can only assess the quality of the system once the future arrives. At the end of the day, we want to understand how much profit we will be able to generate, but we can only know this after the fact; until then, the only way to evaluate the system is through metrics that are not directly related to profit.
We’ve included this phase as a template for the PhotoStock Inc. case.