Building a robust evaluation process is essential for a machine learning (ML) system, and in this chapter, we will cover the process of building a proper validation schema to achieve confident estimates of system performance. We will touch upon typical validation schemas, as well as how to select the right validation based on the specifics of a given problem and what factors to consider when designing the evaluation process in the wild.
A proper validation procedure aims to imitate what knowledge we are supposed to have and what knowledge can be dropped while operating in a real-life environment. This is somewhat connected to the overfitting problem or generalization, which we’ll cover in detail in chapter 9.
It also provides a reliable and robust estimation of a system’s performance, ideally with some theoretical guarantees. For example, we may guarantee that the true value will fall between the lower and upper confidence bounds 95 times out of 100 (this case will be covered in a campfire story from Valerii later in the chapter). It also helps detect and prevent data leaks, overfitting, and divergence between offline and online performance.
Performance estimation is the primary goal of validation. We use validation to estimate the model’s predictive power on unseen data, and the preferred schema is usually the one with the highest reliability and robustness (i.e., low bias/low variance).
As long as we have a reliable and robust performance estimation, we can use it for various things, like hyperparameter optimization or architecture, algorithm, and feature selection. To some extent, there is a similarity to A/B testing, where a schema yielding lower variance provides higher sensitivity; we will cover this later in the chapter.
When validating anything, it is almost always a good idea to build a stable and reliable pipeline that produces repeatable results (see figure 7.1). The standard advice you will most likely find in the literature comes down to the classic three-way split: divide the data into training, validation, and test datasets. A training set is used for model training, a validation set is designed to evaluate performance during training, and a test set is used to calculate final metrics. This three-set approach is well known to those familiar with competitive ML (e.g., challenges hosted by Kaggle) or academia. At the same time, there are subtle but important distinctions within applied ML that we will discuss further in this chapter.
There are some points to pay attention to:
That is why using the same validation split over and over again for evaluation and searching for optimal hyperparameters or anything else will lead to biased/overfitted and nonrobust results. For this reason, instead of viewing validation as the thing done once at the very beginning, we view it as a continuous process to be done repeatedly once the context of the system changes (e.g., there are new sources of data, new features, potential feedback loops caused by model usage, etc.).
We are never 100% sure what the world will bring next; that’s why we must expect the unexpected.
As practice shows, you won’t need to reinvent the wheel when picking a validation schema for your ML system. Most of the standard schemas are time-tested and well-performing solutions that mostly require you to pick one that fits the requirements of your project. We will briefly cover these schemas in several subsections.
Classic validation schemas are well implemented in the evergreen Python ML library scikit-learn, and all the relevant documentation is worth reading if you have doubts about your knowledge of the material. The information is available at .
We’ll start by splitting the dataset into two or more chunks. Probably the golden classic mentioned in almost any book on ML is the training/validation/test split we discussed earlier.
With this approach, we partition data into three sets (randomly, or based on a specific criterion or strata) with different ratios—for example, 60/20/20 (see figure 7.2). The percentage may vary depending on the number of samples and metrics (the amount of data, metric variance, sensitivity, robustness, and reliability requirements). Empirically, the bigger the full dataset, the smaller the share dedicated to validation and testing, so the training share grows faster than the others. The test set (i.e., outer validation) is used for the final model evaluation and should never be used for any other purpose. Meanwhile, we can use the validation set (i.e., inner validation) primarily for model comparison or tuning hyperparameters.
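A 60/20/20 split can be sketched in scikit-learn with two successive calls to train_test_split (the toy data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 samples, 5 features, a synthetic binary target.
X = np.random.RandomState(0).rand(1000, 5)
y = (X[:, 0] > 0.5).astype(int)

# First carve out the test set (outer validation), then split the
# remainder into training and validation (inner validation): 60/20/20.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25,  # 0.25 * 0.8 = 0.2 overall
    random_state=0, stratify=y_train_val)
```

Stratifying on the target keeps class proportions comparable across all three sets, which matters most when the dataset is small or imbalanced.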
Holdout validation is a good choice for computationally expensive models, such as deep learning models. It is easy to implement and doesn’t add much time to the learning loop.
But let’s remember that we take a single random subsample of all the data. Not reusing all available data may lead to biased evaluation or underutilization of what we have. What’s the worst part? We get a single number that does not allow us to understand the distribution of the estimates.
The silver bullet for resolving such a problem in statistics is the bootstrap procedure. In the validation case, it would look like randomly sampling train/validation splits many times, training and evaluating the model on each iteration. But training a model is time-consuming, and we want to iterate quickly for general parameter tweaking and experimentation. So how can we do it?
We can use a similar but simplified sampling procedure called cross-validation. We can split data into K folds (usually five), exclude each of them one by one, fit the model to the K – 1 folds of data, and measure performance on the excluded fold. Hence, we get K estimates and can calculate their mean and standard deviation. As a result, we get five numbers instead of one, which is more representative (see figure 7.3).
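A minimal sketch of five-fold cross-validation with scikit-learn (the dataset and model here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder dataset.
X, y = make_classification(n_samples=500, random_state=0)

# K = 5 folds: each fold serves once as the held-out evaluation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Five estimates instead of one: we can report both mean and spread.
print(scores.mean(), scores.std())
```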
There are several variations of cross-validation, including:
Suppose we predict the flow rate of oil at hundreds of wells. Wells are grouped based on their location: neighboring wells extract oil from the same oil field, so their production affects each other. For this case, a grouped K-fold is a reasonable choice. Finding a proper criterion for grouping samples while assigning them to folds is one of the key decisions for validation overall, and mistakes here greatly affect the result.
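A sketch of how a grouped K-fold keeps all wells of one field on the same side of the split (the group IDs here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 12 samples coming from 4 oil fields (groups).
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# GroupKFold guarantees that all wells of one field end up entirely
# on one side of the split, so neighboring wells never leak into both.
gkf = GroupKFold(n_splits=4)
splits = list(gkf.split(X, y, groups=groups))
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```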
The only question left is what number of folds to choose. The choice of K is dictated by three variables: bias, variance, and computation time. The rule of thumb is to use K = 5, which provides a good balance between bias and variance.
An extreme case for K is a leave-one-out cross-validation when each fold contains a single sample of data; thus K is equal to the overall number of samples in the dataset. This schema is the worst in terms of computation time and variance, but it’s the best in terms of bias.
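For illustration, leave-one-out produces as many folds as there are samples, which is why it is so expensive:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy dataset with 10 samples.
X = np.arange(20).reshape(10, 2)

loo = LeaveOneOut()
# Number of folds equals the number of samples: 10 models to train.
n_splits = loo.get_n_splits(X)
print(n_splits)
```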
There is a classic paper by Ron Kohavi from 1995 titled “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection” () that provides the following guidelines:
It is important to remember that the validation schema’s high sensitivity (i.e., low variance) only matters when the changes we try to catch in the model’s performance are small.
When dealing with time-sensitive data, we can’t sample data randomly. Sales of products on neighboring days share some information with each other. Similarly, recent user actions provide a hint on some aspects of their later actions. But we can’t predict the past based on data from the future. In time series, the distribution of patterns is not uniform along the dataset, and we must figure out other kinds of validation schemas. How do we evaluate the model in this case?
Validation schemas used in time-series data are similar to the holdout set and cross-validation but with nonrandom splitting by timestamp. The recommendations for choosing the number of folds and their size in rolling cross-validation are similar to the ordinary K-fold.
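scikit-learn’s TimeSeriesSplit implements such a rolling split; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations already sorted by timestamp.
X = np.arange(24).reshape(12, 2)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Training indices always precede test indices: no peeking into the future.
    assert train_idx.max() < test_idx.min()
```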
Time-series validation adds several extra degrees of freedom that need to be considered. A great paper, “Evaluating Time Series Forecasting Models” by Cerqueira et al. (), elaborates on the following points:
While time-series validation is one of the most defect-sensitive validation methods, relying solely on the simple “don’t look at future data while training” rule would be far too shortsighted. Following this rule can save you from 95% of typical mistakes; still, there are cases where you may need to break it. For example, ML applied to financial data (such as stock market time series) is known for its high bar in precise validation requirements. At the same time, some experts in the area highlight that trivial time-series validation, as shown in figure 7.4, can lead to overfitting caused by limited data subsets (for more details, see “Backtesting Through Cross-Validation,” chapter 12 of Marcos Lopez de Prado’s Advances in Financial Machine Learning; Wiley). A similar reason to violate this rule may be rooted in your need to estimate how the model performs in anomaly scenarios. To get this signal, you can train the model on data from 2017 to 2019 and 2021 to 2023 and later test it on data from the COVID period of 2020. Such a split barely works as the default validation schema but still can be useful as auxiliary information.
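The auxiliary anomaly split described above can be sketched with a simple year mask (the column names, date range, and synthetic target here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning 2017-2023.
dates = pd.date_range("2017-01-01", "2023-12-31", freq="D")
df = pd.DataFrame({"ds": dates,
                   "y": np.random.RandomState(0).rand(len(dates))})

# Train on 2017-2019 and 2021-2023, evaluate on the
# out-of-distribution COVID year 2020.
test_mask = df["ds"].dt.year == 2020
train_df, test_df = df[~test_mask], df[test_mask]
```

Again, this split is auxiliary: it deliberately breaks the “past predicts future” convention to probe how the model behaves under anomalous conditions.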
Sometimes you need to use a combination of different schemas. In the earlier example of flow rate prediction, we might combine grouped K-fold validation with time-series validation:
import numpy as np
from sklearn.model_selection import GroupKFold

def grouped_time_series_kfold(model, X, y, groups, n_folds=5,
                              n_repeats=10, seed=0):
    """Grouped K-fold that also respects temporal order: training
    samples must precede the earliest sample of the test groups
    (data is assumed to be sorted by time)."""
    scores = []
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    for _ in range(n_repeats):
        gkf = GroupKFold(n_splits=n_folds)
        # GroupKFold is deterministic, so shuffling the groups
        # varies the fold assignment on every repeat.
        shuffled_groups = rng.permutation(unique_groups)
        for train_group_idx, test_group_idx in gkf.split(
                shuffled_groups, groups=shuffled_groups):
            train_groups = shuffled_groups[train_group_idx]
            test_groups = shuffled_groups[test_group_idx]
            test_indices = np.where(np.isin(groups, test_groups))[0]
            # Ensure temporal order: train only on samples that come
            # before the earliest test sample.
            train_end = np.min(test_indices)
            train_mask = (np.isin(groups, train_groups)
                          & (np.arange(len(groups)) < train_end))
            test_mask = np.isin(groups, test_groups)
            if not train_mask.any():
                continue  # no past data available for this fold
            model.fit(X[train_mask], y[train_mask])
            scores.append(model.score(X[test_mask], y[test_mask]))
    return np.array(scores)
We’ve reviewed standard validation schemas that cover most ML applications. Sometimes they are not enough to reflect the actual difference between seen and unseen data, even if you use a combination of them (e.g., time-based validation with group K-fold). As you know, inadequate validation leads to data leakage and, consequently, an overly optimistic (if not outright random!) estimation of model performance.
Such situations require you to look for unorthodox processes. Let’s review some of them.
Nested validation is an approach used when we want to run hyperparameter optimization (or any other model selection procedure) as part of the learning process. We can’t just use an excluded fold or holdout set, which we will need for the final evaluation to estimate how good a given set of parameters is. Access to the score on the testing data while fitting any parameters is a direct way to overfitting.
Instead, we use a fold-in-fold schema. We add an “inner” split of training data in each “outer” split to tune the parameters first. Then we fit the model on all available training folds with selected hyperparameters and make a prediction for the data that was not seen during hyperparameter tuning. Thus, we get two layers of validation, each of which can have its specific properties (e.g., we may prefer the inner layer to have lower variance and the outer layer to have a lower bias). We can apply nesting not only to cross-validation but also to time-series validation and the ordinary holdout split (or mixed schemas of different natures) (see figures 7.5 and 7.6).
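In scikit-learn, nesting falls out naturally by wrapping a search object in an outer cross-validation; a minimal sketch with a placeholder model and parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic placeholder dataset.
X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter search runs on the training folds only.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: evaluates the whole "tune-then-fit" procedure
# on folds that the search never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
```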
Instead of using a random subsample of data like in a standard holdout set, you may prefer to choose a different path. There is a technique called adversarial validation, a popular approach on ML competition platforms such as Kaggle. It applies an ML model for better validation of another ML model.
Adversarial validation numerically estimates whether two given datasets differ (those two may be sets of labeled and unlabeled data). If they do, it can even quantify the difference at the sample level, making it possible to construct an arbitrary number of datasets representative of each other, which provides a perfect tool for estimation. An additional bonus is that it does not require data to be labeled.
The algorithm is simple:
Note that while this trick has been used in ML competitions for a long time (the earliest mention we found dates back to 2016, ), it was not part of more formal research until 2020, when it appeared in the paper “Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber” by Pan et al. ().
We can use this kind of splitting in many cases. When we’re checking the similarity of labeled and unlabeled datasets, there are questions we should keep in mind. How different are their distributions? What features are the best predictors of this difference? Analyzing the model created by adversarial validation may answer these questions. We will also reuse this technique for a similar matter in chapter 9.
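The adversarial validation procedure can be sketched as follows: merge the two datasets, label each sample by its origin, and check how well a classifier separates them (the synthetic data and the distribution shift here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
train = rng.normal(0.0, 1.0, size=(500, 5))  # "labeled" dataset
test = rng.normal(0.5, 1.0, size=(500, 5))   # shifted "unlabeled" dataset

# Label each sample by its dataset of origin and try to tell them apart.
X = np.vstack([train, test])
is_test = np.r_[np.zeros(500), np.ones(500)]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
auc = cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()
# AUC near 0.5 -> the datasets look alike; near 1.0 -> strong shift.
```

Inspecting the classifier’s feature importances then shows which features drive the difference between the two datasets.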
We find an interesting validation technique in a paper by DeepMind titled “Improving Language Models by Retrieving from Trillions of Tokens” (2021; ), which proposes a generative model trained on the next-word-prediction task.
The paper’s authors enhance the language model by conditioning it on a context retrieved from a large corpus based on local similarity with preceding tokens. This system memorizes the entire dataset and performs the nearest-neighbors search to find chunks of text in the history that are relevant to the recent sentences. But what if the sentences we try to continue are almost identical to those the model has seen in the training set? It looks like there is a high probability of encountering dataset leakage.
The authors discussed this problem in advance and proposed a noteworthy evaluation procedure. They developed a specific measure to quantify leakage exploitation.
The general idea is the following:
This approach forces the model to retrieve useful information from similar texts and paraphrase it instead of copy-pasting it. You can use this procedure with any modern language model. It is a good example of an exotic technique that allows for minimizing data leakage and increasing the representativity of your dataset. A clear understanding of how the model will be applied will help you develop your own nontrivial validation schema if standard approaches are unsuitable.
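As a rough illustration of the idea (a simplification, not the exact measure from the paper), one can score the token-level n-gram overlap between training and evaluation texts and filter out evaluation chunks that are too similar to the training data:

```python
def ngram_overlap(train_text, eval_text, n=8):
    """Fraction of the evaluation text's token n-grams that also occur
    in the training text: a crude leakage proxy."""
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    eval_grams = ngrams(eval_text)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_text)) / len(eval_grams)

# Evaluation chunks above some overlap threshold (e.g., 0.8) would be
# dropped or down-weighted before computing the final metric.
```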
We spend as much time on the test data as on the training data. —Andrej Karpathy
Regardless of which schema we use, we will probably apply it to a dynamically changing dataset. Periodically we get new data that may differ in distribution and include new patterns. How often should we update the test set to make sure our evaluation is always relevant?
There are at least two goals we may want to reach while designing a split update procedure for new data. First, we want our test set to be representative of these new patterns. From this point of view, the evaluation process should be adaptive.
Second, we want to see an evaluation dynamic: how has the model been changing through time with all updates in the architecture or features? For that, the estimates must be robust.
Some of the most common options are as follows (see figure 7.7):
Remember: we should perform validation on the whole pipeline, including the dataset; inference on the test set should be the same as in production. If we want to compare models side by side accurately, we should somehow save previous versions of both datasets and models. Tools for data version control and model version control (such as DVC, Git LFS, or Deep Lake) may be of help.
Once there are clues that the options here do not cover your particular use case, you may want to dive deeper into the literature dedicated to dynamic (nonstationary) data streams and concept drifts to get a holistic overview of related theory (e.g., “Scarcity of Labels in Non-Stationary Data Streams: A Survey” []). We will also scratch the surface of the concept drift problem in chapter 11, as one of the underlying reasons why setting up a reliable validation schema is not easy.
Time for another block of the design document, and this time we will fill in the information about preferred validation schemas for both Supermegaretail and PhotoStock Inc.
We start with Supermegaretail.
We will now add the information on validation schemas for PhotoStock Inc.