Building a robust evaluation process is essential for a machine learning (ML) system, and in this chapter, we will cover the process of building a proper validation schema to achieve confident estimates of system performance. We will touch upon typical validation schemas, as well as how to select the right validation based on the specifics of a given problem and what factors to consider when designing the evaluation process in the wild.
A proper validation procedure aims to imitate what knowledge we are supposed to have and what knowledge can be dropped while operating in a real-life environment. This is somewhat connected to the overfitting problem or generalization, which we’ll cover in detail in chapter 9.
It also provides a reliable and robust estimation of a system’s performance, ideally with some theoretical guarantees. For example, we may guarantee that the true value will fall between the lower and upper confidence bounds 95 times out of 100 (this case will be covered in a campfire story from Valerii later in the chapter). It also helps detect and prevent data leaks, overfitting, and divergence between offline and online performance.
Performance estimation is the primary goal of validation. We use validation to estimate the model’s predictive power on unseen data, and the preferred schema is usually the one with the highest reliability and robustness (i.e., low bias/low variance).
As long as we have a reliable and robust performance estimation, we can use it for various things, like hyperparameter optimization or architecture, algorithm, and feature selection. To some extent, there is a similarity to A/B testing, where a schema yielding lower variance provides higher sensitivity; we will cover this later in the chapter.
When validating anything, it is almost always a good idea to build a stable and reliable pipeline that produces repeatable results (see figure 7.1). The standard advice you will most likely find in the literature comes down to the classic three-way split: divide the data into training, validation, and test datasets. A training set is used for model training, a validation set is designed to evaluate performance during training, and a test set is used to calculate final metrics. This three-set approach is well known to those familiar with competitive ML (e.g., challenges hosted by Kaggle) or academia. At the same time, there are subtle but important distinctions within applied ML that we will discuss further in this chapter.
There are some points to pay attention to:
That is why using the same validation split over and over again for evaluation and searching for optimal hyperparameters or anything else will lead to biased/overfitted and nonrobust results. For this reason, instead of viewing validation as the thing done once at the very beginning, we view it as a continuous process to be done repeatedly once the context of the system changes (e.g., there are new sources of data, new features, potential feedback loops caused by model usage, etc.).
We are never 100% sure what the world will bring next; that’s why we must expect the unexpected.
As practice shows, you won’t need to reinvent the wheel when picking a validation schema for your ML system. Most of the standard schemas are time-tested and well-performing solutions that mostly require you to pick one that fits the requirements of your project. We will briefly cover these schemas in several subsections.
Classic validation schemas are well implemented in the evergreen Python ML library scikit-learn, and all the relevant documentation is worth reading if you have doubts about your knowledge of the material. The information is available at .
We’ll start by splitting the dataset into two or more chunks. Probably the golden classic mentioned in almost any book on ML is the training/validation/test split we discussed earlier.
With this approach, we partition data into three sets (randomly, or based on a specific criterion or strata) with different ratios—for example, 60/20/20 (see figure 7.2). The percentage may vary depending on the number of samples and metrics (the amount of data, metric variance, sensitivity, robustness, and reliability requirements). Empirically, the bigger the full dataset, the smaller the share dedicated to validation and testing, so the training share grows faster than the others. The test set (i.e., outer validation) is used for the final model evaluation and should never be used for any other purpose. Meanwhile, we can use the validation set (i.e., inner validation) primarily for model comparison or tuning hyperparameters.
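A 60/20/20 split can be sketched in scikit-learn with two successive calls to train_test_split (the toy data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 samples, 5 features, a synthetic binary target.
X = np.random.RandomState(0).rand(1000, 5)
y = (X[:, 0] > 0.5).astype(int)

# First carve out the test set (outer validation), then split the
# remainder into training and validation (inner validation): 60/20/20.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25,  # 0.25 * 0.8 = 0.2 overall
    random_state=0, stratify=y_train_val)
```

Stratifying on the target keeps class proportions comparable across all three sets, which matters most when the dataset is small or imbalanced.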
Holdout validation is a good choice for computationally expensive models, such as deep learning models. It is easy to implement and doesn’t add much time to the learning loop.
But let’s remember that we take a single random subsample of all the data. Not reusing all available data may lead to biased evaluation or underutilization of what we have. What’s the worst part? We get a single number that does not allow us to understand the distribution of the estimates.
The silver bullet for resolving such a problem in statistics is the bootstrap procedure. In the validation case, it would look like randomly sampling train/validation splits many times, training and evaluating the model on each iteration. But training a model is time-consuming, and we want to iterate quickly for general parameter tweaking and experimentation. So how can we do it?
We can use a similar but simplified sampling procedure called cross-validation. We can split data into K folds (usually five), exclude each of them one by one, fit the model to the K – 1 folds of data, and measure performance on the excluded fold. Hence, we get K estimates and can calculate their mean and standard deviation. As a result, we get five numbers instead of one, which is more representative (see figure 7.3).
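A minimal sketch of five-fold cross-validation with scikit-learn (the dataset and model here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder dataset.
X, y = make_classification(n_samples=500, random_state=0)

# K = 5 folds: each fold serves once as the held-out evaluation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Five estimates instead of one: we can report both mean and spread.
print(scores.mean(), scores.std())
```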
There are several variations of cross-validation, including:
Suppose we predict the flow rate of oil at hundreds of wells. Wells are grouped based on their location: neighboring wells extract oil from the same oil field, so their production affects each other. For this case, a grouped K-fold is a reasonable choice. Finding a proper criterion for grouping samples while assigning them to folds is one of the key decisions for validation overall, and mistakes here greatly affect the result.
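A sketch of how a grouped K-fold keeps all wells of one field on the same side of the split (the group IDs here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 12 samples coming from 4 oil fields (groups).
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# GroupKFold guarantees that all wells of one field end up entirely
# on one side of the split, so neighboring wells never leak into both.
gkf = GroupKFold(n_splits=4)
splits = list(gkf.split(X, y, groups=groups))
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```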
The only question left is what number of folds to choose. The choice of K is dictated by three variables: bias, variance, and computation time. The rule of thumb is to use K = 5, which provides a good balance between bias and variance.
An extreme case for K is a leave-one-out cross-validation when each fold contains a single sample of data; thus K is equal to the overall number of samples in the dataset. This schema is the worst in terms of computation time and variance, but it’s the best in terms of bias.
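For illustration, leave-one-out produces as many folds as there are samples, which is why it is so expensive:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy dataset with 10 samples.
X = np.arange(20).reshape(10, 2)

loo = LeaveOneOut()
# Number of folds equals the number of samples: 10 models to train.
n_splits = loo.get_n_splits(X)
print(n_splits)
```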
There is a classic paper by Ron Kohavi from 1995 titled “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection” () that provides the following guidelines:
It is important to remember that the validation schema’s high sensitivity (i.e., low variance) only matters when the changes we try to catch in the model’s performance are small.
When dealing with time-sensitive data, we can’t sample data randomly. Sales of products on neighboring days share some information with each other. Similarly, recent user actions provide a hint on some aspects of their later actions. But we can’t predict the past based on data from the future. In time series, the distribution of patterns is not uniform along the dataset, and we must figure out other kinds of validation schemas. How do we evaluate the model in this case?
Validation schemas used in time-series data are similar to the holdout set and cross-validation but with nonrandom splitting by timestamp. The recommendations for choosing the number of folds and their size in rolling cross-validation are similar to the ordinary K-fold.
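scikit-learn’s TimeSeriesSplit implements such a rolling split; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations already sorted by timestamp.
X = np.arange(24).reshape(12, 2)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Training indices always precede test indices: no peeking into the future.
    assert train_idx.max() < test_idx.min()
```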
Time-series validation adds several extra degrees of freedom that need to be considered. A great paper, “Evaluating Time Series Forecasting Models” by Cerqueira et al. (), elaborates on the following points:
While time-series validation is one of the most defect-sensitive validation methods, relying solely on the simple “don’t look at future data while training” rule would be far too shortsighted. Following this rule can save you from 95% of typical mistakes; still, there are cases where you may need to break it. For example, ML applied to financial data (such as stock market time series) is known for its high bar in precise validation requirements. At the same time, some experts in the area highlight that trivial time-series validation, as shown in figure 7.4, can lead to overfitting caused by limited data subsets (for more details, see “Backtesting Through Cross-Validation,” chapter 12 of Marcos Lopez de Prado’s Advances in Financial Machine Learning; Wiley). A similar reason to violate this rule may be rooted in your need to estimate how the model performs in anomaly scenarios. To get this signal, you can train the model on data from 2017 to 2019 and 2021 to 2023 and later test it on data from the COVID period of 2020. Such a split barely works as the default validation schema but still can be useful as auxiliary information.
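The auxiliary anomaly split described above can be sketched with a simple year mask (the column names, date range, and synthetic target here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning 2017-2023.
dates = pd.date_range("2017-01-01", "2023-12-31", freq="D")
df = pd.DataFrame({"ds": dates,
                   "y": np.random.RandomState(0).rand(len(dates))})

# Train on 2017-2019 and 2021-2023, evaluate on the
# out-of-distribution COVID year 2020.
test_mask = df["ds"].dt.year == 2020
train_df, test_df = df[~test_mask], df[test_mask]
```

Again, this split is auxiliary: it deliberately breaks the “past predicts future” convention to probe how the model behaves under anomalous conditions.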
Sometimes you need to use a combination of different schemas. In the earlier example of flow rate prediction, we might combine grouped K-fold validation with time-series validation:
import numpy as np
from sklearn.model_selection import GroupKFold

def grouped_time_series_kfold(model, X, y, groups, n_folds=5,
                              n_repeats=10, seed=0):
    """Grouped K-fold that also respects temporal order: training
    samples must precede the earliest sample of the test groups
    (data is assumed to be sorted by time)."""
    scores = []
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    for _ in range(n_repeats):
        gkf = GroupKFold(n_splits=n_folds)
        # GroupKFold is deterministic, so shuffling the groups
        # varies the fold assignment on every repeat.
        shuffled_groups = rng.permutation(unique_groups)
        for train_group_idx, test_group_idx in gkf.split(
                shuffled_groups, groups=shuffled_groups):
            train_groups = shuffled_groups[train_group_idx]
            test_groups = shuffled_groups[test_group_idx]
            test_indices = np.where(np.isin(groups, test_groups))[0]
            # Ensure temporal order: train only on samples that come
            # before the earliest test sample.
            train_end = np.min(test_indices)
            train_mask = (np.isin(groups, train_groups)
                          & (np.arange(len(groups)) < train_end))
            test_mask = np.isin(groups, test_groups)
            if not train_mask.any():
                continue  # no past data available for this fold
            model.fit(X[train_mask], y[train_mask])
            scores.append(model.score(X[test_mask], y[test_mask]))
    return np.array(scores)
We’ve reviewed standard validation schemas that cover most ML applications. Sometimes they are not enough to reflect the actual difference between seen and unseen data, even if you use a combination of them (e.g., time-based validation with group K-fold). As you know, inadequate validation leads to data leakage and, consequently, an overly optimistic (if not outright random!) estimation of model performance.
Such situations require you to look for unorthodox processes. Let’s review some of them.
Nested validation is an approach used when we want to run hyperparameter optimization (or any other model selection procedure) as part of the learning process. We can’t just use an excluded fold or holdout set, which we will need for the final evaluation to estimate how good a given set of parameters is. Access to the score on the testing data while fitting any parameters is a direct way to overfitting.
Instead, we use a fold-in-fold schema. We add an “inner” split of training data in each “outer” split to tune the parameters first. Then we fit the model on all available training folds with selected hyperparameters and make a prediction for the data that was not seen during hyperparameter tuning. Thus, we get two layers of validation, each of which can have its specific properties (e.g., we may prefer the inner layer to have lower variance and the outer layer to have a lower bias). We can apply nesting not only to cross-validation but also to time-series validation and the ordinary holdout split (or mixed schemas of different natures) (see figures 7.5 and 7.6).
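In scikit-learn, nesting falls out naturally by wrapping a search object in an outer cross-validation; a minimal sketch with a placeholder model and parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic placeholder dataset.
X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter search runs on the training folds only.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: evaluates the whole "tune-then-fit" procedure
# on folds that the search never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
```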
Instead of using a random subsample of data like in a standard holdout set, you may prefer to choose a different path. There is a technique called adversarial validation, a popular approach on ML competition platforms such as Kaggle. It applies an ML model for better validation of another ML model.
Adversarial validation numerically estimates whether two given datasets differ (those two may be sets of labeled and unlabeled data). If they do, it can even quantify the difference at the sample level, making it possible to construct an arbitrary number of datasets representative of each other, which provides a perfect tool for estimation. An additional bonus is that it does not require data to be labeled.
The algorithm is simple:
Note that while this trick has been used in ML competitions for a long time (the earliest mention we found dates back to 2016, ), it was not part of more formal research until 2020, when it appeared in the paper “Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber” by Pan et al. ().
We can use this kind of splitting in many cases. When we’re checking the similarity of labeled and unlabeled datasets, there are questions we should keep in mind. How different are their distributions? What features are the best predictors of this difference? Analyzing the model created by adversarial validation may answer these questions. We will also reuse this technique for a similar matter in chapter 9.
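The adversarial validation procedure can be sketched as follows: merge the two datasets, label each sample by its origin, and check how well a classifier separates them (the synthetic data and the distribution shift here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
train = rng.normal(0.0, 1.0, size=(500, 5))  # "labeled" dataset
test = rng.normal(0.5, 1.0, size=(500, 5))   # shifted "unlabeled" dataset

# Label each sample by its dataset of origin and try to tell them apart.
X = np.vstack([train, test])
is_test = np.r_[np.zeros(500), np.ones(500)]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
auc = cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()
# AUC near 0.5 -> the datasets look alike; near 1.0 -> strong shift.
```

Inspecting the classifier’s feature importances then shows which features drive the difference between the two datasets.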
We find an interesting validation technique in a paper by DeepMind titled “Improving Language Models by Retrieving from Trillions of Tokens” (2021; ), which proposes a generative model trained on the next-word-prediction task.
The paper’s authors enhance the language model by conditioning it on a context retrieved from a large corpus based on local similarity with preceding tokens. This system memorizes the entire dataset and performs the nearest-neighbors search to find chunks of text in the history that are relevant to the recent sentences. But what if the sentences we try to continue are almost identical to those the model has seen in the training set? It looks like there is a high probability of encountering dataset leakage.
The authors discussed this problem in advance and proposed a noteworthy evaluation procedure. They developed a specific measure to quantify leakage exploitation.
The general idea is the following:
This approach forces the model to retrieve useful information from similar texts and paraphrase it instead of copy-pasting it. You can use this procedure with any modern language model. It is a good example of an exotic technique that allows for minimizing data leakage and increasing the representativity of your dataset. A clear understanding of how the model will be applied will help you develop your own nontrivial validation schema if standard approaches are unsuitable.
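As a rough illustration of the idea (a simplification, not the exact measure from the paper), one can score the token-level n-gram overlap between training and evaluation texts and filter out evaluation chunks that are too similar to the training data:

```python
def ngram_overlap(train_text, eval_text, n=8):
    """Fraction of the evaluation text's token n-grams that also occur
    in the training text: a crude leakage proxy."""
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    eval_grams = ngrams(eval_text)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_text)) / len(eval_grams)

# Evaluation chunks above some overlap threshold (e.g., 0.8) would be
# dropped or down-weighted before computing the final metric.
```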
We spend as much time on the test data as on the training data. —Andrej Karpathy
Regardless of which schema we use, we will probably apply it to a dynamically changing dataset. Periodically we get new data that may differ in distribution and include new patterns. How often should we update the test set to make sure our evaluation is always relevant?
There are at least two goals we may want to reach while designing a split update procedure for new data. First, we want our test set to be representative of these new patterns. From this point of view, the evaluation process should be adaptive.
Second, we want to see an evaluation dynamic: how has the model been changing through time with all updates in the architecture or features? For that, the estimates must be robust.
Some of the most common options are as follows (see figure 7.7):
Remember: we should perform validation on the whole pipeline, including the dataset; inference on the test set should be the same as in production. If we want to compare models side by side accurately, we should somehow save previous versions of both datasets and models. Tools for data version control and model version control (such as DVC, Git LFS, or Deep Lake) may be of help.
Once there are clues that the options here do not cover your particular use case, you may want to dive deeper into the literature dedicated to dynamic (nonstationary) data streams and concept drifts to get a holistic overview of related theory (e.g., “Scarcity of Labels in Non-Stationary Data Streams: A Survey” []). We will also scratch the surface of the concept drift problem in chapter 11, as one of the underlying reasons why setting up a reliable validation schema is not easy.
Time for another block of the design document, and this time we will fill in the information about preferred validation schemas for both Supermegaretail and PhotoStock Inc.
We start with Supermegaretail.
We will now add the information on validation schemas for PhotoStock Inc.