Book: Machine Learning System Design
Back: 6 Gathering datasets
Next: 8 Baseline solution

7 Validation schemas

This chapter covers

  • Ensuring reliable evaluation
  • Standard validation schemas
  • Nontrivial validation schemas
  • Split updating procedure
  • Validation schemas as part of the design document

Building a robust evaluation process is essential for a machine learning (ML) system, and in this chapter, we will cover the process of building a proper validation schema to achieve confident estimates of system performance. We will touch upon typical validation schemas, as well as how to select the right validation based on the specifics of a given problem and what factors to consider when designing the evaluation process in the wild.

A proper validation procedure aims to imitate what knowledge we are supposed to have and what knowledge can be dropped while operating in a real-life environment. This is somewhat connected to the overfitting problem or generalization, which we’ll cover in detail in chapter 9.

It also provides a reliable and robust estimation of a system’s performance, ideally with some theoretical guarantees. As an example, we guarantee that a real value will be in the range between the lower confidence bound and upper confidence bound 95 times out of 100 (this case will be covered in a campfire story from Valerii later in the chapter). It also helps detect and prevent data leaks, overfitting, and divergence between offline and online performance.

Performance estimation is the primary goal of validation. We use validation to estimate the model’s predictive power on unseen data, and the preferred schema is usually the one with the highest reliability and robustness (i.e., low bias/low variance).

As long as we have a reliable and robust performance estimation, we can use it for various things, like hyperparameter optimization, architecture, algorithm, and feature selection. To some extent, there is a similarity to A/B testing, where a schema yielding lower variance provides higher sensitivity; we will cover this later in the chapter.

7.1 Reliable evaluation

When validating anything, it is almost always a good idea to build a stable and reliable pipeline that produces repeatable results (see figure 7.1). The standard advice you will most likely find in the literature comes down to a classic three-way split: divide the data into training, validation, and test sets. A training set is used for model training, a validation set is used to evaluate performance during training, and a test set is used to calculate final metrics. This three-set approach is well known to those familiar with competitive ML (e.g., challenges hosted by Kaggle) or academia. At the same time, there are subtle but important distinctions within applied ML that we will discuss further in this chapter.

figure
Figure 7.1 Basic high-level model development cycle

There are some points to pay attention to:

That is why using the same validation split over and over again for evaluation and searching for optimal hyperparameters or anything else will lead to biased/overfitted and nonrobust results. For this reason, instead of viewing validation as the thing done once at the very beginning, we view it as a continuous process to be done repeatedly once the context of the system changes (e.g., there are new sources of data, new features, potential feedback loops caused by model usage, etc.).

We are never 100% sure what the world will bring next; that’s why we must expect the unexpected.

7.2 Standard schemas

As practice shows, you won’t need to reinvent the wheel when picking a validation schema for your ML system. Most of the standard schemas are time-tested and well-performing solutions that mostly require you to pick one that fits the requirements of your project. We will briefly cover these schemas in several subsections.

Classic validation schemas are well implemented in the evergreen Python ML library scikit-learn, and all the relevant documentation is worth reading if you have doubts about your knowledge of the material. The information is available at .

7.2.1 Holdout sets

We’ll start by splitting the dataset into two or more chunks. Probably the golden classic mentioned in almost any book on ML is the training/validation/test split we discussed earlier.

With this approach, we partition data into three sets (randomly or based on a specific criterion or strata) with different ratios—for example, 60/20/20 (see figure 7.2). The percentages may vary depending on the number of samples and the metrics (the amount of data, metric variance, sensitivity, robustness, and reliability requirements). Empirically, the bigger the full dataset, the smaller the share dedicated to validation and testing, so the training share grows faster. The test set (i.e., outer validation) is used for the final model evaluation and should never be used for any other purpose. Meanwhile, we can use the validation set (i.e., inner validation) primarily for model comparison or tuning hyperparameters.
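A 60/20/20 split like the one above can be sketched with two calls to scikit-learn's train_test_split; the toy arrays here are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples with 2 features each
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve out the test set (20%), then split the remainder 75/25,
# so the final proportions are 60/20/20.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

Fixing random_state makes the split reproducible, which is one prerequisite of the repeatable pipeline discussed in section 7.1.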

figure
Figure 7.2 Standard by-the-book data split

7.2.2 Cross-validation

Holdout validation is a good choice for computationally expensive models, such as deep learning models. It is easy to implement and doesn't add much time to the learning loop.

But let's remember that we take a single random subsample from all the data. Not reusing all available data may lead to a biased evaluation and underutilization of what we have. Worse still, we get a single number, which tells us nothing about the distribution of the estimates.

The silver bullet for resolving such a problem in statistics is a bootstrap procedure. In the validation case, it would look like randomly sampling train validation splits many times, training and evaluating the model each iteration. Training a model is time-consuming, and we want to iterate quickly for general parameter tweaking and experimentation. So how can we do it?

We can use a similar but simplified sampling procedure called cross-validation. We can split data into K folds (usually five), exclude each of them one by one, fit the model to the K – 1 folds of data, and measure performance on the excluded fold. Hence, we get K estimates and can calculate their mean and standard deviation. As a result, we get five numbers instead of one, which is more representative (see figure 7.3).
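The K-fold procedure above can be sketched in a few lines with scikit-learn; the synthetic regression data and the ridge model are stand-ins for your own dataset and estimator:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; replace with your own X and y
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# Five folds: each sample is held out exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")

# Five estimates instead of one: we can now report mean and spread
print(scores.round(3))
print(scores.mean(), scores.std())
```

Reporting both the mean and the standard deviation is what makes the K-fold estimate more representative than a single holdout number.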

figure
Figure 7.3 K-fold split: each sample is assigned to a fold, and each fold provides validation once and trains once in the rest of the training rounds.

There are several variations of cross-validation, including:

Suppose we predict the flow rate of oil at hundreds of wells. Wells are grouped based on their location: neighboring wells extract oil from the same oil field, so their production affects each other. For this case, a grouped K-fold is a reasonable choice. Finding a proper criterion for grouping samples while assigning them to folds is one of the key decisions for validation overall, and mistakes here greatly affect the result.
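For the oil-well scenario, a grouped K-fold might look like the following sketch, where the field identifiers are hypothetical stand-ins for a real grouping criterion:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 12 wells spread across 4 oil fields;
# the field id is the grouping variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.normal(size=12)
fields = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=fields):
    # No field appears in both train and test, so neighboring wells
    # never leak information across the split.
    assert set(fields[train_idx]).isdisjoint(fields[test_idx])
```

GroupKFold guarantees that all samples from one group land on the same side of each split, which is exactly the property needed when neighboring wells influence each other.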

7.2.3 The choice of K

The only question left is what number of folds to choose. The choice of K is dictated by three variables: bias, variance, and computation time. The rule of thumb is to use K = 5, which provides a good balance between bias and variance.

An extreme case for K is a leave-one-out cross-validation when each fold contains a single sample of data; thus K is equal to the overall number of samples in the dataset. This schema is the worst in terms of computation time and variance, but it’s the best in terms of bias.
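Leave-one-out is available in scikit-learn directly; this minimal sketch on a tiny synthetic dataset keeps the per-sample refits cheap:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny synthetic dataset, so 30 model fits stay cheap
X, y = make_regression(n_samples=30, n_features=5, random_state=0)

loo = LeaveOneOut()  # K equals the number of samples
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")
print(len(scores))  # one fit and one estimate per held-out sample
```

With a real dataset of thousands of samples and a slow model, the same loop becomes prohibitively expensive, which is why K = 5 is the usual compromise.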

There is a classic paper by Ron Kohavi from 1995 titled “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection” () that provides the following guidelines:

It is important to remember that the validation schema’s high sensitivity (i.e., low variance) only matters when the changes we try to catch in the model’s performance are small.

7.2.4 Time-series validation

When dealing with time-sensitive data, we can’t sample data randomly. Sales of products on neighboring days share some information with each other. Similarly, recent user actions provide a hint on some aspects of their later actions. But we can’t predict the past based on data from the future. In time series, the distribution of patterns is not uniform along the dataset, and we must figure out other kinds of validation schemas. How do we evaluate the model in this case?

Validation schemas used for time-series data are similar to the holdout set and cross-validation but with nonrandom splitting by timestamp. The recommendations for choosing the number of folds and their size in rolling cross-validation are similar to those for ordinary K-fold.
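scikit-learn ships a rolling time-based splitter, TimeSeriesSplit; this sketch assumes the rows are already ordered by timestamp:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows are assumed to be ordered by timestamp
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index: no peeking ahead
    assert train_idx.max() < test_idx.min()
    print(train_idx, test_idx)
```

Each successive fold extends the training window forward in time, so the model is always evaluated on "future" data relative to what it saw.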

Time-series validation adds several extra degrees of freedom that need to be considered. A great paper, “Evaluating Time Series Forecasting Models” by Cerqueira et al. (), elaborates on the following points:

While time-series validation is one of the most defect-sensitive validation methods, relying solely on the simple “don’t look at future data while training” rule would be far too shortsighted. Following this rule can save you from 95% of typical mistakes; still, there are cases where you may need to break it. For example, ML applied to financial data (such as stock market time series) is known for its high bar in precise validation requirements. At the same time, some experts in the area highlight that trivial time-series validation, as shown in figure 7.4, can lead to overfitting caused by limited data subsets (for more details, see “Backtesting Through Cross-Validation,” chapter 12 of Marcos Lopez de Prado’s Advances in Financial Machine Learning; Wiley). A similar reason to violate this rule may be rooted in your need to estimate how the model performs in anomaly scenarios. To get this signal, you can train the model on data from 2017 to 2019 and 2021 to 2023 and later test it on data from the COVID period of 2020. Such a split barely works as the default validation schema but still can be useful as auxiliary information.

figure
Figure 7.4 Standard time-based split. The test dataset always follows the train one, so train samples are “past” and test is “future.”

Sometimes you need to use a combination of different schemas. In the earlier example of flow rate prediction, we might combine grouped K-fold validation with time-series validation:

import numpy as np
from sklearn.model_selection import GroupKFold

def grouped_time_series_kfold(model, X, y, groups, n_splits=5):
    """Grouped K-fold with a temporal constraint. Rows are assumed
    to be ordered by timestamp; for each fold, we train only on
    samples that precede the earliest test sample."""
    scores = []
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=groups):
        # Keep only training samples from the "past" of the test window
        train_idx = train_idx[train_idx < test_idx.min()]
        if len(train_idx) == 0:
            continue  # no past data available for this fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

7.3 Nontrivial schemas

We’ve reviewed standard validation schemas that cover most ML applications. Sometimes they are not enough to reflect the actual difference between seen and unseen data, even if you use a combination of them (e.g., time-based validation with grouped K-fold). As you know, inadequate validation leads to data leakage and, consequently, overly optimistic (if not outright random) estimates of model performance.

Such situations require you to look for unorthodox processes. Let’s review some of them.

7.3.1 Nested validation

Nested validation is an approach used when we want to run hyperparameter optimization (or any other model selection procedure) as part of the learning process. We can’t just use an excluded fold or holdout set, which we will need for the final evaluation to estimate how good a given set of parameters is. Access to the score on the testing data while fitting any parameters is a direct way to overfitting.

Instead, we use a fold-in-fold schema. We add an “inner” split of training data in each “outer” split to tune the parameters first. Then we fit the model on all available training folds with selected hyperparameters and make a prediction for the data that was not seen during hyperparameter tuning. Thus, we get two layers of validation, each of which can have its specific properties (e.g., we may prefer the inner layer to have lower variance and the outer layer to have a lower bias). We can apply nesting not only to cross-validation but also to time-series validation and ordinal holdout split (or mixed schemas of different natures) (see figures 7.5 and 7.6).

figure
Figure 7.5 Example of nested cross-validation
figure
Figure 7.6 Example of nested validation with mixed schemas: holdout split for the outer loop and K-fold for the inner loop
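The fold-in-fold schema can be sketched with scikit-learn by placing a GridSearchCV (inner loop, hyperparameter tuning) inside a cross_val_score (outer loop, final evaluation); the model and parameter grid here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in classification data
X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter search runs on the training folds only
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1, 10]}, cv=3)

# Outer loop: unbiased evaluation of the whole tuning procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Because the outer test fold is never seen by the inner search, the outer scores estimate the performance of the entire "tune, then fit" pipeline rather than of one lucky hyperparameter setting.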

7.3.2 Adversarial validation

Instead of using a random subsample of data like in a standard holdout set, you may prefer to choose a different path. There is a technique called adversarial validation, a popular approach on ML competition platforms such as Kaggle. It applies an ML model for better validation of another ML model.

Adversarial validation numerically estimates whether two given datasets differ (e.g., sets of labeled and unlabeled data). If they do, it can quantify the difference at the sample level, making it possible to construct an arbitrary number of datasets that are representative of each other—a perfect tool for estimation. An additional bonus is that it does not require the data to be labeled.

The algorithm is simple:

  1. We combine datasets of interest (cutting off the target variable, if present), labeling the anchor dataset (the one we want to represent) as 1 and marking the rest as 0.
  2. We fit an auxiliary model on this concatenated dataset to solve the binary classification task (thus 0 and 1 marks).
  3. If datasets are representative of each other and come from the same distribution, we could expect receiver operating characteristic area under the curve (ROC AUC) to be near 0.5. If they are separable (e.g., ROC AUC is greater than 0.6), then we can use the output from the model as a measure of proximity.
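The three steps above can be sketched as follows; the two datasets here are synthetic, with an artificial distribution shift so that the classifier has something to detect:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical "anchor" and "other" samples with a shifted feature mean
X_anchor = rng.normal(0.0, 1.0, size=(500, 5))
X_other = rng.normal(0.5, 1.0, size=(500, 5))  # simulated distribution shift

# Step 1: concatenate and label the anchor dataset as 1, the rest as 0
X = np.vstack([X_anchor, X_other])
is_anchor = np.hstack([np.ones(500), np.zeros(500)])

# Step 2: fit an auxiliary binary classifier (out-of-fold predictions
# avoid scoring the model on its own training data)
proba = cross_val_predict(GradientBoostingClassifier(), X, is_anchor,
                          cv=5, method="predict_proba")[:, 1]

# Step 3: near 0.5 means indistinguishable; well above 0.6 means shifted
auc = roc_auc_score(is_anchor, proba)
print(auc)
```

When the AUC is high, the per-sample probabilities can be used as the proximity measure mentioned in step 3, e.g., to pick the "other" samples most similar to the anchor dataset.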

Note that while this trick has been used in ML competitions for a long time (the first mention we found dates back to 2016, ), it was not part of more formal research until 2020, when it appeared in the paper “Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber” by Pan et al. ().

We can use this kind of splitting in many cases. When we’re checking the similarity of labeled and unlabeled datasets, there are questions we should keep in mind. How different are their distributions? What features are the best predictors of this difference? Analyzing the model created by adversarial validation may answer these questions. We will also reuse this technique for a similar matter in chapter 9.

7.3.3 Quantifying dataset leakage exploitation

We find an interesting validation technique in a paper by DeepMind titled “Improving Language Models by Retrieving from Trillions of Tokens” (2021; ), which proposes a generative model trained on the next-word-prediction task.

The paper’s authors enhance the language model by conditioning it on a context retrieved from a large corpus based on local similarity with preceding tokens. This system memorizes the entire dataset and performs the nearest-neighbors search to find chunks of text in the history that are relevant to the recent sentences. But what if the sentences we try to continue are almost identical to those the model has seen in the training set? It looks like there is a high probability of encountering dataset leakage.

The authors discussed this problem in advance and proposed a noteworthy evaluation procedure. They developed a specific measure to quantify leakage exploitation.

The general idea is the following:

  1. Partition the dataset into training and validation sets as in the usual holdout validation.
  2. Split both into chunks of fixed length.
  3. For each chunk in the validation set, retrieve N nearest neighbors from the training set based on chunk embeddings (here we will omit how chunks are transformed into embedding space, but you can find the details in the paper).
  4. Calculate the ratio of tokens that are common in the two chunks (they use a score similar to the Jaccard Index); this gives us a score ranging from 0 (a chunk is totally different) to 1 (a chunk is a duplicate).
  5. If this score exceeds a certain threshold, filter out this chunk from the training set.
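A heavily simplified sketch of steps 4 and 5 is shown below. It replaces the paper's embedding-based nearest-neighbor retrieval with a brute-force comparison over all evaluation chunks and uses a plain Jaccard score over token sets, so treat it as an illustration of the filtering idea rather than the paper's exact procedure:

```python
def chunk_overlap(train_chunk: str, eval_chunk: str) -> float:
    """Jaccard-style overlap between the token sets of two chunks:
    0.0 means totally different, 1.0 means the same set of tokens."""
    a, b = set(train_chunk.split()), set(eval_chunk.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_leaky_chunks(train_chunks, eval_chunks, threshold=0.8):
    """Drop training chunks that overlap too much with any eval chunk."""
    return [t for t in train_chunks
            if max(chunk_overlap(t, e) for e in eval_chunks) <= threshold]

train = ["the cat sat on the mat", "stock prices fell sharply today"]
eval_set = ["the cat sat on a mat"]
print(filter_leaky_chunks(train, eval_set))  # the near-duplicate is removed
```

In the toy example, the first training chunk shares 5 of 6 tokens with the evaluation chunk (overlap above the 0.8 threshold) and gets filtered out, while the unrelated second chunk is kept.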

This approach forces the model to retrieve useful information from similar texts and paraphrase it instead of copy-pasting it. You can use this procedure with any modern language model. It is a good example of an exotic technique that allows for minimizing data leakage and increasing the representativity of your dataset. A clear understanding of how the model will be applied will help you develop your own nontrivial validation schema if standard approaches are unsuitable.

7.4 Split updating procedure

We spend as much time on the test data as on the training data.
— Andrej Karpathy

Regardless of which schema we use, we will probably apply it to a dynamically changing dataset. Periodically we get new data that may differ in distribution and include new patterns. How often should we update the test set to make sure our evaluation is always relevant?

There are at least two goals we may want to reach while designing a split update procedure for new data. First, we want our test set to be representative of these new patterns. From this point of view, the evaluation process should be adaptive.

Second, we want to see an evaluation dynamic: how has the model been changing through time with all updates in the architecture or features? For that, the estimates must be robust.

Some of the most common options are as follows (see figure 7.7):

figure
Figure 7.7 Common options for updating train/validation sets. The light data blocks are used for training, while the dark ones are used for validation.

Remember: we should perform validation on the whole pipeline, including the dataset; inference on the test set should be the same as in production. If we want to compare models side by side accurately, we should somehow save previous versions of both datasets and models. Tools for data version control and model version control (such as DVC, Git LFS, or Deep Lake) may be of help.

Once there are clues that the options here do not cover your particular use case, you may want to dive deeper into the literature dedicated to dynamic (nonstationary) data streams and concept drifts to get a holistic overview of related theory (e.g., “Scarcity of Labels in Non-Stationary Data Streams: A Survey” []). We will also touch on the surface of the concept drift problem in chapter 11 as one of the underlying reasons why setting up a reliable validation schema is not easy.

7.5 Design document: Choosing validation schemas

Time for another block of the design document, and this time we will fill in the information about preferred validation schemas for both Supermegaretail and PhotoStock Inc.

7.5.1 Validation schemas for Supermegaretail

We start with Supermegaretail.

7.5.2 Validation schemas for PhotoStock Inc.

We will now add the information on validation schemas for PhotoStock Inc.

Summary
