There’s an empirical heuristic to distinguish experienced machine learning (ML) engineers from newcomers: ask them to describe a working system’s training procedure in one sentence. Newcomers tend to focus on models, while somewhat experienced individuals include data processing. Mature engineers often describe the pipeline—a list of stages required to produce a trained ML model in the end. In this chapter, we will walk in ML engineers’ shoes to analyze these steps and discuss how to interconnect and orchestrate them.
Imagine a small pizza chain company. Its business has been a success in the local market, but the owners, who understand that software is eating the world (a phrase taken from Marc Andreessen's article) and that everything is going digital, know there's more pie to grab. So the company makes a digitalization bet before the COVID pandemic and hires several engineers to build mobile apps, a simple customer relationship management system, and multiple internal software systems. In other words, the company doesn't have the scale or appetite of a tech giant, but it follows major trends and knows how to invest in software now to make significant profits later. Suffice it to say that its app helped the company survive the pandemic in 2020.
Now, with the company following trends and the AI hype train going full steam ahead, it's no surprise that it hires Jane, a young and promising ML engineer with an undeniable interest in ML. After onboarding, the CTO delegates her first problem to solve: build an AI-powered assistant that will help pizzaiolos perform basic visual assessments, like listing the components on the pizza base and counting them for each order.
The software development lifecycle in this pizza company hasn’t included ML systems so far. Thus, engineering manager Alex asks Jane to prepare a model and a small code snippet showing how to run it; the internal systems team will handle the rest.
Fast-forward several months: Jane gathered a small dataset and trained a model, and everything looked fine during initial testing, so Alex's team wrapped it in a service. But right before deployment, the product manager brought in multiple new recipes that had been added to the menu and said the model should support those too. Nothing complicated was involved: just adding some more data, changing the labels map, and retraining the model. It sounds like something that shouldn't affect the deployment schedule much. However, after a discussion, Jane and Alex realized it would take several more months, even assuming the new dataset was readily available. What went wrong here? Jane performed all the required steps to train a model: manually validating datasets, applying numerous data processing and cleaning steps in the Jupyter Notebook environment, training the model with multiple interruptions, validating the result with the chef and customer happiness team on an ad hoc basis, uploading the trained model to the company's shared storage, and sending the link back to Alex.
NOTE She did all the right things but followed an ad hoc approach, with no real effort to make those steps reproducible in a single, transparent workflow.
With this example, we want to show that ML is not just about training a model but also about building a pipeline that allows for the preparation of the model and other artifacts in a reproducible way. In this chapter, we will discuss the steps in the pipeline, how to orchestrate them, and how to make them reproducible.
In the ML world, the term “pipeline” is used in many different contexts. Usually, people refer to a pipeline as a series of ordered steps and processes. Each step is a program that takes some input, performs some actions, and produces some output. The output of one step is the input for the next step. Speaking more formally, we can usually describe the pipeline as a directed acyclic graph (DAG) of steps.
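As a minimal illustration (the step names are ours, not from any particular framework), a linear pipeline can be modeled as an ordered chain of functions, each consuming the previous step's output:

```python
# Minimal sketch of a pipeline as a chain of steps (a linear DAG).
# Each step is a program: it takes an input, performs some actions,
# and produces an output that becomes the input of the next step.

def fetch_data(source):
    # In a real pipeline this would download data from a store.
    return [1, 2, 3, 4]

def preprocess(records):
    # Toy preprocessing: scale every record.
    return [r * 10 for r in records]

def train_model(records):
    # Toy "training": the model is just the mean of the data.
    return sum(records) / len(records)

def run_pipeline(source):
    steps = [fetch_data, preprocess, train_model]
    artifact = source
    for step in steps:
        artifact = step(artifact)
    return artifact

model = run_pipeline("s3://bucket/dataset")
print(model)  # 25.0
```

A real pipeline is rarely this linear (hence the DAG formalism), but the core contract is the same: each step's output is the next step's input.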
To make things more complicated, the model itself is usually a pipeline of another kind. For example, a simple logistic regression classifier is often enhanced with a feature scaling step, resulting in a pipeline of at least two steps. Usually, there’s also basic feature engineering (e.g., one-hot encoding for categorical variables), so even the simplest model has the properties of a pipeline. Other modalities like images, text, audio, etc., require additional preprocessing steps. For instance, a typical image classification model is a pipeline of image reading, normalization, resizing, and the model itself. If we switch to natural language processing, the pipeline almost always starts with text tokenization, etc. All in all, there’s a lot of space for ambiguity and confusion surrounding the term “pipeline” itself. To make things clearer, we will use the following terminology.
Training pipeline refers to a pipeline used to train a model. It’s a DAG of steps that takes the full dataset and, optionally, additional metadata as input and produces a trained model as output. It is a higher-level abstraction than the model itself (see figure 10.1).
Inference pipeline refers to a pipeline used to run a model in production or as part of a training pipeline (e.g., training a neural network with gradient descent requires numerous inference steps, each of which is a pipeline). It’s a DAG of steps that takes raw data as input and produces predictions as output. It is a lower-level abstraction than the training pipeline (see figure 10.2).
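To make the "model as a pipeline" idea from above concrete, here is a scikit-learn sketch (the data is a toy example of ours): even a simple logistic regression classifier is itself a two-step pipeline of feature scaling and the estimator.

```python
# A model that is itself a small pipeline: feature scaling followed by
# a logistic regression classifier, using scikit-learn's Pipeline API.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ("scaler", StandardScaler()),    # preprocessing step
    ("model", LogisticRegression()), # the estimator itself
])

# Toy, clearly separable data: class 0 has small x1 and large x2.
X = [[1.0, 200.0], [2.0, 300.0], [10.0, 50.0], [11.0, 60.0]]
y = [0, 0, 1, 1]
clf.fit(X, y)
print(clf.predict([[1.5, 250.0]]))  # [0]
```

Calling `fit` or `predict` on `clf` runs the whole chain, which is exactly why the model/pipeline distinction blurs so easily.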
In this chapter, we will focus on the training pipeline, while the inference pipeline will be discussed in chapter 15. At the highest level, a typical training pipeline includes the following steps:

- Data fetching
- Preprocessing
- Model training
- Evaluation and testing
- Postprocessing
- Artifact packaging (including reports)
Let’s briefly break down each of the steps.
Data fetching is the first step in the pipeline. It is responsible for downloading the data from the sources and making it available for the subsequent steps. As mentioned in chapter 6, we do not consider ourselves data engineering experts, so we will not discuss data fetching and storage in detail.
Preprocessing is usually the second step in the pipeline. It is a very generic term whose meaning depends on the task at hand. In general, preprocessing is a set of actions performed to prepare the data for model training. While we separate the training and inference pipelines for the sake of the book's structure, the distinction can be somewhat blurry in practice. For example, you can fully preprocess the raw dataset before training the model or, alternatively, make preprocessing part of a single model inference. In this context, we are discussing training-specific preprocessing. Feature selection is one example: we only do it before training and freeze the selected features for the subsequent steps.

Model training is the core of the training pipeline. It is a step that takes preprocessed data and produces a trained model; it is usually the longest and most complex step in the pipeline (especially in deep learning-based systems).
After the model has been trained, it can be evaluated and tested. These are different aspects aiming to answer the same question: how good is the model? Evaluation is a step of computing metrics, while tests are a set of checks performed to ensure that the model is working as expected.
Postprocessing is a step that is performed after evaluation and testing. It is a set of actions performed to prepare the model for deployment. Here, we can convert the model to a format supported by the target platforms, apply posttraining quantization or other optimizations if applicable, prepare tasks for human evaluation, and so on. It is worth noting that postprocessing and evaluation can be swapped. For example, we can evaluate the model before converting it to the target format, or alternatively, we can convert the model to the target format and then evaluate it using the deployment format.
Artifact packaging is the last step in the pipeline. It is responsible for packaging the model and other artifacts (e.g., config files with preprocessor parameters) into a format that can be easily deployed to production. The goal here is to simplify further deployment and separate the training and deployment pipelines. Ideally, the output should be as agnostic as possible to the training pipeline. For example, the model is exported to a universal format like ONNX for backend serving or CoreML for iOS serving, all config files are exported to a universal format like JSON, and any changes in the training pipeline should affect deployment as little as possible. Otherwise, the deployment pipeline will be tightly coupled with the training pipeline and will require many changes after each training pipeline update, becoming an obstacle for rapid model development and related experiments.
Reports are a special case of artifacts. It is a generic term that can refer to many things, including a basic table of validation/test metrics, various types of error analysis (recall chapter 9), additional visualizations, and other auxiliary information. While these artifacts are not directly used for deployment, they're crucial for declaring a training run successful; no responsible engineer should release a newly trained model without at least a quick look at the relevant reports. The only exception we have seen is a variation of the AutoML scenario in which many new models are trained automatically per user request. In this case, manual validation is not always possible, so engineers can only review suspicious outliers. We will touch on the topic of the release cycle in chapter 13.
Some of these artifacts are related to experiment tracking and reproducibility, and those are crucial for projects with multiple contributors or involved parties. When a researcher works alone on their own problem, they can track all their ideas and experiments using a simple notebook or a text file. However, when a team of researchers works on the same problem, they need a more structured way to track, compare, and reproduce their experiments. A centralized repository for all experiments is one tool that can help achieve this.
Tools and practices related to the training pipeline, along with the inference pipelines, deployment pipelines, and monitoring services in the context of the ML system design, are often attributed to ML operations (MLOps). Given that MLOps is a relatively new field, there are no well-established standards for platforms and tools. Some of them are relatively recognizable (MLflow, Kubeflow, BentoML, AWS Sagemaker, Google Vertex AI, Azure ML), some are gaining traction right now, and some are still in the early stages of development.
In this book, we don’t want to highlight any specific platform or tool, so we will not discuss them in detail. Given the pace of changes in the MLOps landscape, it’s very likely that our current understanding of the tools and platforms will be outdated by the time the book is published. Instead, we will focus on the principles and practices that are common to all platforms and tools. In the simplest case, you can implement a full training pipeline using generic non-ML tools—by creating a series of Python scripts connected with shell scripts, for example. However, this is rarely the case in practice: usually, there are some ML-specific tools that introduce abstractions and simplify the pipeline implementation. Most MLOps tools are “opinionated,” which means that using them strongly suggests a particular way of structuring your code. In the long run, this improves the training pipeline’s code consistency and makes it easier to maintain in the future (see chapter 16). Typical features required from the training pipeline platform are
Besides features, it is important to mention some nonfunctional requirements that practitioners need from platforms, including cost-effectiveness and data privacy (especially in such sensitive areas as healthcare or legal).
It’s important to emphasize here that features don’t have to be covered by the same platform. For example, you can use a generic platform for running the pipeline and a custom tool for experiment tracking because market solutions don’t satisfy custom requirements. Sometimes it’s just a matter of cost optimization. Arseny worked in a company that used a hybrid of two tools just because one provided many useful features and an overall nice developer experience, while the other was integrated with a cloud provider with the cheapest GPUs for training. In this situation, it was reasonable to spend some time on integration and save a lot of money on training costs.
Choosing proper tools is determined by the problem scale and the infrastructure of the company. FAANG-level companies usually have their own ML platforms and tools that can operate on a proper scale, while smaller companies typically prefer a set of open-source tools and cloud services. Every tool has its own adoption cost, so it’s important to choose the proper tool for the right problem. To the best of our knowledge, there is no one-size-fits-all solution, unlike with many other more mature software engineering problems (see figure 10.3).
Scalability can be a crucial property of the training pipeline for certain problems. If we're dealing with a dataset of thousands of samples of almost any kind, it's not a significant concern, as even a single machine can likely handle it. However, when it comes to huge datasets, the situation changes. What constitutes a huge dataset? It depends on the problem and the type of data (1 million tabular records are nothing compared to 1 million video clips), but if we had to pick a single criterion, "a dataset that doesn't fit into the RAM of a single machine" is a reasonable one.
The current size of a dataset should not be confused with the size of a dataset expected to be used in the future. We may face a cold-start problem (see chapter 6), and having even thousands of samples may be a significant advantage for the initial phase of system development. However, in the future, it can grow by several orders of magnitude, and if you want to use all the data, you need to be able to handle it.
While there is no silver bullet for training models on huge datasets, there are two classic software engineering approaches to scaling. Those are vertical and horizontal scaling (see figure 10.4).
Vertical scaling means upgrading your hardware or replacing the training machine with a more powerful node. The biggest advantage of this approach lies in its simplicity; adding more resources (especially if using cloud compute resources, which is often the case) is very easy. The drawback, however, is how limited vertical scaling is. Let’s say you doubled or even quadrupled machine RAM and upgraded the GPU to the latest generation. If that’s not enough, there isn’t much you can do within the vertical scaling approach.
Horizontal scaling involves splitting the load between multiple machines. The first level of horizontal scaling is using multiple GPUs in a single machine; in this case, an ML engineer often only needs to introduce small changes to the pipeline code, as most of the heavy lifting is already done by the training framework. However, this isn't true horizontal scaling, as we're still talking about a single machine. Genuine horizontal scaling involves using multiple machines and distributing the load among them. Nowadays, this type of scaling is often provided by frameworks as well, but it is more complex and usually requires more engineering effort to implement. DeepSpeed by Microsoft, Accelerate by Hugging Face, and Horovod, which originated at Uber, are some examples of such frameworks.
One ML-specific way of scaling is subsampling: if your dataset is very large, it can be reasonable to subsample it and reduce the required compute resources. The most straightforward kind of subsampling, applicable to most problems, is removing duplicated samples. However, there are more aggressive methods: downsampling near-duplicates (samples that are very similar based on a simple distance function, e.g., Levenshtein distance for strings) and downsampling based on an internal ID (e.g., no more than X samples per user) or an artificial ID (e.g., clustering the full dataset with a simple method and keeping no more than X samples per cluster).
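For example, ID-based downsampling can be sketched in a few lines (the data layout is our assumption):

```python
# Sketch of ID-based subsampling: keep at most `cap` samples per ID
# (e.g., per user). `samples` is assumed to be a list of
# (user_id, payload) pairs; a real pipeline would stream from storage.
from collections import defaultdict

def subsample_per_id(samples, cap):
    kept, seen = [], defaultdict(int)
    for user_id, payload in samples:
        if seen[user_id] < cap:
            kept.append((user_id, payload))
            seen[user_id] += 1
    return kept

data = [("u1", 1), ("u1", 2), ("u1", 3), ("u2", 4)]
print(subsample_per_id(data, cap=2))  # [('u1', 1), ('u1', 2), ('u2', 4)]
```

The same shape of code works for artificial IDs: replace `user_id` with a cluster label and you get the per-cluster cap described above.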
In ML pipelines, scaling often requires changing other pipeline parameters: for example, batch size is limited by GPU memory, larger batches lead to faster convergence, and the learning rate schedule depends on both batch size and expected convergence schedule. So the ability to alter the parameters is important.
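For instance, a widely used heuristic (the linear scaling rule) ties the learning rate to the batch size; a sketch:

```python
# Sketch of the linear scaling rule: when the batch size grows by a
# factor k, scale the learning rate by the same factor. This is a
# common heuristic, not a universal law; large jumps usually also
# require warmup and schedule adjustments.

def scaled_lr(base_lr, base_batch_size, batch_size):
    return base_lr * batch_size / base_batch_size

print(scaled_lr(0.1, 256, 1024))  # 0.4
```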
When ML engineers design the configurability of a training pipeline, there is a spectrum with two bad practices on each side: underconfiguration and overconfiguration.
Underconfiguration means that the pipeline is not configurable enough, making it difficult to change the model's architecture, dataset, preprocessing steps, etc. Values are hardcoded here and there in a convoluted way, and it's hard to understand how the pipeline works or alter even its simplest aspects. This is a typical problem in the early stages of ML development: while the pipeline is small and simple, it's easy to understand and change, so researchers without a software engineering background may see no need to introduce proper software abstractions and keep adding more and more code without structure. This antipattern is especially common among such researchers.
Overconfiguration is not ideal either. A typical ML pipeline has many hyperparameters related to dataset processing, model architecture, feature engineering, and the training process. In reality, it’s hard to predict all the possible use cases and parameters that can be changed, and inexperienced developers may try to cover all the possible cases and introduce as many abstractions as possible. At some point, these additional abstraction layers only increase complexity. Note that in this section, we use “parameters of the training pipeline” and “hyperparameters of the model” interchangeably. Just as a reminder, hyperparameters are those parameters of the model that are not learned during the training process but are set by the user.
In the example of overconfigured code in the following listing, we see how a multilevel hierarchy of subconfigs may complicate things.
def train():
    ...
    batch_size = 32
    ...
    learning_rate = 3e-4
    ...
    model.train(data, batch_size, loss_fn)
    ...
# ^ example of underconfigured code: things are too rigid

class Config(BaseConfig):
    def __init__(self):
        self.data_config = DataConfig()
        self.model_config = ModelConfig()
        self.training_config = TrainingConfig()
        self.inference_config = InferenceConfig()
        self.environment_config = EnvironmentConfig()

class DataConfig(BaseConfig):
    def __init__(self):
        self.train_data_config = TrainDataConfig()
        self.validation_data_config = ValidationDataConfig()
        self.test_data_config = TestDataConfig()

config = Config(
    data_config=DataConfig(
        train_data_config=TrainDataConfig(...),
        validation_data_config=ValidationDataConfig(...),
        test_data_config=TestDataConfig(...),
    ),
    ...
)
# ^ example of overconfigured code: a multilevel hierarchy of subconfigs
To find a good balance between the two extremes, you should estimate the probability of various parameters being changed. For example, it's practically certain that datasets will be updated, and it's not very likely that activation functions inside the model will be changed. So it's reasonable to make datasets configurable while ignoring activation functions. Changeable parameters differ from pipeline to pipeline, so the only way to find a good balance is to consider which potential experiments you would deem low-hanging fruit in the next few months. One helpful guide we recommend for deep learning-based pipelines is provided by the Google team.
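One possible middle ground is a flat, typed config that exposes only the parameters likely to change. The following sketch (the parameter names and defaults are our assumptions) uses a Python dataclass:

```python
# Sketch of a middle-ground config: flat, typed, and limited to the
# parameters that are actually likely to change (datasets, basic
# training knobs), with sensible defaults for everything else.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    dataset_path: str = "data/train.csv"
    batch_size: int = 32
    learning_rate: float = 3e-4
    num_epochs: int = 10
    # Activation functions, layer counts, etc. stay hardcoded in the
    # model code until an experiment actually needs to vary them.

config = TrainConfig(dataset_path="data/v2.csv", num_epochs=20)
print(config.batch_size, config.num_epochs)  # 32 20
```

Compared with the nested-subconfig hierarchy, everything tunable is visible in one place, and the type hints document what each knob is.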
After preliminarily deciding which hyperparameters are tunable, it’s important to determine the tuning strategy. When computational resources are limited, handcrafted experiments are preferable. When resources are abundant, it makes sense to apply an automated hyperparameter tuning method, such as a straightforward random search or a more advanced Bayesian optimization. Tools for hyperparameter tuning (e.g., Hyperopt, Optuna, and scikit-optimize) can be part of the ML platform and may dictate how the configuration files should look.
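As a sketch of the simplest automated strategy, here is a random search loop; the `objective` function is a toy stand-in for "train a model and return validation loss":

```python
# Sketch of a basic random search over hyperparameters. In a real
# pipeline, `objective` would train a model with the given parameters
# and return a validation metric; here it is a toy function.
import random

def objective(params):
    # Toy loss with a minimum near lr=0.1, batch_size=64.
    return (params["lr"] - 0.1) ** 2 + (params["batch_size"] - 64) ** 2 / 1e4

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, 0),          # log-uniform learning rate
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, loss = random_search(n_trials=100)
print(best)
```

Libraries like Optuna implement far smarter versions of this loop (pruning, Bayesian sampling), but the overall shape of the search stays the same.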
From our experience, extensive hyperparameter tuning is more applicable for small datasets, where it’s possible to run numerous experiments in a reasonable time. When a single experiment takes weeks, it’s more practical to rely on the intuition of the ML engineer and run a few experiments manually. It is worth noting that experiments with smaller datasets may help build this intuition, although not every conclusion may generalize to a large training run.
It is important to find a proper way to configure the training pipeline (see figure 10.5).
The most typical approach is to dedicate a single file (often written in a configuration language like YAML or TOML) that contains all the changeable values. Another popular option is using libraries like Hydra. One antipattern we have seen is a config spread across the training pipeline files, with the same parameter specified in multiple files at various priority levels (e.g., batch size is read from file X, but if not specified there, it is fetched from file Y). This is error-prone at the experimentation stage, especially when experiments are performed by less experienced engineers who are unfamiliar with this particular pipeline.
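A minimal sketch of the single-file approach, using JSON from the Python standard library for illustration (real projects often prefer YAML or TOML):

```python
# Sketch of a single-source-of-truth config file. JSON is used here
# because it ships with the standard library; YAML or TOML are more
# common in practice. The config content is an illustrative assumption.
import json
from pathlib import Path

CONFIG_TEXT = """
{
  "dataset_path": "data/train.csv",
  "batch_size": 32,
  "learning_rate": 3e-4
}
"""

def load_config(path):
    # Every parameter lives in exactly one file: no fallback chains,
    # no values silently overridden by another file.
    return json.loads(Path(path).read_text())

Path("config.json").write_text(CONFIG_TEXT)
config = load_config("config.json")
print(config["batch_size"])  # 32
```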
One common problem we often see in ML pipelines is the lack of tests. It’s not surprising, as testing ML pipelines is not an easy task. When building a regular software system, we can test it by running it with some input and checking the output. However, running a training pipeline may take days, and obviously we can’t run it again after every change we implement. Another problem, as mentioned earlier, is that ML pipelines are often not configurable enough, making them difficult to test in isolation. Finally, given the number of possible hyperparameters, it’s nearly impossible to test all the possible combinations in a reasonable time. Simply put, introducing tests to ML pipelines is a challenging task. But it’s worth doing!
Proper tests serve three purposes:
Our suggestion for testing ML pipelines is to use a combination of high-level smoke tests for the whole pipeline and low-level unit tests for at least its most important individual components.
A smoke test should be as fast as possible, so run it on a small subset of the dataset, for a small number of epochs, and perhaps with a reduced version of the model. It should check that the pipeline runs without errors and produces reasonable output; for example, it can ensure that the loss decreases on this toy dataset. The following listing shows a simplified example of a smoke test for a training pipeline.
from unittest.mock import patch, Mock

import torch

from training_pipeline import train, get_config

class DummyResnet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Conv2d(3, 2048, 1),
        )

    def forward(self, x):
        return self.model(x).squeeze(-1).squeeze(-1)

def test_train_pipeline():
    config = get_config()
    config["dataset_path"] = "/path/to/fixture"
    config["num_epochs"] = 1
    mock = Mock(wraps=DummyResnet)
    with patch('training_pipeline.models.Resnet', mock):
        result = train(config)
    assert mock.call_count == 1
    assert result['train_loss'] < .5
    assert result['val_loss'] < 1

Smoke tests like this significantly increase iteration speed, thus simplifying experimentation and debugging. However, there is a downside: like any integration tests, they require considerable maintenance effort of their own, because almost any significant pipeline change may affect them.

Lower-level unit tests should cover individual components of the pipeline. It's not uncommon to have only a few of them, or even none at all, and there's no reason to be ashamed if you don't have them. However, we recommend covering at least the most sensitive components. An example of such a sensitive component is the final model conversion: imagine the model is trained with PyTorch and is later supposed to be deployed to iOS (and run with CoreML) and to the backend (and run with ONNX). It's important to make sure the model is converted properly and that the conversion process doesn't introduce any changes; the converted models should produce the same results as the original model.
Another group of tests is applicable to the trained model, inspired by the property-based testing approach. Property-based testing is a software testing approach that involves generating random inputs for a function or a system and then verifying that certain properties or invariants hold true for all the inputs. Instead of writing specific test cases with predetermined inputs and expected outputs, property-based testing focuses on defining the general properties that the system should satisfy and then automatically generates test cases to validate those properties.
In the context of an ML project, property-based testing can be used to ensure that the final trained model behaves as expected and satisfies certain properties. The following are some examples of properties that can be tested in an ML project:
Monotonicity is often expected in various price prediction models. For example, the price of a house should increase with its square footage if the rest of the features are fixed.
Invariance to expected transformations is another testable property: the model's output should remain the same when the input is transformed in a way that preserves its meaning, i.e., f(g(x)) = f(x), where g is an expected transformation. This could be a rotation or scaling for images, changing an entity to its synonym for natural language processing, altering the volume of a sound for audio, and so on.
We already covered a very similar concept in section 5.2.1. The difference is that in one case, we expect some variation in the results (and we want to measure it), while in the other case, we expect strict consistency (and thus we want to assert it). Using some data samples as fixtures and writing property-based tests for them is a good way to ensure that the model behaves as expected and maintains its reliability.
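As a sketch, here is what a monotonicity property test could look like; `predict_price` is a stand-in for a real trained model:

```python
# Sketch of a property-based test for monotonicity: for random inputs,
# a house-price model's prediction must not decrease when square
# footage increases while all other features stay fixed.
# `predict_price` is a hypothetical stand-in for a real trained model.
import random

def predict_price(sqft, bedrooms):
    # Toy monotonic model used only to make the sketch runnable.
    return 50_000 + 300 * sqft + 10_000 * bedrooms

def test_price_monotonic_in_sqft():
    rng = random.Random(42)
    for _ in range(100):
        sqft = rng.uniform(300, 5000)
        bedrooms = rng.randint(1, 6)
        delta = rng.uniform(1, 500)
        assert predict_price(sqft + delta, bedrooms) >= \
               predict_price(sqft, bedrooms)

test_price_monotonic_in_sqft()
```

Libraries like Hypothesis automate the input generation and shrinking of failing cases, but even a hand-rolled loop like this catches regressions that a handful of fixed test cases would miss.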
An exact list of tests is not usually included in the design document; however, since the design document often serves as a reference for implementation, we recommend thinking about the tests in advance and mentioning them in it.
NOTE If you're interested in ML testing, we recommend Arseny's slides with a deeper review of the topic.
As we continue our work on two separate design documents for our imaginary businesses, it’s time to cover training pipelines for Supermegaretail and PhotoStock Inc.
Let’s see how a potential pipeline could look for Supermegaretail.
Now we go back to PhotoStock Inc., where we are required to build a smart in-house search engine to improve the quality of search results.