There’s an empirical heuristic to distinguish experienced machine learning (ML) engineers from newcomers: ask them to describe a working system’s training procedure in one sentence. Newcomers tend to focus on models, while somewhat experienced individuals include data processing. Mature engineers often describe the pipeline—a list of stages required to produce a trained ML model in the end. In this chapter, we will walk in ML engineers’ shoes to analyze these steps and discuss how to interconnect and orchestrate them.
Imagine a small pizza chain company. Its business has been a success in the local market, but the owners, who understand that software is eating the world (a phrase taken from Marc Andreessen's article) and that everything is going digital, know there's more pie to grab. So the company makes a digitalization bet before the COVID pandemic and hires several engineers to build mobile apps, a simple customer relationship management system, and multiple internal software systems. In other words, the company doesn't have the scale or appetite of a tech giant, but it follows major trends and knows how to invest in software now to make significant profits later. Suffice it to say that its app helped the company survive the pandemic in 2020.
Now, with the company following trends and the AI hype train going full steam ahead, it's no surprise that it hires Jane, a young and promising ML engineer with an undeniable interest in ML. After onboarding, the CTO delegates her first problem to solve: build an AI-powered assistant that will help pizzaiolos perform basic visual assessments, like listing the components on the pizza base and counting them for each order.
The software development lifecycle in this pizza company hasn’t included ML systems so far. Thus, engineering manager Alex asks Jane to prepare a model and a small code snippet showing how to run it; the internal systems team will handle the rest.
Fast-forward several months: Jane gathered a small dataset and trained a model, and everything looked fine during initial testing, so Alex's team wrapped it in a service. But right before deployment, the product manager brought in multiple new recipes that had been added to the menu and said the model should support those too. Nothing complicated was involved: just adding some more data, changing the labels map, and retraining the model. It sounds like something that shouldn't affect the deployment schedule much. However, after a discussion, Jane and Alex realized it would take several more months, even assuming the new dataset was readily available. What went wrong here? Jane performed all the required steps to train a model: manually validating datasets, applying numerous data processing and cleaning steps in the Jupyter Notebook environment, training the model with multiple interruptions, validating the result with the chef and customer happiness team on an ad hoc basis, uploading the trained model to the company's shared storage, and sending the link back to Alex.
NOTE She did all the right things but followed an ad hoc approach, with no real effort to make those steps reproducible in a single, transparent workflow.
With this example, we want to show that ML is not just about training a model but also about building a pipeline that allows for the preparation of the model and other artifacts in a reproducible way. In this chapter, we will discuss the steps in the pipeline, how to orchestrate them, and how to make them reproducible.
In the ML world, the term “pipeline” is used in many different contexts. Usually, people refer to a pipeline as a series of ordered steps and processes. Each step is a program that takes some input, performs some actions, and produces some output. The output of one step is the input for the next step. Speaking more formally, we can usually describe the pipeline as a directed acyclic graph (DAG) of steps.
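As a minimal illustration (the step names are ours, not from any particular framework), a linear pipeline can be modeled as an ordered chain of functions, each consuming the previous step's output:

```python
# Minimal sketch of a pipeline as a chain of steps (a linear DAG).
# Each step is a program: it takes an input, performs some actions,
# and produces an output that becomes the input of the next step.

def fetch_data(source):
    # In a real pipeline this would download data from a store.
    return [1, 2, 3, 4]

def preprocess(records):
    # Toy preprocessing: scale every record.
    return [r * 10 for r in records]

def train_model(records):
    # Toy "training": the model is just the mean of the data.
    return sum(records) / len(records)

def run_pipeline(source):
    steps = [fetch_data, preprocess, train_model]
    artifact = source
    for step in steps:
        artifact = step(artifact)
    return artifact

model = run_pipeline("s3://bucket/dataset")
print(model)  # 25.0
```

A real pipeline is rarely this linear (hence the DAG formalism), but the core contract is the same: each step's output is the next step's input.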
To make things more complicated, the model itself is usually a pipeline of another kind. For example, a simple logistic regression classifier is often enhanced with a feature scaling step, resulting in a pipeline of at least two steps. Usually, there’s also basic feature engineering (e.g., one-hot encoding for categorical variables), so even the simplest model has the properties of a pipeline. Other modalities like images, text, audio, etc., require additional preprocessing steps. For instance, a typical image classification model is a pipeline of image reading, normalization, resizing, and the model itself. If we switch to natural language processing, the pipeline almost always starts with text tokenization, etc. All in all, there’s a lot of space for ambiguity and confusion surrounding the term “pipeline” itself. To make things clearer, we will use the following terminology.
Training pipeline refers to a pipeline used to train a model. It’s a DAG of steps that takes the full dataset and, optionally, additional metadata as input and produces a trained model as output. It is a higher-level abstraction than the model itself (see figure 10.1).
Inference pipeline refers to a pipeline used to run a model in production or as part of a training pipeline (e.g., training a neural network with gradient descent requires numerous inference steps, each of which is a pipeline). It’s a DAG of steps that takes raw data as input and produces predictions as output. It is a lower-level abstraction than the training pipeline (see figure 10.2).
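To make the "model as a pipeline" idea from above concrete, here is a scikit-learn sketch (the data is a toy example of ours): even a simple logistic regression classifier is itself a two-step pipeline of feature scaling and the estimator.

```python
# A model that is itself a small pipeline: feature scaling followed by
# a logistic regression classifier, using scikit-learn's Pipeline API.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ("scaler", StandardScaler()),    # preprocessing step
    ("model", LogisticRegression()), # the estimator itself
])

# Toy, clearly separable data: class 0 has small x1 and large x2.
X = [[1.0, 200.0], [2.0, 300.0], [10.0, 50.0], [11.0, 60.0]]
y = [0, 0, 1, 1]
clf.fit(X, y)
print(clf.predict([[1.5, 250.0]]))  # [0]
```

Calling `fit` or `predict` on `clf` runs the whole chain, which is exactly why the model/pipeline distinction blurs so easily.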
In this chapter, we will focus on the training pipeline, while the inference pipeline will be discussed in chapter 15. At the highest level, a typical training pipeline includes the following steps:

- Data fetching
- Preprocessing
- Model training
- Evaluation and testing
- Postprocessing
- Artifact packaging (including reports)
Let’s briefly break down each of the steps.
Data fetching is the first step in the pipeline. It is responsible for downloading the data from the sources and making it available for the subsequent steps. As mentioned in chapter 6, we do not consider ourselves data engineering experts, so we will not discuss data fetching and storage in detail.
Preprocessing is usually the second step in the pipeline. It is a very generic term whose meaning depends on the task at hand. In general, preprocessing is a set of actions performed to prepare the data for model training. While we separate the training and inference pipelines for the sake of the book's structure, the distinction can be somewhat blurry in practice. For example, you can fully preprocess the raw dataset before training the model or, alternatively, make preprocessing part of a single model inference. In this context, we are discussing training-specific preprocessing. Feature selection is one example: we only do it before training and freeze the selected features for the subsequent steps.

Model training is the core of the training pipeline. It is a step that takes preprocessed data and produces a trained model; it is usually the longest and most complex step in the pipeline (especially in deep learning-based systems).
After the model has been trained, it can be evaluated and tested. These are different aspects aiming to answer the same question: how good is the model? Evaluation is a step of computing metrics, while tests are a set of checks performed to ensure that the model is working as expected.
Postprocessing is a step that is performed after evaluation and testing. It is a set of actions performed to prepare the model for deployment. Here, we can convert the model to a format supported by the target platforms, apply posttraining quantization or other optimizations if applicable, prepare tasks for human evaluation, and so on. It is worth noting that postprocessing and evaluation can be swapped. For example, we can evaluate the model before converting it to the target format, or alternatively, we can convert the model to the target format and then evaluate it using the deployment format.
Artifact packaging is the last step in the pipeline. It is responsible for packaging the model and other artifacts (e.g., config files with preprocessor parameters) into a format that can be easily deployed to production. The goal here is to simplify further deployment and separate the training and deployment pipelines. Ideally, the output should be as agnostic as possible to the training pipeline. For example, the model is exported to a universal format like ONNX for backend serving or CoreML for iOS serving, all config files are exported to a universal format like JSON, and any changes in the training pipeline should affect deployment as little as possible. Otherwise, the deployment pipeline will be tightly coupled with the training pipeline and will require many changes after each training pipeline update, becoming an obstacle for rapid model development and related experiments.
Reports are a special case of artifacts. It is a generic term that can refer to many things, including a basic table of validation/test metrics, various types of error analysis (recall chapter 9), additional visualizations, and other auxiliary information. While these artifacts are not directly used for deployment, they're crucial for declaring a training run successful; no responsible engineer should release a newly trained model without at least a quick look at the relevant reports. The only exception we have seen is a variation of the AutoML scenario in which many new models are trained automatically per user request. In this case, manual validation is not always possible, so engineers can only review suspicious outliers. We will touch on the topic of the release cycle in chapter 13.
Some of these artifacts are related to experiment tracking and reproducibility, and those are crucial for projects with multiple contributors or involved parties. When a researcher works alone on their own problem, they can track all their ideas and experiments using a simple notebook or a text file. However, when a team of researchers works on the same problem, they need a more structured way to track, compare, and reproduce their experiments. A centralized repository for all experiments is one tool that can help achieve this.
Tools and practices related to the training pipeline, along with the inference pipelines, deployment pipelines, and monitoring services in the context of the ML system design, are often attributed to ML operations (MLOps). Given that MLOps is a relatively new field, there are no well-established standards for platforms and tools. Some of them are relatively recognizable (MLflow, Kubeflow, BentoML, AWS Sagemaker, Google Vertex AI, Azure ML), some are gaining traction right now, and some are still in the early stages of development.
In this book, we don’t want to highlight any specific platform or tool, so we will not discuss them in detail. Given the pace of changes in the MLOps landscape, it’s very likely that our current understanding of the tools and platforms will be outdated by the time the book is published. Instead, we will focus on the principles and practices that are common to all platforms and tools. In the simplest case, you can implement a full training pipeline using generic non-ML tools—by creating a series of Python scripts connected with shell scripts, for example. However, this is rarely the case in practice: usually, there are some ML-specific tools that introduce abstractions and simplify the pipeline implementation. Most MLOps tools are “opinionated,” which means that using them strongly suggests a particular way of structuring your code. In the long run, this improves the training pipeline’s code consistency and makes it easier to maintain in the future (see chapter 16). Typical features required from the training pipeline platform are
Besides features, it is important to mention some nonfunctional requirements that practitioners need from platforms, including cost-effectiveness and data privacy (especially in such sensitive areas as healthcare or legal).
It’s important to emphasize here that features don’t have to be covered by the same platform. For example, you can use a generic platform for running the pipeline and a custom tool for experiment tracking because market solutions don’t satisfy custom requirements. Sometimes it’s just a matter of cost optimization. Arseny worked in a company that used a hybrid of two tools just because one provided many useful features and an overall nice developer experience, while the other was integrated with a cloud provider with the cheapest GPUs for training. In this situation, it was reasonable to spend some time on integration and save a lot of money on training costs.
Choosing proper tools is determined by the problem scale and the infrastructure of the company. FAANG-level companies usually have their own ML platforms and tools that can operate on a proper scale, while smaller companies typically prefer a set of open-source tools and cloud services. Every tool has its own adoption cost, so it’s important to choose the proper tool for the right problem. To the best of our knowledge, there is no one-size-fits-all solution, unlike with many other more mature software engineering problems (see figure 10.3).
Scalability can be a crucial property of the training pipeline for certain problems. If we're dealing with a dataset of thousands of samples of almost any kind, it's not a significant concern, as even a single machine can likely handle it. However, when it comes to huge datasets, the situation changes. What constitutes a huge dataset? It depends on the problem and the type of data (1 million tabular records are nothing compared to 1 million video clips), but if we had to pick a single criterion, "a dataset that doesn't fit into the RAM of a single machine" is a reasonable one.
The current size of a dataset should not be confused with the size of a dataset expected to be used in the future. We may face a cold-start problem (see chapter 6), and having even thousands of samples may be a significant advantage for the initial phase of system development. However, in the future, it can grow by several orders of magnitude, and if you want to use all the data, you need to be able to handle it.
While there is no silver bullet for training models on huge datasets, there are two classic software engineering approaches to scaling. Those are vertical and horizontal scaling (see figure 10.4).
Vertical scaling means upgrading your hardware or replacing the training machine with a more powerful node. The biggest advantage of this approach lies in its simplicity; adding more resources (especially if using cloud compute resources, which is often the case) is very easy. The drawback, however, is how limited vertical scaling is. Let’s say you doubled or even quadrupled machine RAM and upgraded the GPU to the latest generation. If that’s not enough, there isn’t much you can do within the vertical scaling approach.
Horizontal scaling involves splitting the load between multiple machines. The first level of horizontal scaling is using multiple GPUs in a single machine; in this case, an ML engineer often only needs to introduce small changes to the pipeline code, as most of the heavy lifting is already done by the training framework. However, this isn't true horizontal scaling, as we're still talking about a single machine. Genuine horizontal scaling involves using multiple machines and distributing the load among them. Nowadays, this type of scaling is often provided by frameworks as well, but it is more complex and usually requires more engineering effort to implement. DeepSpeed by Microsoft, Accelerate by Hugging Face, and Horovod, which originated at Uber, are some examples of such frameworks.
One ML-specific way of scaling is subsampling: if your dataset is very large, it can be reasonable to subsample it and reduce the required compute resources. The most straightforward kind of subsampling, applicable to most problems, is removing duplicated samples. However, there are more aggressive methods: downsampling near-duplicates (samples that are very similar based on a simple distance function, e.g., Levenshtein distance for strings) and downsampling based on an internal ID (e.g., no more than X samples per user) or an artificial ID (e.g., clustering the full dataset with a simple method and keeping no more than X samples per cluster).
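For example, ID-based downsampling can be sketched in a few lines (the data layout is our assumption):

```python
# Sketch of ID-based subsampling: keep at most `cap` samples per ID
# (e.g., per user). `samples` is assumed to be a list of
# (user_id, payload) pairs; a real pipeline would stream from storage.
from collections import defaultdict

def subsample_per_id(samples, cap):
    kept, seen = [], defaultdict(int)
    for user_id, payload in samples:
        if seen[user_id] < cap:
            kept.append((user_id, payload))
            seen[user_id] += 1
    return kept

data = [("u1", 1), ("u1", 2), ("u1", 3), ("u2", 4)]
print(subsample_per_id(data, cap=2))  # [('u1', 1), ('u1', 2), ('u2', 4)]
```

The same shape of code works for artificial IDs: replace `user_id` with a cluster label and you get the per-cluster cap described above.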
In ML pipelines, scaling often requires changing other pipeline parameters: for example, batch size is limited by GPU memory, larger batches lead to faster convergence, and the learning rate schedule depends on both batch size and expected convergence schedule. So the ability to alter the parameters is important.
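For instance, a widely used heuristic (the linear scaling rule) ties the learning rate to the batch size; a sketch:

```python
# Sketch of the linear scaling rule: when the batch size grows by a
# factor k, scale the learning rate by the same factor. This is a
# common heuristic, not a universal law; large jumps usually also
# require warmup and schedule adjustments.

def scaled_lr(base_lr, base_batch_size, batch_size):
    return base_lr * batch_size / base_batch_size

print(scaled_lr(0.1, 256, 1024))  # 0.4
```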
When ML engineers design the configurability of a training pipeline, there is a spectrum with two bad practices on each side: underconfiguration and overconfiguration.
Underconfiguration means that the pipeline is not configurable enough, making it difficult to change the model's architecture, dataset, preprocessing steps, etc. Values are hardcoded here and there in a convoluted way, and it's hard to understand how the pipeline works or alter even its simplest aspects. This is a typical problem in the early stages of ML development: while the pipeline is small and simple, it's easy to understand and change, so researchers without a software engineering background may see no need to introduce proper software abstractions and keep adding more and more code without structure. This antipattern is especially common among such researchers.
Overconfiguration is not ideal either. A typical ML pipeline has many hyperparameters related to dataset processing, model architecture, feature engineering, and the training process. In reality, it’s hard to predict all the possible use cases and parameters that can be changed, and inexperienced developers may try to cover all the possible cases and introduce as many abstractions as possible. At some point, these additional abstraction layers only increase complexity. Note that in this section, we use “parameters of the training pipeline” and “hyperparameters of the model” interchangeably. Just as a reminder, hyperparameters are those parameters of the model that are not learned during the training process but are set by the user.
In the example of overconfigured code in the following listing, we see how a multilevel hierarchy of subconfigs may complicate things.
def train():
    ...
    batch_size = 32
    ...
    learning_rate = 3e-4
    ...
    model.train(data, batch_size, loss_fn)
    ...
# ^ example of underconfigured code: things are too rigid

class Config(BaseConfig):
    def __init__(self):
        self.data_config = DataConfig()
        self.model_config = ModelConfig()
        self.training_config = TrainingConfig()
        self.inference_config = InferenceConfig()
        self.environment_config = EnvironmentConfig()

class DataConfig(BaseConfig):
    def __init__(self):
        self.train_data_config = TrainDataConfig()
        self.validation_data_config = ValidationDataConfig()
        self.test_data_config = TestDataConfig()

config = Config(
    data_config=DataConfig(
        train_data_config=TrainDataConfig(...),
        validation_data_config=ValidationDataConfig(...),
        test_data_config=TestDataConfig(...),
    ),
    ...
)
# ^ example of overconfigured code: a multilevel hierarchy of subconfigs
To find a good balance between the two extremes, you should estimate the probability of various parameters being changed. For example, it's practically certain that datasets will be updated, and it's not very likely that activation functions inside the model will be changed. So it's reasonable to make datasets configurable while ignoring activation functions. Changeable parameters differ from pipeline to pipeline, so the only way to find a good balance is to consider which potential experiments you would deem low-hanging fruit in the next few months. One helpful guide we recommend for deep learning-based pipelines is provided by the Google team.
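One possible middle ground is a flat, typed config that exposes only the parameters likely to change. The following sketch (the parameter names and defaults are our assumptions) uses a Python dataclass:

```python
# Sketch of a middle-ground config: flat, typed, and limited to the
# parameters that are actually likely to change (datasets, basic
# training knobs), with sensible defaults for everything else.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    dataset_path: str = "data/train.csv"
    batch_size: int = 32
    learning_rate: float = 3e-4
    num_epochs: int = 10
    # Activation functions, layer counts, etc. stay hardcoded in the
    # model code until an experiment actually needs to vary them.

config = TrainConfig(dataset_path="data/v2.csv", num_epochs=20)
print(config.batch_size, config.num_epochs)  # 32 20
```

Compared with the nested-subconfig hierarchy, everything tunable is visible in one place, and the type hints document what each knob is.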
After preliminarily deciding which hyperparameters are tunable, it’s important to determine the tuning strategy. When computational resources are limited, handcrafted experiments are preferable. When resources are abundant, it makes sense to apply an automated hyperparameter tuning method, such as a straightforward random search or a more advanced Bayesian optimization. Tools for hyperparameter tuning (e.g., Hyperopt, Optuna, and scikit-optimize) can be part of the ML platform and may dictate how the configuration files should look.
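As a sketch of the simplest automated strategy, here is a random search loop; the `objective` function is a toy stand-in for "train a model and return validation loss":

```python
# Sketch of a basic random search over hyperparameters. In a real
# pipeline, `objective` would train a model with the given parameters
# and return a validation metric; here it is a toy function.
import random

def objective(params):
    # Toy loss with a minimum near lr=0.1, batch_size=64.
    return (params["lr"] - 0.1) ** 2 + (params["batch_size"] - 64) ** 2 / 1e4

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, 0),          # log-uniform learning rate
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, loss = random_search(n_trials=100)
print(best)
```

Libraries like Optuna implement far smarter versions of this loop (pruning, Bayesian sampling), but the overall shape of the search stays the same.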
From our experience, extensive hyperparameter tuning is more applicable for small datasets, where it’s possible to run numerous experiments in a reasonable time. When a single experiment takes weeks, it’s more practical to rely on the intuition of the ML engineer and run a few experiments manually. It is worth noting that experiments with smaller datasets may help build this intuition, although not every conclusion may generalize to a large training run.
It is important to find a proper way to configure the training pipeline (see figure 10.5).
The most typical approach is to dedicate a single file (often written in a configuration language like YAML or TOML) that contains all the changeable values. Another popular option is using libraries like Hydra. One antipattern we have seen is a config spread across the training pipeline files, with the same parameter specified in multiple files at various priority levels (e.g., batch size is read from file X, but if not specified there, it is fetched from file Y). This is error-prone at the experimentation stage, especially when experiments are performed by less experienced engineers who are unfamiliar with this particular pipeline.
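A minimal sketch of the single-file approach, using JSON from the Python standard library for illustration (real projects often prefer YAML or TOML):

```python
# Sketch of a single-source-of-truth config file. JSON is used here
# because it ships with the standard library; YAML or TOML are more
# common in practice. The config content is an illustrative assumption.
import json
from pathlib import Path

CONFIG_TEXT = """
{
  "dataset_path": "data/train.csv",
  "batch_size": 32,
  "learning_rate": 3e-4
}
"""

def load_config(path):
    # Every parameter lives in exactly one file: no fallback chains,
    # no values silently overridden by another file.
    return json.loads(Path(path).read_text())

Path("config.json").write_text(CONFIG_TEXT)
config = load_config("config.json")
print(config["batch_size"])  # 32
```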
One common problem we often see in ML pipelines is the lack of tests. It’s not surprising, as testing ML pipelines is not an easy task. When building a regular software system, we can test it by running it with some input and checking the output. However, running a training pipeline may take days, and obviously we can’t run it again after every change we implement. Another problem, as mentioned earlier, is that ML pipelines are often not configurable enough, making them difficult to test in isolation. Finally, given the number of possible hyperparameters, it’s nearly impossible to test all the possible combinations in a reasonable time. Simply put, introducing tests to ML pipelines is a challenging task. But it’s worth doing!
Proper tests serve three purposes:
Our suggestion for testing ML pipelines is to use a combination of high-level smoke tests for the whole pipeline and low-level unit tests for at least its most important individual components.
A smoke test should be as fast as possible, so run it on a small subset of the dataset, for a small number of epochs, and perhaps with a reduced version of the model. It should check that the pipeline runs without errors and produces reasonable output; for example, it can ensure that the loss decreases on this toy dataset. The following listing shows a simplified example of a smoke test for a training pipeline.
from unittest.mock import patch, Mock

import torch

from training_pipeline import train, get_config

class DummyResnet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Conv2d(3, 2048, 1),
        )

    def forward(self, x):
        return self.model(x).squeeze(-1).squeeze(-1)

def test_train_pipeline():
    config = get_config()
    config["dataset_path"] = "/path/to/fixture"
    config["num_epochs"] = 1
    mock = Mock(wraps=DummyResnet)
    with patch('training_pipeline.models.Resnet', mock):
        result = train(config)
    assert mock.call_count == 1
    assert result['train_loss'] < .5
    assert result['val_loss'] < 1

Smoke tests like this significantly increase iteration speed, thus simplifying experimentation and debugging. However, there is a downside: like any integration tests, they require considerable maintenance effort of their own, because almost any significant pipeline change may affect them.

Lower-level unit tests should cover individual components of the pipeline. It's not uncommon to have only a few of them, or even none at all, and there's no reason to be ashamed if you don't have them. However, we recommend covering at least the most sensitive components. An example of such a sensitive component is the final model conversion: imagine the model is trained with PyTorch and is later supposed to be deployed to iOS (and run with CoreML) and to the backend (and run with ONNX). It's important to make sure the model is converted properly and that the conversion process doesn't introduce any changes; the converted models should produce the same results as the original model.
Another group of tests is applicable to the trained model, inspired by the property-based testing approach. Property-based testing is a software testing approach that involves generating random inputs for a function or a system and then verifying that certain properties or invariants hold true for all the inputs. Instead of writing specific test cases with predetermined inputs and expected outputs, property-based testing focuses on defining the general properties that the system should satisfy and then automatically generates test cases to validate those properties.
In the context of an ML project, property-based testing can be used to ensure that the final trained model behaves as expected and satisfies certain properties. The following are some examples of properties that can be tested in an ML project:
Monotonicity is often expected in various price prediction models. For example, the price of a house should increase with its square footage if the rest of the features are fixed.
Invariance to expected transformations is another testable property: the model's output should remain the same when the input is transformed in a way that preserves its meaning, i.e., f(g(x)) = f(x), where g is an expected transformation. This could be a rotation or scaling for images, changing an entity to its synonym for natural language processing, altering the volume of a sound for audio, and so on.
We already covered a very similar concept in section 5.2.1. The difference is that in one case, we expect some variation in the results (and we want to measure it), while in the other case, we expect strict consistency (and thus we want to assert it). Using some data samples as fixtures and writing property-based tests for them is a good way to ensure that the model behaves as expected and maintains its reliability.
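As a sketch, here is what a monotonicity property test could look like; `predict_price` is a stand-in for a real trained model:

```python
# Sketch of a property-based test for monotonicity: for random inputs,
# a house-price model's prediction must not decrease when square
# footage increases while all other features stay fixed.
# `predict_price` is a hypothetical stand-in for a real trained model.
import random

def predict_price(sqft, bedrooms):
    # Toy monotonic model used only to make the sketch runnable.
    return 50_000 + 300 * sqft + 10_000 * bedrooms

def test_price_monotonic_in_sqft():
    rng = random.Random(42)
    for _ in range(100):
        sqft = rng.uniform(300, 5000)
        bedrooms = rng.randint(1, 6)
        delta = rng.uniform(1, 500)
        assert predict_price(sqft + delta, bedrooms) >= \
               predict_price(sqft, bedrooms)

test_price_monotonic_in_sqft()
```

Libraries like Hypothesis automate the input generation and shrinking of failing cases, but even a hand-rolled loop like this catches regressions that a handful of fixed test cases would miss.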
An exact list of tests is not usually included in the design document; however, since the design document often serves as a reference for implementation, we recommend thinking about the tests in advance and mentioning them in it.
NOTE If you're interested in ML testing, we recommend Arseny's slides with a deeper review of the topic.
As we continue our work on two separate design documents for our imaginary businesses, it’s time to cover training pipelines for Supermegaretail and PhotoStock Inc.
Let’s see how a potential pipeline could look for Supermegaretail.
Now we go back to PhotoStock Inc., where we are required to build a smart in-house search engine to improve the quality of search results.