Book: Machine Learning System Design
Back: 14 Monitoring and reliability
Next: 16 Ownership and maintenance

15 Serving and inference optimization

This chapter covers

  • Challenges that may arise during the serving and inference stage
  • Tools and frameworks that will come in handy
  • Optimizing inference pipelines

Making your machine learning (ML) model run in a production environment is among the final steps required to reach an efficient operating lifecycle for your system. Some ML practitioners show little interest in this aspect of the craft, preferring instead to focus on developing and training their models. This can be a misstep, however, as a model is only useful if it’s deployed and effectively utilized in production. In this chapter, we discuss the challenges of deploying and serving ML models and review different methods of optimizing the inference process.

As we mentioned in chapter 10, an inference pipeline is a sequence (in most cases) or a more complicated acyclic graph of steps that takes raw data as input and produces predictions as output. Along with the ML model itself, the inference pipeline includes steps like feature computation, data preprocessing, output postprocessing, and others. Preprocessing and postprocessing are somewhat generic terms, as they can be built differently depending on the specific requirements of a given system. Let’s take a few examples. A typical computer vision pipeline often starts with image resizing and normalization; a typical language processing pipeline, in turn, initiates with tokenization, while a typical recommender system pipeline kicks off by pulling user features from a feature store.

The role of properly tuned inference varies from domain to domain as well. High-frequency trading companies and adtech businesses are on a constant hunt for the brightest talents to refine their extremely low-latency systems. Mobile app and Internet of Things (IoT) developers put efficient inference at the top of their priority list, chasing lower battery consumption and thus improved user experience. Those who run high-load backends in their products are interested in handling the load without spending a fortune on infrastructure. At the same time, there are many scenarios where model prediction is not a bottleneck. For example, once per week, we need to run a batch job to predict the next week’s sales, generate the report, and distribute it to the procurement department. If the report is required to be in corporate inboxes by Monday morning, it doesn’t make much difference whether it takes 10 minutes or 3 hours to compile, as long as it’s scheduled for Sunday night.

In general, this chapter mostly focuses on deep learning-based systems. This shouldn’t come as a surprise: heavy models require more engineering effort before production deployment, while serving a lightweight solution like a logistic regression or a small tree ensemble is not too complicated as long as your feature infrastructure (described in chapter 11) is in place. At the same time, some principles and techniques described here apply to any ML system.

15.1 Serving and inference: Challenges

As always happens in system design, our first step lies in defining the requirements. There are several crucial factors to keep in mind, and we touch on each of them next:

Things can get more complicated when we have to use more than one platform for our product. In certain cases, we may decide to run small batches on user devices and send the rest to the backend or load the backend with finetuning before sending the model to a user’s device for inference. This way, the combined requirements of both platforms are followed. Furthermore, we may need to use various computing units within a single platform. Arseny once needed to speed up a system that ran on devices with low-end GPUs. Under the hood, the system used a small model to process multiple concurrent requests, which led to those cheap GPUs failing to handle the load. The solution Arseny came up with introduced a pool of mixed-device sessions: every time the GPU was overloaded, the following request was processed by a CPU, thus providing a more balanced load across devices and meeting the latency requirements.

15.2 Tradeoffs and patterns

The factors mentioned here are often conflicting and even mutually exclusive. For this reason, it is not possible to optimize for all of them simultaneously, meaning we’ll have to find compromises between those factors. However, there are certain patterns we can lean toward in finding a fair balance in various scenarios.

15.2.1 Tradeoffs

Let’s start with latency and throughput. Both are very popular candidates for optimization, and improvements in one of them may lead to either improvements or degradation in the other.

Real production systems are often optimized for the best throughput for a given latency budget. The latency budget is dictated by the product or user experience needs (do users expect a real-time/near-real-time response, or are they tolerant of delays?), and given this budget, the aim is to maximize throughput (or minimize the number of required servers for expected throughput) by changing either the model architecture or inference design (e.g., with batching, as we describe later in section 15.3).
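The "maximize throughput within a latency budget" search described above can be sketched as a simple selection over benchmark results. The numbers below are hypothetical measurements, not figures from the text:

```python
def best_batch_size(latency_ms_by_batch, budget_ms):
    """Among batch sizes whose measured latency fits the budget,
    pick the one with the highest throughput (items per ms)."""
    feasible = {b: lat for b, lat in latency_ms_by_batch.items() if lat <= budget_ms}
    if not feasible:
        return None  # no configuration meets the latency budget
    return max(feasible, key=lambda b: b / feasible[b])

# Hypothetical benchmark: batch size -> measured latency in ms
measured = {1: 20, 8: 50, 32: 120}
print(best_batch_size(measured, budget_ms=100))  # prints 8: it fits the budget and maximizes throughput
```

In practice, the latency table comes from profiling the actual model on the actual hardware; the selection logic stays this simple.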

Let’s go through some examples. Imagine a simple deep learning model—a convolutional network like ResNet or a transformer like BERT. If you just reduce the number of blocks, it is very likely both your latency and throughput numbers will improve no matter what your inference setup is, so things are very straightforward (the model’s accuracy may drop, but that is beyond the scope of this example). But imagine having two models with the same number of blocks and the same number of parameters per block but utilizing two different architectures: model A runs its blocks in parallel with further aggregation (similar to the ResNeXt architecture), and model B runs every block sequentially. Model B will have a higher latency than model A due to the sequential nature of its architecture, but you can run multiple instances of model B in parallel or run it with a large batch size, so the throughput of model B is not any worse than that of model A. Thus, parallelism is one of the internal factors that can affect latency and throughput in different ways. In practice, it means that the number of parameters or the number of blocks cannot be the definitive factor in comparing models, and our understanding of the architecture in the context of the inference platform is what helps us identify the bottlenecks, as seen in figure 15.1.

figure
Figure 15.1 An example of wide versus deep models. While they have the same number of parameters, the deep one is more limited in terms of parallel computation

Another tradeoff is related to model accuracy (or other ML-specific metrics used for the system) and the correlated latency/throughput/costs. ML engineers are often tempted to fix a model’s imperfection by upgrading to a larger model. This occasionally helps, but it is an expensive and poorly scalable way to address the problem. At the same time, optimizing solely for cost by choosing the simplest model is not a good idea either. The dependency between compute costs and the value of the model is not linear; in most cases, there is a sweet spot where the model is good enough and the cost is relatively low, while radical solutions on the edges of the spectrum are less efficient. Revisiting section 9.1, a similar chart can be built with the same idea in mind: we can plot the model’s accuracy as a function of compute costs and find a proper balance for our particular problem (see figure 15.2 for details).

figure
Figure 15.2 Different sizes of models belonging to the same families and their respective accuracy values on the ImageNet dataset. The chart illustrates that larger models tend to have higher accuracy, but this effect tends to saturate.

Finally, a tradeoff that might not be that obvious but is still worth mentioning is the link between research flexibility and production-level serving performance. On the one hand, we can benefit from building a model that is easy to migrate from the experimental sandbox directly to production. On the other hand, a clear separation between research and production allows the research team to experiment with new ideas without affecting the production system. But over time, the difference between experimental code and the real system’s inference will grow, increasing the chance of facing a defect caused by the difference in environments, which is hard to track without proper investments in monitoring and integration tests (see chapter 13 for deployment aspects and chapter 14 for observability).

There are certain tools and frameworks you may want to consider when weighing these options. Because the route you take will significantly affect the sustainability of your system, the following section is fully dedicated to this very important subject.

15.2.2 Patterns

Once we have determined the tradeoffs we need to make for optimizing our model, it’s time to choose the right pattern to implement. We will once again limit ourselves to a small list and mention three diverse patterns. Although this list is not complete, it is highly likely that you will use one of these patterns in your system.

The first pattern worth mentioning is batching, a technique used to improve throughput, and it needs to be considered in advance: if you know that the system can be used in a batch mode, you should design it accordingly. Batching is not just a binary property (to use or not to use) and features multiple nuances. Imagine a typical API for a web page: the client, whom we can’t control, sends some data, and the backend returns a prediction. That’s not a typical offline batch-mode prediction, but there is room for dynamic batching: on the backend side, we wait for a short period (e.g., 20 ms or, say, 32 requests, whichever comes first), collect all the requests that landed during this period, and then run a batch inference on them. This way, we improve throughput at the cost of a small delay while still keeping the system responsive. This approach is implemented in frameworks like the Triton Inference Server by Nvidia and TensorFlow Serving. Another example of smart batching is related to language models: the input size for such models is dynamic, and running a batch inference on multiple inputs of various sizes requires padding to the size of the longest input. However, we can group inputs by size and run batch inference on each group separately, thus reducing the padding overhead (at the cost of smaller batches or a longer batch accumulation window). You can read more about this technique on Graphcore’s blog.
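The accumulation step of dynamic batching can be sketched in a few lines: gather requests until either the batch is full or the time budget runs out, whichever comes first. This is a simplified illustration (the 32-request and 20 ms defaults are just the example values mentioned above); production servers like Triton implement this with far more machinery:

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int = 32,
                  max_wait_s: float = 0.020) -> list:
    """Gather up to max_batch requests, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget exhausted: run inference on what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch
```

The serving loop would then run one model invocation per collected batch and route each prediction back to its waiting client.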

The second pattern we’d like to talk about is caching, which has earned the status of the ultimate level of optimization (we’ll discuss inference optimization later in the chapter). Cache usage comes from the obvious idea: never compute the same thing twice. Sometimes it is as simple as in more traditional software systems: input data is associated with a key, and some key-value storage is used to keep previously computed results, so we don’t run expensive computations twice or more.

Listing 15.1 Example of the simplest possible in-memory cache
from typing import Callable


class InMemoryCache:
    def __init__(self):
        self.cache = {}

    def get(self, key: str, or_else: Callable):
        v = self.cache.get(key)
        if v is None:
            v = or_else()          # compute only on a cache miss
            self.cache[key] = v
        return v

In reality, the distribution of inputs for an ML model may have a long tail, with many requests being unique. For example, according to Google, 15% of all search queries are totally unique. Given that a cache’s time to live is typically far shorter than the whole of Google’s history, the share of unique (noncacheable) queries would be high within any reasonable window. Does this mean the cache is useless? Not necessarily. One reason is that even a low percentage of saved compute can provide a great benefit in terms of saved money. Another option is the idea of fuzzy caches: instead of checking for a direct match of the key, we can relax the matching condition. For example, Arseny has seen a system where the cache key was based on a regular expression, so the result could be shared by multiple similar requests matching the same regex. Even more aggressive caching can be built on the semantic similarity of the key (e.g., some tools use such an approach to cache LLM queries). As practitioners who care about reliability, we recommend thinking twice about such caching: the fuzzier the matching, the more prone the cache is to false cache hits.
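A toy sketch of the regex-keyed fuzzy cache mentioned above may look as follows. The class and the pattern vocabulary are entirely hypothetical; this is an illustration of the idea, not a production design:

```python
import re

class RegexCache:
    """A fuzzy cache: one cached result is shared by every request
    that matches the same regular expression."""

    def __init__(self):
        self._entries = []  # list of (compiled pattern, cached value)

    def get(self, request: str, or_else, pattern: str = None):
        for compiled, value in self._entries:
            if compiled.fullmatch(request):
                return value  # fuzzy hit: some earlier pattern covers this request
        value = or_else()
        # cache under the supplied pattern, or the literal request otherwise
        key = pattern if pattern is not None else re.escape(request)
        self._entries.append((re.compile(key), value))
        return value
```

Note the reliability caveat from the text: a pattern like `price in \w+` would serve the same cached answer for "price in usd" and "price in eur", which is exactly the kind of false hit to watch for.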

A third pattern has recently emerged in the LLM world: routing between models. Inspired by the “mixture of experts” architecture, it has proved to be a good optimization step: some queries are hard (and should be processed by advanced models that are expensive to serve), while some are simpler (and thus can be delegated to less complex and cheaper models). Such an approach is mostly used for general LLMs, machine translation, and other—mostly natural language processing-specific—tasks. A good example of an implementation is presented in the paper “Leeroo Orchestrator: Elevating LLMs Performance Through Model” by Alireza Mohammadshahi et al.

A variation of this pattern is to use two models (one is fast and imperfect; the other is slow and more accurate). With this combination, a quick response is generated by the first, smaller model (so it can be rendered with low latency) to be later replaced with the output of the second, heavier model. Unlike the previous pattern, it does not save the total compute required, but it does optimize a special kind of latency, which is time to initial response.
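The "draft first, refine later" flow can be sketched with a thread pool. Here `fast_model`, `slow_model`, and `render` are placeholder callables standing in for whatever models and UI-update mechanism a real system would use:

```python
from concurrent.futures import ThreadPoolExecutor

def respond(request, fast_model, slow_model, render):
    """Render a quick draft immediately, then replace it with the
    slower model's output once it is ready."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        refined = pool.submit(slow_model, request)  # heavy model starts first
        render(fast_model(request))   # low time-to-initial-response
        render(refined.result())      # final answer replaces the draft
```

Total compute is not reduced (both models run), but the user sees a response as soon as the fast model finishes.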

15.3 Tools and frameworks

The inference process is heavily engineering focused. Luckily, there are many tools and frameworks available for use. Following the approach taken for this book, we don’t aim to provide a comprehensive overview of every framework you might need to construct a reliable inference pipeline; instead, we focus on the principles and mention popular solutions for illustrative purposes.

15.3.1 Choosing a framework

One common, although not immediately apparent, heuristic is to separate your training and inference frameworks. It is typical to use tools like Pandas, scikit-learn, and Keras, as they offer flexibility and simplicity during research, prototyping, and training. However, they are not ideal for inference due to the inevitable tradeoff between flexibility and performance. This is why it’s a popular practice to train models in one framework and then convert them for inference in another. Additionally, it’s essential to decouple the training framework from the inference framework as much as possible so that if you need to switch to a different training framework with new features, it won’t affect the inference pipeline. This is especially crucial for production systems expected to be operational and evolve for years.

From the other perspective, some research-first frameworks like PyTorch tend to close the gap between research and production. The compilation functionality introduced in PyTorch 2.0 makes it possible to get a fairly optimized inference pipeline from the same code used for training and experiments. Both paradigms are viable: you can use one universal framework or a combination of two or more solutions for different purposes (see figure 15.3 for information on the most popular frameworks).

figure
Figure 15.3 A wide spectrum of tools with their focus on either research or production serving goals

Achieving a balance between research flexibility and high performance in the production environment may require an interframework format. A popular choice for this purpose is ONNX, which is supported by many training frameworks for converting their models to ONNX. On the other hand, inference frameworks often work with the ONNX format or allow the conversion of ONNX models into their own format, making ONNX a common language in the ML world.

Here it should be mentioned that ONNX is not just a representation format but rather an ecosystem. It includes a runtime that can be used for inference and a set of tools for model conversion and optimization. ONNX Runtime supports a variety of backends and can run on almost any platform, with specific optimizations for each of them. This makes it an excellent choice for multiplatform systems. Based on our experience, ONNX Runtime strikes a good balance between flexibility (although it is less flexible than serving PyTorch or scikit-learn models directly) and performance (although it is less performant than a highly custom solution tailored to a specific combination of models, hardware, and usage patterns).

For the server CPU inference, a popular engine is OpenVINO by Intel. For CUDA inference, TensorRT by Nvidia is commonly used. Both of them are optimized for their target hardware and are also available as ONNX Runtime backends, which allows them to be used in the same way as ONNX Runtime. Two more honorable mentions are TVM (), which is a compiler for deep learning models capable of generating code for a variety of hardware targets, and AITemplate (), which can be used even for running models on the less common AMD GPUs via the ROCm software stack.

Those familiar with iOS model deployment may be aware of CoreML, an engine for iOS inference that uses its own format. Android developers usually opt for TensorFlow Lite, although there are more options available on Android thanks to its greater fragmentation. This list is far from complete, but it provides an idea of how large the variety of available options is.

When we say something like “engine X runs on device Y,” it may not be precisely accurate. For example, even when the overall inference is meant to run on a GPU, some operations may be forwarded to the CPU because they run more efficiently there. GPUs excel at massively parallel computing, making them ideal for tasks like matrix multiplication or convolutions. However, some operations related to control flow or sparse data are better handled by the CPU. For instance, CoreML dynamically splits the execution graph between the CPU, GPU, and Apple Neural Engine to maximize efficiency.

Various inference engines often come with optimizers that can be used to improve the model’s performance. For example, ONNX Runtime offers a set of optimizers that can reduce the model’s size by trimming unused parts of the graph (e.g., those only used in training for loss computation) or by reducing the number of operations after fusing several operations into one. These optimizers are typically separate tools used to prepare the model for inference but are not part of the inference engine itself.
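To make operator fusion concrete, here is the arithmetic behind one classic graph optimization: folding a batch-norm step into the preceding linear operation so that two steps become a single multiply-add. This is a scalar sketch of the math only; real optimizers (such as those shipped with inference engines) apply it across whole weight tensors:

```python
import math

def fuse_linear_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w', b') such that w' * x + b' equals batch norm
    (with parameters gamma, beta, mean, var) applied to w * x + b."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# The fused op produces the same output with one multiply-add instead of two steps
w, b = 2.0, 1.0
gamma, beta, mean, var = 1.5, 0.3, 0.8, 4.0
fw, fb = fuse_linear_bn(w, b, gamma, beta, mean, var)
x = 3.0
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
assert abs((fw * x + fb) - unfused) < 1e-9
```

Because batch-norm parameters are constants at inference time, this fusion is free: it changes no outputs, only the number of operations per call.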

By the time this book is published, there’s a chance this section will be outdated, as the field is evolving rapidly. For example, when we started writing the book, few people paid attention to LLMs, and now they are ubiquitous. People often debate which inference engine is better for them—whether it’s vLLM, TGI, or GGML. LLM inference is a rather specific topic overall: unlike most traditional models, LLMs are often bound by a large memory footprint and the autoregressive paradigm (you can’t predict token T + 1 before token T is predicted, which makes parallelization barely possible without additional tricks). If you’re interested in LLM inference, we recommend reading dedicated blog posts and some truly profound pieces of research (e.g., at the moment of polishing this chapter, we were impressed with “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU” by Song et al.).

While the terms inference engine and inference framework are often used interchangeably, they are not completely identical. An inference engine is a runtime used for inference, while an inference framework is a more general term that may include an engine, a set of tools for model conversion and optimization, and other components that assist with various aspects of serving, such as batching, versioning, model registry, logging, and so on. For example, ONNX Runtime is an inference engine, while TorchServe () is an inference framework.

There is no single answer to the question of whether you need a fully featured framework or just a small wrapper on top of the inference engine. From our experience, once your needs reach a high level of certainty and you generally tend to work with more advanced machinery while having enough human resources to maintain it, frameworks are the way to go. On the other hand, once you’re in a startup and need to ship the system somehow but know that your needs will only be formulated in the next several months, it makes sense to opt for the leaner way, which is to deploy the model with a simple combination of an inference engine and some communication layer (e.g., web framework) and postpone a more reliable solution for the next version.

15.3.2 Serverless inference

Serverless inference is an emerging approach that stands out from traditional server-based models. Popularized by AWS Lambda, the serverless paradigm is now represented by multiple alternatives from major cloud providers, such as Google Cloud Functions, Azure Functions, and Cloudflare Workers, as well as from startups like Banana.dev and Replicate. As of this writing, major providers primarily offer CPU inference with limited GPU capabilities, although this is likely to change as startups continue to push the boundaries in this field.

More advanced products by major cloud providers, like AWS SageMaker jobs, can be viewed as serverless as well. They share core serverless properties (managed infrastructure, pay-as-you-go pricing, autoscaling) but aim for another level of abstraction: instead of short-lived functions, they run containerized jobs for extended periods.

It’s important to note that the term serverless can be somewhat misleading. It doesn’t imply the absence of servers in the system. Rather, it means that engineers have no direct control over the servers, as they are isolated from them by cloud providers, so they don’t need to concern themselves with server management.

Attitudes toward serverless inference are generally either strictly positive or strictly negative due to its notable advantages and disadvantages. Let’s highlight some of them:

Some serverless providers, like Replicate, offer a wide range of pretrained foundation models that can be used out of the box. This is particularly advantageous when starting new projects, especially for prototyping or research purposes.

We’ve observed both successful and failed cases of serverless inference, in small pet projects and in high-load production environments alike. It’s unquestionably a viable option, but careful consideration and a thorough cost-benefit analysis are crucial before fully embracing it. The rule of thumb we follow: consider serverless inference when autoscaling is a significant advantage (e.g., high variance in the number of requests) and the model itself is not excessively large (although even LLMs can sometimes be deployed in a CPU-only serverless environment). Another good idea is to consider serverless inference when you are uncertain about the future load: it can reasonably fit multiple load patterns at the beginning of the project, giving you enough time to redesign the inference part once the situation becomes clearer.

15.4 Optimizing inference pipelines

Premature optimization is the root of all evil.
— Donald Knuth

Optimizing inference pipelines is a wide topic that is not a part of ML system design per se; still, it’s a crucial part of ML system engineering that deserves a separate section, at least as a landscape overview of common approaches and tools. At its core, optimizing inference pipelines often boils down to a tradeoff between the model’s speed and accuracy and the required resource capacity, which implies numerous optimization techniques whose applicability mainly depends on the model’s characteristics and architecture.

A reasonable question that may pop up here is, “Where would you start optimizing?” We asked a similar question of a number of ML engineers during job interviews and received a variety of answers mentioning such terms as model pruning and quantization (see “Pruning and Quantization for Deep Neural Network Acceleration: A Survey”) and distillation (see “Knowledge Distillation: A Survey”), with references to state-of-the-art papers (we only mention a couple of surveys so you can use them as starting points). These techniques are widely known in the ML research community; they’re useful and often applicable, but they focus only on model optimization without letting us control the whole picture. The most pragmatic answer was, “I would start with profiling.”

15.4.1 Starting with profiling

Profiling is a process of measuring the performance of the system and identifying bottlenecks. In contrast to the techniques mentioned earlier, profiling is a more general approach that can be applied to the whole system. Just like strategy comes before tactics, profiling is a great starting point that allows for the identification of the most promising directions for optimization and the selection of the most appropriate techniques.

You might be surprised by how often the seemingly most obvious factors may not be the model’s weakest links. Let’s take latency as an example. There are cases when it is not the bottleneck (especially when the served model is not a recent generative thing but something more conventional), and the problem is hiding somewhere else (e.g., data preprocessing or network interactions). Furthermore, even if the model is the slowest part of the pipeline, it doesn’t automatically mean it should be the target for optimization. This may seem counterintuitive, but we should look for the most optimizable, not the slowest, part of the system. Imagine that a full run takes 200 ms, 120 ms of which is required for the model. But the fact is, the model is already optimized; it runs on GPU via a highly performant engine, and there is not much room for any more improvement. On the other hand, data preprocessing takes 80 ms, and it is an arbitrary Python code that can be optimized in many ways. In this case, it is better to start with data preprocessing, not the model.
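This kind of reasoning is exactly what a profiler makes visible. A minimal example using only the Python standard library follows; the toy pipeline and its sleep durations are hypothetical stand-ins for real preprocessing and model work:

```python
import cProfile
import io
import pstats
import time

def preprocess(data):       # stand-in for arbitrary Python preprocessing (80 ms in the text's example)
    time.sleep(0.03)
    return data

def run_model(features):    # stand-in for an already-optimized model call
    time.sleep(0.01)
    return features

profiler = cProfile.Profile()
profiler.enable()
run_model(preprocess("raw request"))
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())  # preprocess dominates cumulative time: a candidate to optimize first
```

The printed table ranks functions by cumulative time, which is usually enough to decide where optimization effort is best spent.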

Another example comes from Arseny’s experience; he was once asked to reduce the latency of a system. The system was a relatively simple pipeline that ran tens of simple models sequentially. The first idea to improve timing was to replace the sequential run with a batch inference. However, profiling proved the opposite: the inference itself was insignificant (around 5%) compared to the time spent on data preprocessing, and a batch inference would not help. What helped was optimizing the preprocessing step, which could not benefit from batching and was eventually coupled with the initially designed sequential run. In the end, Arseny managed to speed up the system by 40% without touching the core model inference, and such elements as thread management, IO and serialization/deserialization functions, and cascade caching were the real low-hanging fruit.

Here we should mention that profiling ML systems differs slightly from profiling regular software for reasons like extensive use of GPUs, asynchronous execution, using thin Python wrappers on top of high-performance native code, and so on. And since the whole process may include a larger number of variables, you should be careful when interpreting the results of profiling, as it is easy to get confused by the tricky nature of the problem.

GPU execution can be especially confusing. A typical CPU load is simple: the data is loaded into memory, the CPU executes the code, done. There may be nuances related to CPU caches, single instruction, multiple data (SIMD) instructions, or concurrent execution, but in most cases, it is straightforward. The GPU, however, is a separate device with its own built-in memory, and the data has to be copied to GPU memory before execution. The copying process itself can be a bottleneck, and it’s not always obvious how to measure it. The highly parallel nature of GPU execution also makes for nonlinear effects. The simplest example: running a model on a single image may take 100 ms, while running it on 64 images may increase the processing time only to 200 ms. That is because the GPU is not fully utilized when processing just one item, and fixed costs like copying the data to GPU memory make up a significant share of the total time. The same happens at a lower level of model architecture: reducing the number of filters in a convolutional layer may not reduce latency, as the GPU is not fully utilized and uses the same CUDA kernel under the hood. Overall, programming for CUDA and other general-purpose GPU frameworks is a separate and extremely deep rabbit hole; the only takeaway we would like to focus on is that the typical programmer’s intuition about what is fast and what is slow can be totally irrelevant for GPU-based computing.
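The nonlinearity is easy to quantify with the numbers from the example above:

```python
def throughput_per_s(batch_size, batch_latency_s):
    """Items processed per second at a given batch configuration."""
    return batch_size / batch_latency_s

# Numbers from the example: 1 image in 100 ms vs. 64 images in 200 ms
single = throughput_per_s(1, 0.100)    # 10 images/s
batched = throughput_per_s(64, 0.200)  # 320 images/s: 32x the throughput for 2x the latency
```

This is why per-item latency measured at batch size 1 says very little about a GPU system's capacity.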

So a proper approach to profiling requires keeping a wide mix of tools at hand: basic profilers like cProfile from the Python standard library, more advanced third-party tools like Scalene, memray, and py-spy, ML framework-specific tools (like PyTorch Profiler), and low-level GPU profilers like nvprof. Finally, when working with exotic hardware like tensor processing units or IoT processors, you may need vendor-specific tools.

After interpreting the profiling results, we can start optimizing the system, addressing the most visible bottlenecks. Typical approaches are arranged across various levels:

15.4.2 The best optimizing is minimum optimizing

If we step away from the operating level and look at optimization from the overall design perspective, some of the problems that arise during the maintenance stage can be avoided if the system has been given a thorough treatment in accordance with the original requirements during the design stage. If we are aware of strict latency requirements, we should initially choose a model that is fast enough for the target platform. Of course, we can reduce the memory footprint with quantization or pruning, or reduce latency with distillation, but it is better to start with a model close to the target requirements rather than trying to speed it up when the system is ready. From our experience, an urgent need for optimization is usually the result of poor choices at the beginning of the system’s life cycle (e.g., heavy models were used as a baseline), an unexpected success (a startup built a quick prototype and suddenly needs to scale), or planned tech debt (“Okay, we’ve built something really suboptimal but fast for now; if it survives and helps us find the product-market fit, we will clean it up”).

Choosing the level of optimization is a crucial decision for effective inference. A founding engineer in a startup who needs to extend their minimal viable product to the second customer should usually just scale via straightforward renting of additional cloud computing resources. On the other hand, a platform engineer in a big tech company would likely benefit from reading “Algorithms for Modern Hardware” () and applying low-level optimizations at scale.

15.5 Design document: Serving and inference

A separate section of the design document dedicated to inference optimization should cover the anticipated actions during the maintenance stage. Let’s examine the commonalities and differences in inference optimization for the two rather divergent ML systems.

15.5.1 Serving and inference for Supermegaretail

Based on the key features and requirements of retail-focused ML systems, the solution for Supermegaretail will not require real-time predictions, allowing the model to run in batches; still, it will involve a large scope of work.

15.5.2 Serving and inference for PhotoStock Inc.

Search engine optimization includes two main components:

Summary
