Book: Machine Learning System Design
Back: 14 Monitoring and reliability
Next: 16 Ownership and maintenance

15 Serving and inference optimization

This chapter covers

  • Challenges that may arise during the serving and inference stage
  • Tools and frameworks that will come in handy
  • Optimizing inference pipelines

Making your machine learning (ML) model run in a production environment is among the final steps required to reach an efficient operating lifecycle for your system. Some ML practitioners show little interest in this aspect of the craft, preferring instead to focus on developing and training their models. This can be a misstep, however, as a model is only useful if it’s deployed and effectively utilized in production. In this chapter, we discuss the challenges of deploying and serving ML models and review different methods of optimizing the inference process.

As we mentioned in chapter 10, an inference pipeline is a sequence (in most cases) or a more complicated acyclic graph of steps that takes raw data as input and produces predictions as output. Along with the ML model itself, the inference pipeline includes steps like feature computation, data preprocessing, output postprocessing, and others. Preprocessing and postprocessing are somewhat generic terms, as they can be built differently depending on the specific requirements of a given system. Let’s take a few examples. A typical computer vision pipeline often starts with image resizing and normalization; a typical language processing pipeline, in turn, initiates with tokenization, while a typical recommender system pipeline kicks off by pulling user features from a feature store.

The role of properly tuned inference varies from domain to domain as well. High-frequency trading companies and adtech businesses are on a constant hunt for the brightest talents to refine their extremely low-latency systems. Mobile app and Internet of Things (IoT) developers put efficient inference at the top of their priority list, chasing lower battery consumption and thus improved user experience. Those who run high-load backends in their products are interested in handling the load without spending a fortune on infrastructure. At the same time, there are many scenarios where model prediction is not a bottleneck. For example, once per week, we need to run a batch job to predict the next week’s sales, generate the report, and distribute it to the procurement department. If the report is required to be in corporate inboxes by Monday morning, it doesn’t make much difference whether it takes 10 minutes or 3 hours to compile, as long as it’s scheduled for Sunday night.

In general, this chapter mostly focuses on deep learning-based systems. This shouldn’t come as a surprise: heavy models require more engineering effort before production deployment, while serving a lightweight solution like a logistic regression or a small tree ensemble is not too complicated as long as your feature infrastructure (described in chapter 11) is in place. At the same time, some principles and techniques described here apply to any ML system.

15.1 Serving and inference: Challenges

As always happens in system design, our first step lies in defining the requirements. There are several crucial factors to keep in mind, and we touch on each of them next:

Things can get more complicated when we have to use more than one platform for our product. In certain cases, we may decide to run small batches on user devices and send the rest to the backend or load the backend with finetuning before sending the model to a user’s device for inference. This way, the combined requirements of both platforms are followed. Furthermore, we may need to use various computing units within a single platform. Arseny once needed to speed up a system that ran on devices with low-end GPUs. Under the hood, the system used a small model to process multiple concurrent requests, which led to those cheap GPUs failing to handle the load. The solution Arseny came up with introduced a pool of mixed-device sessions: every time the GPU was overloaded, the following request was processed by a CPU, thus providing a more balanced load across devices and meeting the latency requirements.

15.2 Tradeoffs and patterns

The factors mentioned here are often conflicting and even mutually exclusive. For this reason, it is not possible to optimize for all of them simultaneously, meaning we’ll have to find compromises between those factors. However, there are certain patterns we can lean toward in finding a fair balance in various scenarios.

15.2.1 Tradeoffs

Let’s start with latency and throughput. Both are very popular candidates for optimization, and improvements in one of them may lead to either improvements or degradation in the other.

Real production systems are often optimized for the best throughput for a given latency budget. The latency budget is dictated by the product or user experience needs (do users expect a real-time/near-real-time response, or are they tolerant of delays?), and given this budget, the aim is to maximize throughput (or minimize the number of required servers for expected throughput) by changing either the model architecture or inference design (e.g., with batching, as we describe later in section 15.3).
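The "maximize throughput within a latency budget" search described above can be sketched as a simple selection over benchmark results. The numbers below are hypothetical measurements, not figures from the text:

```python
def best_batch_size(latency_ms_by_batch, budget_ms):
    """Among batch sizes whose measured latency fits the budget,
    pick the one with the highest throughput (items per ms)."""
    feasible = {b: lat for b, lat in latency_ms_by_batch.items() if lat <= budget_ms}
    if not feasible:
        return None  # no configuration meets the latency budget
    return max(feasible, key=lambda b: b / feasible[b])

# Hypothetical benchmark: batch size -> measured latency in ms
measured = {1: 20, 8: 50, 32: 120}
print(best_batch_size(measured, budget_ms=100))  # prints 8: it fits the budget and maximizes throughput
```

In practice, the latency table comes from profiling the actual model on the actual hardware; the selection logic stays this simple.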

Let’s go through some examples. Imagine a simple deep learning model—a convolutional network like ResNet or a transformer like BERT. If you just reduce the number of blocks, it is very likely both your latency and throughput numbers will improve no matter what your inference setup is, so things are very straightforward (the model’s accuracy may drop, but that is beyond the scope of this example). But imagine having two models with the same number of blocks and the same number of parameters per block but utilizing two different architectures: model A runs its blocks in parallel with further aggregation (similar to the ResNeXt architecture), and model B runs every block sequentially. Model B will have a higher latency than model A due to the sequential nature of its architecture, but you can run multiple instances of model B in parallel or run it with a large batch size, so the throughput of model B is not any worse than that of model A. Thus, parallelism is one of the internal factors that can affect latency and throughput in different ways. In practice, it means that the number of parameters or the number of blocks cannot be the definitive factor in comparing models, and our understanding of the architecture in the context of the inference platform is what helps us identify the bottlenecks, as seen in figure 15.1.

figure
Figure 15.1 An example of wide versus deep models. While they have the same number of parameters, the deep one is more limited in terms of parallel computation

Another tradeoff is related to model accuracy (or other ML-specific metrics used for the system) and the correlated latency/throughput/costs. ML engineers are often tempted to fix a model’s imperfection by upgrading to a larger model. This occasionally helps, but it is an expensive and poorly scalable way to address the problem. At the same time, optimizing solely for cost by choosing the simplest model is not a good idea either. The dependency between compute costs and the value of the model is not linear; in most cases, there is a sweet spot where the model is good enough and the cost is relatively low, while radical solutions on the edges of the spectrum are less efficient. Revisiting section 9.1, a similar chart can be built with the same idea in mind: we can plot the model’s accuracy as a function of compute costs and find a proper balance for our particular problem (see figure 15.2 for details).

figure
Figure 15.2 Different sizes of models belonging to the same families and their respective accuracy values on the ImageNet dataset. The chart illustrates that larger models tend to have higher accuracy, but this effect tends to saturate.

Finally, a tradeoff that might not be that obvious but is still worth mentioning is the link between research flexibility and production-level serving performance. On the one hand, we can benefit from building a model that is easy to migrate from the experimental sandbox directly to production. On the other hand, a clear separation between research and production allows the research team to experiment with new ideas without affecting the production system. But over time, the difference between experimental code and the real system’s inference will grow, increasing the chance of facing a defect caused by the difference in environments, which is hard to track without proper investments in monitoring and integration tests (see chapter 13 for deployment aspects and chapter 14 for observability).

There are certain tools and frameworks you may want to consider when weighing these options. Because the route you take will significantly affect the sustainability of your system, the following section is fully dedicated to this very important subject.

15.2.2 Patterns

Once we have determined the tradeoffs we need to make for optimizing our model, it’s time to choose the right pattern to implement. We will once again limit ourselves to a small list and mention three diverse patterns. Although this list is not complete, it is highly likely that you will use one of these patterns in your system.

The first pattern worth mentioning is batching, a technique used to improve throughput, and it needs to be considered in advance: if you know that the system can be used in a batch mode, you should design it accordingly. Batching is not just a binary property (to use or not to use) and features multiple nuances. Imagine a typical API for a web page: the client, whom we can’t control, sends some data, and the backend returns a prediction. That’s not a typical offline batch-mode prediction, but there is room for dynamic batching: on the backend side, we wait for a short period (e.g., 20 ms or, say, 32 requests, whichever comes first), collect all the requests that landed during this period, and then run a batch inference on them. This way, we improve throughput at the cost of a small delay while still keeping the system responsive. This approach is implemented in frameworks like the Triton Inference Server by Nvidia and TensorFlow Serving. Another example of smart batching is related to language models: the input size for such models is dynamic, and running a batch inference on multiple inputs of various sizes requires padding to the size of the longest input. However, we can group inputs by size and run batch inference on each group separately, thus reducing the padding overhead (at the cost of smaller batches or a longer batch accumulation window). You can read more about this technique on Graphcore’s blog.
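The accumulation step of dynamic batching can be sketched in a few lines: gather requests until either the batch is full or the time budget runs out, whichever comes first. This is a simplified illustration (the 32-request and 20 ms defaults are just the example values mentioned above); production servers like Triton implement this with far more machinery:

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int = 32,
                  max_wait_s: float = 0.020) -> list:
    """Gather up to max_batch requests, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget exhausted: run inference on what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch
```

The serving loop would then run one model invocation per collected batch and route each prediction back to its waiting client.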

The second pattern we’d like to talk about is caching, which has earned the status of the ultimate level of optimization (we’ll discuss inference optimization later in the chapter). Cache usage comes from the obvious idea: never compute the same thing twice. Sometimes it is as simple as in more traditional software systems: input data is associated with a key, and some key-value storage is used to keep previously computed results, so we don’t run expensive computations twice or more.

Listing 15.1 Example of the simplest possible in-memory cache
from typing import Callable


class InMemoryCache:
    def __init__(self):
        self.cache = {}

    def get(self, key: str, or_else: Callable):
        v = self.cache.get(key)
        if v is None:
            v = or_else()          # compute only on a cache miss
            self.cache[key] = v
        return v

In reality, the distribution of inputs for an ML model may have a long tail, with many requests being unique. For example, according to Google, 15% of all search queries are totally unique. Given that a cache’s time to live is typically far shorter than the whole of Google’s history, the share of unique (noncacheable) queries would be high within any reasonable window. Does this mean the cache is useless? Not necessarily. One reason is that even a low percentage of saved compute can provide a great benefit in terms of saved money. Another option is the idea of fuzzy caches: instead of checking for a direct match of the key, we can relax the matching condition. For example, Arseny has seen a system where the cache key was based on a regular expression, so the result could be shared by multiple similar requests matching the same regex. Even more aggressive caching can be built on the semantic similarity of the key (e.g., some tools use such an approach to cache LLM queries). As practitioners who care about reliability, we recommend thinking twice about such caching: the fuzzier the matching, the more prone the cache is to false cache hits.
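A toy sketch of the regex-keyed fuzzy cache mentioned above may look as follows. The class and the pattern vocabulary are entirely hypothetical; this is an illustration of the idea, not a production design:

```python
import re

class RegexCache:
    """A fuzzy cache: one cached result is shared by every request
    that matches the same regular expression."""

    def __init__(self):
        self._entries = []  # list of (compiled pattern, cached value)

    def get(self, request: str, or_else, pattern: str = None):
        for compiled, value in self._entries:
            if compiled.fullmatch(request):
                return value  # fuzzy hit: some earlier pattern covers this request
        value = or_else()
        # cache under the supplied pattern, or the literal request otherwise
        key = pattern if pattern is not None else re.escape(request)
        self._entries.append((re.compile(key), value))
        return value
```

Note the reliability caveat from the text: a pattern like `price in \w+` would serve the same cached answer for "price in usd" and "price in eur", which is exactly the kind of false hit to watch for.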

A third pattern has recently emerged in the LLM world: routing between models. Inspired by the “mixture of experts” architecture, it has proved to be a good optimization step: some queries are hard (and should be processed by advanced models that are expensive to serve), while some are simpler (and thus can be delegated to less complex and cheaper models). Such an approach is mostly used for general LLMs, machine translation, and other—mostly natural language processing-specific—tasks. A good example of an implementation is presented in the paper “Leeroo Orchestrator: Elevating LLMs Performance Through Model” by Alireza Mohammadshahi et al.

A variation of this pattern is to use two models (one is fast and imperfect; the other is slow and more accurate). With this combination, a quick response is generated by the first, smaller model (so it can be rendered with low latency) to be later replaced with the output of the second, heavier model. Unlike the previous pattern, it does not save the total compute required, but it does optimize a special kind of latency, which is time to initial response.
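The "draft first, refine later" flow can be sketched with a thread pool. Here `fast_model`, `slow_model`, and `render` are placeholder callables standing in for whatever models and UI-update mechanism a real system would use:

```python
from concurrent.futures import ThreadPoolExecutor

def respond(request, fast_model, slow_model, render):
    """Render a quick draft immediately, then replace it with the
    slower model's output once it is ready."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        refined = pool.submit(slow_model, request)  # heavy model starts first
        render(fast_model(request))   # low time-to-initial-response
        render(refined.result())      # final answer replaces the draft
```

Total compute is not reduced (both models run), but the user sees a response as soon as the fast model finishes.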

15.3 Tools and frameworks

The inference process is heavily engineering focused. Luckily, there are many tools and frameworks available for use. Following the approach taken for this book, we don’t aim to provide a comprehensive overview of every framework you might need to construct a reliable inference pipeline; instead, we focus on the principles and mention popular solutions for illustrative purposes.

15.3.1 Choosing a framework

One common, although not immediately apparent, heuristic is to separate your training and inference frameworks. It is typical to use tools like Pandas, scikit-learn, and Keras, as they offer flexibility and simplicity during research, prototyping, and training. However, they are not ideal for inference due to the inevitable tradeoff between flexibility and performance. This is why it’s a popular practice to train models in one framework and then convert them for inference in another. Additionally, it’s essential to decouple the training framework from the inference framework as much as possible so that if you need to switch to a different training framework with new features, it won’t affect the inference pipeline. This is especially crucial for production systems expected to be operational and evolve for years.

From the other perspective, some research-first frameworks like PyTorch tend to close the gap between research and production. The compilation functionality introduced in PyTorch 2.0 makes it possible to get a fairly optimized inference pipeline from the same code used for training and experiments. Both paradigms are viable: you can use one universal framework or a combination of two or more solutions for different purposes (see figure 15.3 for information on the most popular frameworks).

figure
Figure 15.3 A wide spectrum of tools with their focus on either research or production serving goals

Achieving a balance between research flexibility and high performance in the production environment may require an interframework format. A popular choice for this purpose is ONNX, which is supported by many training frameworks for converting their models to ONNX. On the other hand, inference frameworks often work with the ONNX format or allow the conversion of ONNX models into their own format, making ONNX a common language in the ML world.

Here it should be mentioned that ONNX is not just a representation format but rather an ecosystem. It includes a runtime that can be used for inference and a set of tools for model conversion and optimization. ONNX Runtime supports a variety of backends and can run on almost any platform, with specific optimizations for each of them. This makes it an excellent choice for multiplatform systems. Based on our experience, ONNX Runtime strikes a good balance between flexibility (although it is less flexible than serving PyTorch or scikit-learn models directly) and performance (although it is less performant than a highly custom solution tailored to a specific combination of models, hardware, and usage patterns).

For the server CPU inference, a popular engine is OpenVINO by Intel. For CUDA inference, TensorRT by Nvidia is commonly used. Both of them are optimized for their target hardware and are also available as ONNX Runtime backends, which allows them to be used in the same way as ONNX Runtime. Two more honorable mentions are TVM (), which is a compiler for deep learning models capable of generating code for a variety of hardware targets, and AITemplate (), which can be used even for running models on the less common AMD GPUs via the ROCm software stack.

Those familiar with iOS model deployment may be aware of CoreML, an engine for iOS inference that uses its own format. Android developers usually opt for TensorFlow Lite, although there are more options available on Android thanks to its greater fragmentation. This list is far from complete, but it provides an idea of how large the variety of available options is.

When we say something like “engine X runs on device Y,” it may not be precisely accurate. For example, even when the overall inference is meant to run on a GPU, some operations may be forwarded to the CPU because they run more efficiently there. GPUs excel at massively parallel computing, making them ideal for tasks like matrix multiplication or convolutions. However, some operations related to control flow or sparse data are better handled by the CPU. For instance, CoreML dynamically splits the execution graph between the CPU, GPU, and Apple Neural Engine to maximize efficiency.

Various inference engines often come with optimizers that can be used to improve the model’s performance. For example, ONNX Runtime offers a set of optimizers that can reduce the model’s size by trimming unused parts of the graph (e.g., those only used in training for loss computation) or by reducing the number of operations after fusing several operations into one. These optimizers are typically separate tools used to prepare the model for inference but are not part of the inference engine itself.
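To make operator fusion concrete, here is the arithmetic behind one classic graph optimization: folding a batch-norm step into the preceding linear operation so that two steps become a single multiply-add. This is a scalar sketch of the math only; real optimizers (such as those shipped with inference engines) apply it across whole weight tensors:

```python
import math

def fuse_linear_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w', b') such that w' * x + b' equals batch norm
    (with parameters gamma, beta, mean, var) applied to w * x + b."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# The fused op produces the same output with one multiply-add instead of two steps
w, b = 2.0, 1.0
gamma, beta, mean, var = 1.5, 0.3, 0.8, 4.0
fw, fb = fuse_linear_bn(w, b, gamma, beta, mean, var)
x = 3.0
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
assert abs((fw * x + fb) - unfused) < 1e-9
```

Because batch-norm parameters are constants at inference time, this fusion is free: it changes no outputs, only the number of operations per call.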

By the time this book is published, there’s a chance this section will be outdated, as the field is evolving rapidly. For example, when we started writing the book, few people paid attention to LLMs, and now they are ubiquitous. People often debate which inference engine is better for them—whether it’s vLLM, TGI, or GGML. LLM inference is a rather specific topic overall: unlike most traditional models, LLMs are often bound by a large memory footprint and the autoregressive paradigm (you can’t predict token T + 1 before token T is predicted, which makes parallelization barely possible without additional tricks). If you’re interested in LLM inference, we recommend reading dedicated blog posts and some truly profound pieces of research (e.g., at the moment of polishing this chapter, we were impressed with “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU” by Song et al.).

While the terms inference engine and inference framework are often used interchangeably, they are not completely identical. An inference engine is a runtime used for inference, while an inference framework is a more general term that may include an engine, a set of tools for model conversion and optimization, and other components that assist with various aspects of serving, such as batching, versioning, model registry, logging, and so on. For example, ONNX Runtime is an inference engine, while TorchServe () is an inference framework.

There is no single answer to the question of whether you need a fully featured framework or just a small wrapper on top of the inference engine. From our experience, once your needs reach a high level of certainty and you generally tend to work with more advanced machinery while having enough human resources to maintain it, frameworks are the way to go. On the other hand, once you’re in a startup and need to ship the system somehow but know that your needs will only be formulated in the next several months, it makes sense to opt for the leaner way, which is to deploy the model with a simple combination of an inference engine and some communication layer (e.g., web framework) and postpone a more reliable solution for the next version.

15.3.2 Serverless inference

Serverless inference is an emerging approach that stands out from traditional server-based models. Popularized by AWS Lambda, the serverless paradigm is now represented by multiple alternatives from major cloud providers, such as Google Cloud Functions, Azure Functions, and Cloudflare Workers, as well as from startups like Banana.dev and Replicate. As of this writing, major providers primarily offer CPU inference with limited GPU capabilities, although this is likely to change as startups continue to push the boundaries in this field.

More advanced products by major cloud providers, like AWS SageMaker jobs, can be viewed as serverless as well. They share core serverless properties (managed infrastructure, pay-as-you-go pricing, autoscaling) but aim for another level of abstraction: instead of short-lived functions, they run containerized jobs for extended periods.

It’s important to note that the term serverless can be somewhat misleading. It doesn’t imply the absence of servers in the system. Rather, it means that engineers have no direct control over the servers, as they are isolated from them by cloud providers, so they don’t need to concern themselves with server management.

Attitudes toward serverless inference are generally either strictly positive or strictly negative due to its notable advantages and disadvantages. Let’s highlight some of them:

Some serverless providers, like Replicate, offer a wide range of pretrained foundation models that can be used out of the box. This is particularly advantageous when starting new projects, especially for prototyping or research purposes.

We’ve observed both successful and failed cases of serverless inference, in small pet projects and in high-load production environments alike. It’s unquestionably a viable option, but careful consideration and a thorough cost-benefit analysis are crucial before fully embracing it. The rule of thumb we follow: consider serverless inference when autoscaling is a significant advantage (e.g., high variance in the number of requests) and the model itself is not excessively large (although even LLMs can sometimes be deployed in a CPU-only serverless environment). Another good idea is to consider serverless inference when you are uncertain about the future load: it can reasonably fit multiple load patterns at the beginning of the project, giving you enough time to redesign the inference part once the situation becomes clearer.

15.4 Optimizing inference pipelines

Premature optimization is the root of all evil.
— Donald Knuth

Optimizing inference pipelines is a wide topic that is not a part of ML system design per se; still, it’s a crucial part of ML system engineering that deserves a separate section, at least as a landscape overview of common approaches and tools. At its core, optimizing inference pipelines often boils down to a tradeoff between the model’s speed and accuracy and the required resource capacity, which implies numerous optimization techniques whose applicability mainly depends on the model’s characteristics and architecture.

A reasonable question that may pop up here is, “Where would you start optimizing?” We asked a similar question of a number of ML engineers during job interviews and received a variety of answers mentioning such terms as model pruning and quantization (see “Pruning and Quantization for Deep Neural Network Acceleration: A Survey”) and distillation (see “Knowledge Distillation: A Survey”), with references to state-of-the-art papers (we only mention a couple of surveys so you can use them as starting points). These techniques are widely known in the ML research community; they’re useful and often applicable, but they focus only on model optimization without letting us control the whole picture. The most pragmatic answer was, “I would start with profiling.”

15.4.1 Starting with profiling

Profiling is a process of measuring the performance of the system and identifying bottlenecks. In contrast to the techniques mentioned earlier, profiling is a more general approach that can be applied to the whole system. Just like strategy comes before tactics, profiling is a great starting point that allows for the identification of the most promising directions for optimization and the selection of the most appropriate techniques.

You might be surprised by how often the seemingly most obvious factors may not be the model’s weakest links. Let’s take latency as an example. There are cases when it is not the bottleneck (especially when the served model is not a recent generative thing but something more conventional), and the problem is hiding somewhere else (e.g., data preprocessing or network interactions). Furthermore, even if the model is the slowest part of the pipeline, it doesn’t automatically mean it should be the target for optimization. This may seem counterintuitive, but we should look for the most optimizable, not the slowest, part of the system. Imagine that a full run takes 200 ms, 120 ms of which is required for the model. But the fact is, the model is already optimized; it runs on GPU via a highly performant engine, and there is not much room for any more improvement. On the other hand, data preprocessing takes 80 ms, and it is an arbitrary Python code that can be optimized in many ways. In this case, it is better to start with data preprocessing, not the model.
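This kind of reasoning is exactly what a profiler makes visible. A minimal example using only the Python standard library follows; the toy pipeline and its sleep durations are hypothetical stand-ins for real preprocessing and model work:

```python
import cProfile
import io
import pstats
import time

def preprocess(data):       # stand-in for arbitrary Python preprocessing (80 ms in the text's example)
    time.sleep(0.03)
    return data

def run_model(features):    # stand-in for an already-optimized model call
    time.sleep(0.01)
    return features

profiler = cProfile.Profile()
profiler.enable()
run_model(preprocess("raw request"))
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())  # preprocess dominates cumulative time: a candidate to optimize first
```

The printed table ranks functions by cumulative time, which is usually enough to decide where optimization effort is best spent.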

Another example comes from Arseny’s experience; he was once asked to reduce the latency of a system. The system was a relatively simple pipeline that ran tens of simple models sequentially. The first idea to improve timing was to replace the sequential run with a batch inference. However, profiling proved the opposite: the inference itself was insignificant (around 5%) compared to the time spent on data preprocessing, and a batch inference would not help. What helped was optimizing the preprocessing step, which could not benefit from batching and was eventually coupled with the initially designed sequential run. In the end, Arseny managed to speed up the system by 40% without touching the core model inference, and such elements as thread management, IO and serialization/deserialization functions, and cascade caching were the real low-hanging fruit.

Here we should mention that profiling ML systems differs slightly from profiling regular software for reasons like extensive use of GPUs, asynchronous execution, using thin Python wrappers on top of high-performance native code, and so on. And since the whole process may include a larger number of variables, you should be careful when interpreting the results of profiling, as it is easy to get confused by the tricky nature of the problem.

GPU execution can be especially confusing. A typical CPU load is simple: the data is loaded into memory, the CPU executes the code, done. There may be nuances related to CPU caches, single instruction, multiple data (SIMD) instructions, or concurrent execution, but in most cases, it is straightforward. The GPU, however, is a separate device with its own built-in memory, and the data has to be copied to GPU memory before execution. The copying process itself can be a bottleneck, and it’s not always obvious how to measure it. The highly parallel nature of GPU execution also makes for nonlinear effects. The simplest example: running a model on a single image may take 100 ms, while running it on 64 images may increase the processing time only to 200 ms. That is because the GPU is not fully utilized when processing just one item, and fixed costs like copying the data to GPU memory make up a significant share of the total time. The same happens at a lower level of model architecture: reducing the number of filters in a convolutional layer may not reduce latency, as the GPU is not fully utilized and uses the same CUDA kernel under the hood. Overall, programming for CUDA and other general-purpose GPU frameworks is a separate and extremely deep rabbit hole; the only takeaway we would like to focus on is that the typical programmer’s intuition about what is fast and what is slow can be totally irrelevant for GPU-based computing.
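The nonlinearity is easy to quantify with the numbers from the example above:

```python
def throughput_per_s(batch_size, batch_latency_s):
    """Items processed per second at a given batch configuration."""
    return batch_size / batch_latency_s

# Numbers from the example: 1 image in 100 ms vs. 64 images in 200 ms
single = throughput_per_s(1, 0.100)    # 10 images/s
batched = throughput_per_s(64, 0.200)  # 320 images/s: 32x the throughput for 2x the latency
```

This is why per-item latency measured at batch size 1 says very little about a GPU system's capacity.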

So a proper approach to profiling requires keeping a wide mix of tools at hand: basic profilers like cProfile from the Python standard library, more advanced third-party tools like Scalene, memray, and py-spy, ML framework-specific tools (like PyTorch Profiler), and low-level GPU profilers like nvprof. Finally, when working with exotic hardware like tensor processing units or IoT processors, you may need vendor-specific tools.

After interpreting the profiling results, we can start optimizing the system, addressing the most visible bottlenecks. Typical approaches are arranged across various levels:

15.4.2 The best optimizing is minimum optimizing

If we step away from the operating level and look at optimization from the overall design perspective, some of the problems that arise during the maintenance stage can be avoided if the system has been given a thorough treatment in accordance with the original requirements during the design stage. If we are aware of strict latency requirements, we should initially choose a model that is fast enough for the target platform. Of course, we can reduce the memory footprint with quantization or pruning, or reduce latency with distillation, but it is better to start with a model close to the target requirements rather than trying to speed it up when the system is ready. From our experience, an urgent need for optimization is usually the result of poor choices at the beginning of the system’s life cycle (e.g., heavy models were used as a baseline), an unexpected success (a startup built a quick prototype and suddenly needs to scale), or planned tech debt (“Okay, we’ve built something really suboptimal but fast for now; if it survives and helps us find the product-market fit, we will clean it up”).

Choosing the level of optimization is a crucial decision for effective inference. A founding engineer in a startup who needs to extend their minimal viable product to the second customer should usually just scale via straightforward renting of additional cloud computing resources. On the other hand, a platform engineer in a big tech company would likely benefit from reading “Algorithms for Modern Hardware” () and applying low-level optimizations at scale.

15.5 Design document: Serving and inference

A separate section of the design document dedicated to inference optimization should cover the anticipated actions during the maintenance stage. Let’s examine the commonalities and differences in inference optimization for the two rather divergent ML systems.

15.5.1 Serving and inference for Supermegaretail

Based on the key features and requirements of retail-focused ML systems, the solution for Supermegaretail will not require real-time predictions, allowing the model to run in batches; still, it will involve a large scope of work.

15.5.2 Serving and inference for PhotoStock Inc.

Search engine optimization includes two main components:

Summary
