As we claimed earlier, the worst thing you can do is build a system, only to put it on a shelf instead of going live. Both of us have faced such problems at least once in our careers, and it is not an experience we recommend.
A rookie mistake would be to think that integration is a one-time event or a single phase of a project. That is an antipattern: you cannot just dedicate some weeks to future integration and start building a system in a vacuum. In reality, it is a continuous process that starts at the very beginning of the project and ends only when the system is decommissioned. Moreover, when the system’s life cycle comes to an end, it requires certain deintegration effort to make sure none of its direct or indirect users are affected by switching it off. Proper integration is key to the success of your system, making it much easier to get feedback on and improve. The smoother various elements are integrated into your system, the shorter the feedback loop and the faster the iterations you can implement.
In this chapter, we discuss how to efficiently integrate your system, with a focus on technical aspects.
API design is a crucial part of the integration process. It may be perceived as a contract between your system and its users, but it is a contract you need to read through thoroughly before signing. The catch is that your API design will be costly to change once it has been set up, even if it’s not set in stone and the system is still in development.
If you are a reader who is experienced in machine learning (ML), you may feel tempted to skip this section, simply because you know how to design APIs and have done it many times. That is a fair claim, as we are not going to teach you the difference between REST and RPC or how to design APIs in general. Besides, there is a myriad of great books and articles on this topic (e.g., a nice collection of recommended materials can be found at ). Instead, we will focus on the key aspects and highlight pitfalls specific to ML systems.
If we were to pick just two properties of a good API, we would choose simplicity and predictability. There is a classic software quote by Butler Lampson, which is even referred to as the “fundamental theorem of software engineering”: “We can solve any problem by introducing an extra level of indirection.”
A variation of this quote is that any programming problem can be solved with a layer of abstraction except for a problem of too many abstractions. So the simplicity of an API is the art of finding the right abstraction that is not leaky. An abstraction is considered leaky when it exposes too many underlying implementation details.
The vital role of simplicity lies in its ability to make an API easier to learn and use without a deep understanding of internals. A typical ML system often has many handlers and parameters, and it is always tempting to expose them to the external user. This leads to overcomplicated solutions where calling methods requires providing multiple parameters, and at the end of the day, it is hard to understand their meanings and how they are interconnected. A better approach implies hiding the complexity behind a simple interface and offering a few methods with a small number of parameters. Users will be grateful for that.
However, hiding all the parameters is not the best idea either (see figure 13.1). It is important to provide a way to customize the system’s behavior, especially for debugging purposes. Imagine yourself debugging a system that has a dozen parameters during a late-night on-call shift, and you cannot modify any of them. Not a pleasant experience! In these cases, it always makes sense to suggest reasonable defaults and provide a way to override them.
There is always a possibility of forcing deterministic behavior. One simple example would be taking a random seed from the parameters (and choosing your own seed if none is specified). By the way, while many ML libraries take random seeds from a global state, this is not possible in JAX, a numerical computing library by Google. Its design requires you to pass the random state explicitly for exactly this reason: to force full reproducibility. See for more information.
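To illustrate the same principle without JAX, here is a minimal sketch in NumPy; the function and its noise model are ours, not from any library. Passing the random state explicitly, instead of relying on the global np.random state, makes every call reproducible:

```python
import numpy as np

def predict_noisy(features, rng=None):
    # Toy "model" whose output depends on randomness. Passing the
    # generator explicitly (instead of touching the global np.random
    # state) makes the call fully reproducible.
    if rng is None:
        rng = np.random.default_rng(42)  # reasonable default seed
    noise = rng.normal(scale=0.01, size=len(features))
    return np.asarray(features) + noise

# Same explicit state in, same output out, call after call:
a = predict_noisy([1.0, 2.0], rng=np.random.default_rng(7))
b = predict_noisy([1.0, 2.0], rng=np.random.default_rng(7))
```

A caller who omits `rng` still gets deterministic behavior thanks to the default seed, while a caller who passes a generator controls the randomness fully.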
Another source of nondeterminism would be input data, including some implicit data like the current time (it should be provided via parameters as well).
Let’s look at two implementations of a predict function that uses time as input and returns somewhat stochastic results depending on the random state. The following listing has only one explicit argument, while its output depends on three parameters, meaning that the output is not reproducible.
predict function

from datetime import datetime

def predict(features):
    time = datetime.now()
    return model.predict(features, time, seed=42)
Unlike the previous example, the function caller shown in the following listing controls all the parameters affecting the output.
predict function

from datetime import datetime

def predict(features, time=None, seed=42):
    if time is None:
        time = datetime.now()
    return model.predict(features, time, seed)
Note While we use Python in these examples, the fundamentals for higher-level abstractions remain unchanged. Imagine that this function is part of an HTTP API where time and seed are newly introduced query parameters. In this case, the same principles will be applied.
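As a sketch of that HTTP scenario, parsing optional `time` and `seed` query parameters with reasonable defaults might look like the following; the helper name and parameter handling are hypothetical, not a prescribed API:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

def parse_predict_params(query_string: str) -> dict:
    # Hypothetical HTTP-layer helper: `time` and `seed` are optional
    # query parameters; missing values fall back to sensible defaults,
    # keeping the endpoint both simple and fully controllable.
    params = parse_qs(query_string)
    seed = int(params.get("seed", ["42"])[0])
    raw_time = params.get("time", [None])[0]
    time = (datetime.fromisoformat(raw_time) if raw_time
            else datetime.now(timezone.utc))
    return {"seed": seed, "time": time}
```

A caller who supplies both parameters gets a fully reproducible request, while a casual caller gets working defaults.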
A specific aspect of predictability is compatibility. When talking about compatibility, engineers usually imply either backward compatibility or forward compatibility. Backward compatibility means that the new version of the API is compatible with the old one (the old code can be used with the new version of the API without any changes). Forward compatibility implies that the old version of the API is compatible with the new one (the new code can be used with the old version of the API without any changes).
In the context of ML systems, compatibility is also related to versioning the underlying model. There is a common practice to version the model and provide a way to request a specific version of the model during initialization.
from datetime import datetime

class Model:
    def __init__(self, version):
        self.version = version
        self.model = load_model(version)

    def predict(self, features, time=None, seed=42):
        if time is None:
            time = datetime.now()
        return self.model.predict(features, time, seed)
This example is oversimplified, and you may need a much more advanced solution for a complicated system. Read materials about the model registry pattern (e.g., ) if you want to learn more.
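For illustration only, a minimal in-memory registry might look like the following sketch; real systems typically rely on a dedicated registry service, and all names here are ours:

```python
class ModelRegistry:
    # Minimal in-memory sketch of the model registry pattern: map
    # version strings to loader functions and resolve "latest" on demand.
    def __init__(self):
        self._loaders = {}

    def register(self, version: str, loader):
        self._loaders[version] = loader

    def load(self, version: str = "latest"):
        if version == "latest":
            version = max(self._loaders)  # assumes sortable version strings
        return self._loaders[version]()
```

Resolving `"latest"` by lexicographic maximum is a simplification; a production registry would track stages (staging, production) and store artifacts durably.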
Versioning is tricky. One antipattern is updating the model without bumping the version, which changes the behavior of the system without any notification. There are many scenarios in which updating the model is a breaking change, and it must be reflected in the version. Moreover, this applies not only to the model but to any aspect of the pipeline: data input/output (IO), preprocessing, postprocessing, etc. Some changes are not even intended: you can update the dependencies and get a different result, thus breaking compatibility implicitly.
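One way to catch such implicit breaks is to fingerprint everything that can change behavior: model weights, preprocessing config, and dependency versions. A sketch, with field names of our own choosing:

```python
import hashlib
import json

def pipeline_fingerprint(weights_digest: str, preprocessing_cfg: dict,
                         deps: dict) -> str:
    # Hash every behavior-affecting ingredient of the pipeline, so a
    # silent change (e.g., a bumped dependency) yields a new fingerprint
    # that can be compared against the released version on CI.
    blob = json.dumps(
        {"weights": weights_digest,
         "preprocessing": preprocessing_cfg,
         "deps": sorted(deps.items())},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

If the fingerprint changes while the declared version does not, the release pipeline can refuse to ship until the version is bumped explicitly.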
While Arseny’s experience demonstrates the challenges of maintaining compatibility within a single system, the following story from Valerii highlights how versioning problems become an even bigger difficulty when versioning breaks not within a system but at the point of interaction with a third-party solution.
Nothing helps to catch things like that better than a proper set of tests running on continuous integration (CI).
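A typical CI guard for this is a golden-output regression test: a small fixed input set with pinned expected predictions, so any behavioral drift fails the build. A sketch with made-up inputs and a stand-in model:

```python
def check_golden_outputs(predict_fn, golden, tol=1e-6):
    # Compare current predictions against pinned ("golden") outputs;
    # any drift fails CI and forces an explicit, reviewed version bump.
    for features, expected in golden.items():
        got = predict_fn(list(features))
        assert abs(got - expected) < tol, (features, got, expected)

# Hypothetical usage; `sum` stands in for a real model here:
check_golden_outputs(sum, {(1.0, 2.0): 3.0, (0.5, 0.25): 0.75})
```

The golden set should be small and stable; it complements, rather than replaces, the statistical evaluation of model quality.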
Over the years, the industry has developed multiple practices for working with APIs. We mention a few that we consider efficient. They may not be necessarily specific to ML systems, but they are often relevant to them (see figure 13.2):
The release cycle of an ML system is usually similar to that of regular software. However, there are two main differences:
Let’s elaborate on these points.
Due to testing complexity, running tests alone is not always enough, and regressions may occur anyway. Let’s talk about software for a moment. Once regular software is updated, it is usually enough to run tests to ensure everything works as expected. But if we are talking about ML models, the situation is quite different, as many improvements come with a price. When the ML model is updated, even if we have a representative dataset for tests and good test coverage of the software-related parts, we still cannot guarantee that the changes won’t provoke harmful outcomes. For example, say there are 100 samples in the final test set, and the new model improves performance on three samples that were previously labeled as errors but introduces two new errors on other samples. The overall performance is better, but is the change good enough to be released?
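The tradeoff in this example is easy to quantify, even though the decision itself still needs a human. A sketch comparing the error sets of two model versions (sample ids are made up):

```python
def compare_model_errors(old_errors: set, new_errors: set) -> dict:
    # Summarize what a candidate release fixes vs. what it breaks.
    fixed = old_errors - new_errors
    introduced = new_errors - old_errors
    return {"fixed": len(fixed), "introduced": len(introduced),
            "net_gain": len(fixed) - len(introduced)}

# The numbers from the text: 3 errors fixed, 2 introduced, net gain of 1.
# Whether the 2 new errors are acceptable is still a judgment call.
report = compare_model_errors({"s1", "s2", "s3"}, {"s4", "s5"})
```

Such a report is a useful artifact to attach to a release review: it turns "overall metric went up" into a concrete list of wins and regressions.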
Real-life scenarios are full of similar examples, and this means that many releases require a human in the loop. It is similar to testing user experience changes, where an employee should check if the changes bring sufficient benefits. And evaluating the tradeoffs between improvements and regressions is usually more complicated than just checking if the UI animation works as expected. The human-in-the-loop method may vary depending on the system. In some cases, an ML engineer is responsible; in others, it can be a product manager or an external domain expert, or the job can even be delegated to crowd-sourcing platforms providing a large pool of users with aggregated feedback.
There are simpler cases where some kind of AutoML allows for the model (not the whole system—just the model) to be released automatically without additional reviewing. Imagine you are building an advanced text editor with a powerful feature: an autocomplete that mimics the author’s style (oh, we wish we had one while writing this book!). This software needs to run a custom model (probably on top of a large foundational model) for each user. Gathering new users’ writings and regularly updating the model without human involvement seems like a good practice here; otherwise, it is not scalable. Such scenarios need an even more paranoid level of testing to cover as many pessimistic paths as possible and disable model updates once there is a chance of a malfunction.
Okay, let’s assume testing is not a problem. Say you have found a bug in the preprocessing code; you are confident fixing it is a good solution that won’t make things worse, and an advanced testing toolset can help you make sure that is really the case. For regular software, it would mean that you can just fix the bug, run the tests, and deploy a new version; usually, it does not take too long. But it will not work this way for ML systems because you need to retrain the model. We will not use extra-large-scale examples here, but even based on our experience, models that were trained for weeks are not uncommon, so releasing the fix the next day is not possible at all.
In practice, it means the release cycle for ML systems is usually longer than for regular software; still, it may vary a lot—from multiple times a day to once a year with multiple variations in between. Make sure to take this into account when designing your system. A long release cycle implies that the system should be more reliable and tested extensively, while a short release cycle allows it to be more agile and allows for more experimentation.
Multicomponent systems can have different release cycles for different components. Imagine a simple search engine that contains four components: an index containing the documents to be searched across, a lightweight filter for preliminary selection, a heavy model for the final ranking, and an API layer to expose the search results. Documents may be added to the index all the time (with no release required), the index codebase is rarely touched, and the filter and API layers are pure non-ML software that is easier to test and build, so they can be released more frequently. The ranking model, in turn, is trained and released biweekly with additional verification. This allows us to be more agile with the non-ML components and more stable with the ML component.
There is a family of release-related techniques, including blue-green deployment (see figure 13.3) and canary deployment (see figure 13.4). They may slightly differ, but the core idea behind them is to have two or more systems in production at the same time. New users are sent to a new version once it is deployed while the old version is still functioning. In the blue-green deployment, changes are discrete (all users are switched to either the blue or green version), while in canary deployment, the rollout is granular, and the new version is used for a small portion of the existing users. It allows us to test the new version in production and roll back more easily if something goes wrong. It is not specific to ML systems, but it can be applied for them as well. Another technique ML systems can use is decomposition (e.g., some components like a model can be released with canary deployment; in contrast, other components like API layers can be released in a more traditional way).
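A common way to implement the granular rollout in a canary deployment is deterministic per-user bucketing, so the same user consistently sees the same version while the fraction ramps up. A sketch; the hashing scheme is one possible choice, not a standard:

```python
import hashlib

def routed_to_canary(user_id: str, canary_fraction: float) -> bool:
    # Hash the user id into one of 10,000 stable buckets; a user is on
    # the canary if their bucket falls below the current rollout
    # fraction. Raising the fraction only ever adds users, never
    # reshuffles them.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket / 10_000 < canary_fraction
```

Because routing depends only on the user id and the fraction, rollbacks are trivial: set the fraction to zero and everyone is back on the stable version.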
Canary deployment should not be confused with the A/B tests we discussed in chapter 12. While technically they may look like similar concepts (multiple instances of a system are live, and traffic is sent toward them in a proper split), their intent is different. A/B tests are used to evaluate the performance of different versions of the system, while canary deployment is used to test the new version of the system before switching to it completely. In A/B tests, we want to compare the performance of the system across different versions, and the time of the test is decided based on statistical significance; it is fine to end up keeping either option A or option B. With canary deployment, we want to make sure the new version is good enough to be used by all users, and we aim to switch to it completely as soon as possible.
While large enterprises usually tend to have stricter policies and longer release cycles, this is not always the case: aiming to reduce the time gap between iterations using both technical and organizational approaches is a noble goal. It is also an important aspect of the DevOps culture, and ML systems are not an exception here. If you’re interested in this topic, we recommend checking out the book The Phoenix Project. It is not about ML systems, and it’s not even a very technical book (more like a “business fable”), but it is a great read about the DevOps culture and how it can be applied in the real world.
Startups and mature big-tech companies are usually more agile and have shorter release cycles, but that can vary as well. Arseny once worked for a startup where deploying late on Friday night was a common practice (sometimes it led to outages that had to be solved by engineers who were already enjoying a solid pint of beer). In the other, more established startup, the release cycle was very flexible; every engineer could deploy their component at any time, but a simple guardrail warned before deployment if the time was imperfect (e.g., Friday after lunch).
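A guardrail like that can be tiny. Here is a sketch; the risky-time rules are illustrative, not a recommendation for any particular schedule:

```python
from datetime import datetime

def deploy_warning(now: datetime):
    # Warn (don't block) when the deploy time looks risky; the exact
    # rules below are placeholders for whatever a team agrees on.
    if now.weekday() == 4 and now.hour >= 13:  # Friday after lunch
        return "It's Friday afternoon: are you sure you want to deploy?"
    if now.hour < 9 or now.hour >= 19:
        return "Off-hours deploy: on-call coverage may be thin."
    return None
```

The important design choice is that this is a warning, not a hard block: engineers keep the ability to ship an urgent fix, but the default path nudges them away from risky timing.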
Those who are blessed to schedule releases should be aware of all the dependencies: how the system influences other systems and how others can affect it. The biggest outages usually happen either on the infrastructure level or between the systems or components that are not owned by the same team. Having proper communication with other teams to avoid such problems is a must-have skill for any senior engineer. Unfortunately, this skill and the understanding of its importance often come at a certain cost (usually after a big outage).
Arseny’s biggest outage was related to a logger configuration (not something you expect to care much about while building an ML system). Some ML-related load happened in threads, and when aiming to track the behavior, he overengineered a complicated logger to keep requests’ IDs across threads. It worked fine in one environment and was later deployed to another, where engineers had way less control. It was a dark moment when the defect revealed itself: the problem could only happen after 1,000 requests of a certain type that had never occurred in previous environments. It took a while to understand the root cause and roll back the new version, so the incident became a good lesson that led to more checks in the release process.
It is never enough to build the system and integrate it directly with other components that need its output in terms of product logic. Any system requires additional connections for healthy operations, both tech and non-tech related. Some are required to smooth the maintenance and operations (from both the engineering and product perspective); others are caused by implicit nonfunctional requirements related to the system (such as legal or privacy concerns). Let’s name a few.
CI is usually the first element of the whole infrastructure to be set up. It helps identify and resolve integration problems while facilitating smoother and faster development processes. Typical tasks for CI are running tests (unit or integration) and building artifacts that will be used downstream (e.g., for further deployment). However, there may be other needs covered on the CI level, such as security testing, performance testing, cost analysis (“Does this release require us to spin out more cloud servers?”), deployment to test environments, linting the code style, reporting key metrics, and many more.
Two main things that are not part of the system’s data but need to be stored are logs and metrics. Usually, there is a common approach to how logs and metrics are stored, aggregated, and monitored in a company, and you just need to follow the common way. We’ll elaborate a bit more on this topic in the next chapter.
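For instance, predictions are often logged as structured one-line JSON records so the company-wide pipeline can aggregate them and derive metrics later. A sketch with illustrative field names:

```python
import json
import time

def prediction_log_line(request_id: str, model_version: str,
                        latency_ms: float, ok: bool) -> str:
    # One JSON object per prediction: trivial to ship to any log
    # aggregation stack, and the model version in every record makes
    # release-related regressions easy to slice out.
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "latency_ms": latency_ms,
        "ok": ok,
    })

line = prediction_log_line("req-123", "1.4.2", 35.7, True)
```

Including the model version in every log line pays off the first time you need to answer "did latency regress after the last model release?"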
The system is likely to malfunction at some point, and that should not come as a surprise. Thus, the system should be connected to the alerting and incident management platform used in the company, so the person on call is aware of a potential incident and can react appropriately.
But how do they react, exactly? Here you may need to prepare specific cookbooks describing what expected failure modes are and how to approach them. Also, there may be an additional toolset to help with firefighting, like an admin panel for configuration, system-specific dashboards, and so on, which we will cover in chapter 16. Again, usually there is a company standard shared between ML and non-ML systems, so it is very likely that you will only need to adapt the software toolset already in use without reinventing the wheel.
Designing a system requires considering the whole life cycle of the system, not just the happy path, and being a little paranoid helps.
Other than purely technical aspects of operations, there are some non-tech aspects that should be considered. They are often related to customer success or compliance, and they are not always obvious. What should we do if the user wants all their personal data to be purged as allowed by the General Data Protection Regulation (recall chapter 11 and imagine how deeply their data can propagate)? Is there a regulation forcing the model to be explainable, and what is the best way to follow it without sacrificing the model’s performance? What if a high-level executive or a startup investor faces a bug in the system and gets mad? How do we debug the system’s behavior in hard-to-reproduce scenarios (e.g., a defect can be only reproduced by an aforementioned executive’s user account)? All these questions should be answered before the system is released, and the answers should be at least briefly reflected at the design stage. Otherwise, the changes may be too expensive to implement later. Often it requires building additional components, like some user impersonation mechanism or an admin panel for data management or model explainability, and it may take a lot of project time and require help from other teams (e.g., Legal or Compliance to understand the regulations or Web Development to build the required dashboard).
From our experience, all these additional connections and considerations usually take more time than the core system itself, and the bigger the company, the more effort it takes. Given current trends, it’s not likely to improve in the near future as even more regulations related to ML and privacy are being applied globally.
Your system may have a legit reason to fail. An example of such a reason could be an external dependency: you pull out a chunk of data from a third-party API, and at some point, it’s just not available. That is one of the situations where you may want to have a fallback solution.
A fallback is a backup plan or an alternative solution that can be used when the primary plan or solution fails or is not available. We use it to ensure that the system can still function and make decisions even if the primary ML model has failed for whatever reason.
This can be particularly important in systems used for critical tasks or in industries where even the shortest downtime can lead to significant consequences. For example, a fallback can be crucial for a model used to predict equipment failures in a manufacturing setting, ensuring that production can continue even if the primary model experiences problems.
Another reason to use a fallback is to provide an alternative solution when the primary system is unable to provide a satisfactory answer or a confident prediction or when the primary model’s output is outside the acceptable range.
There are quite a few different approaches that can be used for implementing a fallback. One common approach is to use a secondary ML model, which can be trained on a different set of data or using a different algorithm. It might be a simpler baseline solution as we reviewed in chapter 8 or a dual-model setup. In this setup, the first model is built using only stable features, while the second model is used to correct the output of the main model using a larger feature set. The models can be used together, with the output of one model chosen over the other based on the input data or a predetermined rule. Alternatively, input feature drift monitoring (see chapter 14) can be set up for the “core” model to detect crucial shifts.
Another option is to use a rule-based system as a fallback, which can provide a stable and predictable response when the model is unavailable or is performing poorly. It is also possible to use a combination of these approaches, such as using a rule-based system to handle simple cases and the ML model for more complex scenarios (however, this alone introduces additional complexity and breakpoints).
As with baselines, a simple constant can be our fallback as well. Finally, sometimes a fallback solution is to reply with an explicit error message.
A fallback solution should always have a plan in place for activating the fallback and switching between the ML model and a fallback system. It can be either automatic (triggered by a monitoring event), manual, or hybrid, depending on the use case.
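The approaches above (a secondary model, a rule-based backup, a constant, plus an automatic trigger) can be combined in a small sketch; all names and the plausibility check are ours, standing in for domain-specific logic:

```python
def predict_with_fallback(features, primary, fallback,
                          is_plausible=lambda y: 0.0 <= y <= 1.0):
    # Try the primary model first; switch to the fallback when the
    # primary fails outright or returns an implausible value. The
    # [0, 1] range is an illustrative placeholder for a real check.
    try:
        prediction = primary(features)
        if is_plausible(prediction):
            return prediction, "primary"
    except Exception:
        pass  # swallow the failure and use the backup path
    return fallback(features), "fallback"
```

Returning the source of the prediction along with the value makes it easy to monitor how often the fallback fires, which is itself a useful health signal.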
One custom type of fallback is an override. It is a way to manually override the model’s output when it is signaling a bad prediction. One example may be dropping the model’s output and using a constant instead when the model’s prediction is beyond the acceptable range or when the model’s confidence is too low. Another reason to use an override is related to a release cycle. For example, a customer complains about the model failing in a very specific scenario. Ideally, we need to ensure this scenario is represented in the training data, retrain the model, run all the checks, and deploy it. But, as we discussed earlier, it may take a while. So we can override the model’s output for this particular scenario using a rule-based approach, keep the customer happy, and address it properly in the following release.
The downside of overrides is that they are not transparent and can be easily forgotten. For this reason, it is important to have a way to track them and a plan for addressing them properly; otherwise, they may turn into technical debt. The positive side effect of having many overrides is that collections of overrides can be used to improve the model via multisource weak supervision, a technique in which unlabeled data is labeled with “labeling functions” (heuristics that are not perfect but are cheap to implement). The labeling functions produce a noisy dataset that becomes a foundation for model training. More details on this technique can be found in papers by Alexander Ratner et al., “Data Programming: Creating Large Training Sets, Quickly” () and “Training Complex Models with Multi-Task Weak Supervision” (). The concept of multisource weak supervision has gained recognition in the industry thanks to the popular library named Snorkel ().
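To keep overrides from being forgotten, it helps to store them with an owner and a reason so they can be revisited. A minimal sketch; the structure and names are ours:

```python
class OverrideStore:
    # Track every manual override with enough metadata to revisit it
    # later, instead of letting it rot into technical debt.
    def __init__(self):
        self._rules = {}

    def add(self, scenario_key, forced_output, reason, owner):
        self._rules[scenario_key] = {
            "output": forced_output, "reason": reason, "owner": owner,
        }

    def apply(self, scenario_key, model_output):
        # Forced output wins when a rule exists; otherwise the model's
        # own prediction passes through untouched.
        rule = self._rules.get(scenario_key)
        return rule["output"] if rule else model_output

    def audit(self):
        # Everything listed here should eventually be fixed in the
        # model itself and then removed.
        return dict(self._rules)
```

A periodic review of `audit()` output is a cheap way to make sure overrides get retired, and the accumulated rules double as candidate labeling functions for weak supervision.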
Arseny’s colleague once implemented a more elegant solution that helped address the limits of a long release cycle. He worked on a named entity recognition problem, and sometimes the model was not able to recognize some entities that were important to customers. However, training a new model after the problem had been found was not an option because of the long training time. So he implemented a solution based on the knowledge base: before running the model, the input text was checked against the knowledge base, and if the possible entity was found there, it was used as a hint for the model. It allowed the team to fix the problem without retraining. Adding a sample to the knowledge base could be done in a minute by a nontechnical person, so the problem could be addressed promptly. The solution is described in more detail in a blog post (). This case was somewhat similar to overrides, but it augmented the model’s inputs, not outputs.
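The hint mechanism can be sketched like this; the substring matching is a deliberate simplification of whatever lookup the real system used:

```python
def recognize_with_hints(text, ner_model, knowledge_base):
    # Look up known entities in the input before calling the model and
    # pass any matches as hints, so a nontechnical person can "teach"
    # the system a new entity without retraining. The naive substring
    # match here stands in for a proper lookup.
    hints = [entity for entity in knowledge_base
             if entity.lower() in text.lower()]
    return ner_model(text, hints=hints)
```

The key property is the division of labor: the knowledge base is updated in minutes by anyone, while the model itself is retrained on its usual slow cycle.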
Approaches to integration for both Supermegaretail and PhotoStock Inc. aim at creating user-friendly, fast, and efficient mechanisms for either inventory predictions or search results.
Supermegaretail’s integration strategy is tailored to offer a seamless, dynamic, and highly responsive prediction system that helps manage inventory.
The integration strategy for PhotoStock Inc. is focused on providing the most relevant search results regardless of the complexity of search queries while maintaining prompt responses.