As we claimed earlier, the worst thing you can do is build a system, only to put it on a shelf instead of going live. Both of us have faced such problems at least once in our careers, and it is not an experience we recommend.
A rookie mistake would be to think that integration is a one-time event or a single phase of a project. That is an antipattern: you cannot just dedicate some weeks to future integration and start building a system in a vacuum. In reality, it is a continuous process that starts at the very beginning of the project and ends only when the system is decommissioned. Moreover, when the system’s life cycle comes to an end, it requires certain deintegration effort to make sure none of its direct or indirect users are affected by switching it off. Proper integration is key to the success of your system, making it much easier to get feedback on and improve. The smoother various elements are integrated into your system, the shorter the feedback loop and the faster the iterations you can implement.
In this chapter, we discuss how to efficiently integrate your system, with a focus on technical aspects.
API design is a crucial part of the integration process. It may be perceived as a contract between your system and its users, but it is a contract you need to read through thoroughly before signing. The catch is that your API design will be costly to change once it has been set up, even if it’s not set in stone and the system is still in development.
If you are a reader who is experienced in machine learning (ML), you may feel tempted to skip this section, simply because you know how to design APIs and have done it many times. That is a fair claim, as we are not going to teach you the difference between REST and RPC or how to design APIs in general. Besides, there is a myriad of great books and articles on this topic (e.g., a nice collection of recommended materials can be found at ). Instead, we will focus on the key aspects and highlight pitfalls specific to ML systems.
If we were to pick just two properties of a good API, we would choose simplicity and predictability. There is a classic software quote by Butler Lampson, which is even referred to as the “fundamental theorem of software engineering”: “We can solve any problem by introducing an extra level of indirection.”
A variation of this quote is that any programming problem can be solved with a layer of abstraction except for a problem of too many abstractions. So the simplicity of an API is the art of finding the right abstraction that is not leaky. An abstraction is considered leaky when it exposes too many underlying implementation details.
The vital role of simplicity lies in its ability to make an API easier to learn and use without a deep understanding of internals. A typical ML system often has many handlers and parameters, and it is always tempting to expose them to the external user. This leads to overcomplicated solutions where calling methods requires providing multiple parameters, and at the end of the day, it is hard to understand their meanings and how they are interconnected. A better approach implies hiding the complexity behind a simple interface and offering a few methods with a small number of parameters. Users will be grateful for that.
However, hiding all the parameters is not the best idea either (see figure 13.1). It is important to provide a way to customize the system’s behavior, especially for debugging purposes. Imagine yourself debugging a system that has a dozen parameters during a late-night on-call shift, and you cannot modify any of them. Not a pleasant experience! In these cases, it always makes sense to suggest reasonable defaults and provide a way to override them.
There is always a possibility of forcing deterministic behavior. One simple example would be taking a random seed from the parameters (and choosing your own seed if none is specified). By the way, while many ML libraries take random seeds from a global state, this is not possible in JAX, a numerical computing library by Google. Its design requires you to pass the random state explicitly for exactly this reason: to force full reproducibility. See for more information.
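To illustrate the same principle without JAX, here is a minimal sketch in NumPy; the function and its noise model are ours, not from any library. Passing the random state explicitly, instead of relying on the global np.random state, makes every call reproducible:

```python
import numpy as np

def predict_noisy(features, rng=None):
    # Toy "model" whose output depends on randomness. Passing the
    # generator explicitly (instead of touching the global np.random
    # state) makes the call fully reproducible.
    if rng is None:
        rng = np.random.default_rng(42)  # reasonable default seed
    noise = rng.normal(scale=0.01, size=len(features))
    return np.asarray(features) + noise

# Same explicit state in, same output out, call after call:
a = predict_noisy([1.0, 2.0], rng=np.random.default_rng(7))
b = predict_noisy([1.0, 2.0], rng=np.random.default_rng(7))
```

A caller who omits `rng` still gets deterministic behavior thanks to the default seed, while a caller who passes a generator controls the randomness fully.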
Another source of nondeterminism would be input data, including some implicit data like the current time (it should be provided via parameters as well).
Let’s look at two implementations of a predict function that uses time as input and returns somewhat stochastic results depending on the random state. The following listing has only one explicit argument, while its output depends on three parameters, meaning that the output is not reproducible.
predict function

from datetime import datetime

def predict(features):
    time = datetime.now()
    return model.predict(features, time, seed=42)
Unlike the previous example, the function caller shown in the following listing controls all the parameters affecting the output.
predict function

from datetime import datetime

def predict(features, time=None, seed=42):
    if time is None:
        time = datetime.now()
    return model.predict(features, time, seed)
Note While we use Python in these examples, the fundamentals for higher-level abstractions remain unchanged. Imagine that this function is part of an HTTP API where time and seed are newly introduced query parameters. In this case, the same principles will be applied.
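As a sketch of that HTTP scenario, parsing optional `time` and `seed` query parameters with reasonable defaults might look like the following; the helper name and parameter handling are hypothetical, not a prescribed API:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

def parse_predict_params(query_string: str) -> dict:
    # Hypothetical HTTP-layer helper: `time` and `seed` are optional
    # query parameters; missing values fall back to sensible defaults,
    # keeping the endpoint both simple and fully controllable.
    params = parse_qs(query_string)
    seed = int(params.get("seed", ["42"])[0])
    raw_time = params.get("time", [None])[0]
    time = (datetime.fromisoformat(raw_time) if raw_time
            else datetime.now(timezone.utc))
    return {"seed": seed, "time": time}
```

A caller who supplies both parameters gets a fully reproducible request, while a casual caller gets working defaults.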
A specific aspect of predictability is compatibility. When talking about compatibility, engineers usually imply either backward compatibility or forward compatibility. Backward compatibility means that the new version of the API is compatible with the old one (the old code can be used with the new version of the API without any changes). Forward compatibility implies that the old version of the API is compatible with the new one (the new code can be used with the old version of the API without any changes).
In the context of ML systems, compatibility is also related to versioning the underlying model. There is a common practice to version the model and provide a way to request a specific version of the model during initialization.
from datetime import datetime

class Model:
    def __init__(self, version):
        self.version = version
        self.model = load_model(version)

    def predict(self, features, time=None, seed=42):
        if time is None:
            time = datetime.now()
        return self.model.predict(features, time, seed)
This example is oversimplified, and you may need a much more advanced solution for a complicated system. Read materials about the model registry pattern (e.g., ) if you want to learn more.
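For illustration only, a minimal in-memory registry might look like the following sketch; real systems typically rely on a dedicated registry service, and all names here are ours:

```python
class ModelRegistry:
    # Minimal in-memory sketch of the model registry pattern: map
    # version strings to loader functions and resolve "latest" on demand.
    def __init__(self):
        self._loaders = {}

    def register(self, version: str, loader):
        self._loaders[version] = loader

    def load(self, version: str = "latest"):
        if version == "latest":
            version = max(self._loaders)  # assumes sortable version strings
        return self._loaders[version]()
```

Resolving `"latest"` by lexicographic maximum is a simplification; a production registry would track stages (staging, production) and store artifacts durably.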
Versioning is tricky. One antipattern is updating the model without bumping the version, which changes the behavior of the system without any notification. There are many scenarios in which updating the model is a breaking change, and it must be reflected in the version. Moreover, this applies not only to the model but to any aspect of the pipeline: data input/output (IO), preprocessing, postprocessing, etc. Some changes are not even intended: you can update the dependencies and get a different result, thus breaking compatibility implicitly.
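One way to catch such implicit breaks is to fingerprint everything that can change behavior: model weights, preprocessing config, and dependency versions. A sketch, with field names of our own choosing:

```python
import hashlib
import json

def pipeline_fingerprint(weights_digest: str, preprocessing_cfg: dict,
                         deps: dict) -> str:
    # Hash every behavior-affecting ingredient of the pipeline, so a
    # silent change (e.g., a bumped dependency) yields a new fingerprint
    # that can be compared against the released version on CI.
    blob = json.dumps(
        {"weights": weights_digest,
         "preprocessing": preprocessing_cfg,
         "deps": sorted(deps.items())},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

If the fingerprint changes while the declared version does not, the release pipeline can refuse to ship until the version is bumped explicitly.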
While Arseny’s experience demonstrates the challenges of maintaining compatibility within a single system, the following story from Valerii highlights how versioning problems become an even bigger difficulty when versioning breaks not within a system but at the point of interaction with a third-party solution.
Nothing helps to catch things like that better than a proper set of tests running on continuous integration (CI).
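A typical CI guard for this is a golden-output regression test: a small fixed input set with pinned expected predictions, so any behavioral drift fails the build. A sketch with made-up inputs and a stand-in model:

```python
def check_golden_outputs(predict_fn, golden, tol=1e-6):
    # Compare current predictions against pinned ("golden") outputs;
    # any drift fails CI and forces an explicit, reviewed version bump.
    for features, expected in golden.items():
        got = predict_fn(list(features))
        assert abs(got - expected) < tol, (features, got, expected)

# Hypothetical usage; `sum` stands in for a real model here:
check_golden_outputs(sum, {(1.0, 2.0): 3.0, (0.5, 0.25): 0.75})
```

The golden set should be small and stable; it complements, rather than replaces, the statistical evaluation of model quality.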
Over the years, the industry has developed multiple practices for working with APIs. We mention a few that we consider efficient. They may not be necessarily specific to ML systems, but they are often relevant to them (see figure 13.2):
The release cycle of an ML system is usually similar to that of regular software. However, there are two main differences:
Let’s elaborate on these points.
Due to testing complexity, running tests alone is not always enough, and regressions may occur anyway. Let’s talk about software for a moment. Once regular software is updated, it is usually enough to run tests to ensure everything works as expected. But if we are talking about ML models, the situation is quite different, as many improvements come with a price. When the ML model is updated, even if we have a representative dataset for tests and good test coverage of the software-related parts, we still cannot guarantee that the changes won’t provoke harmful outcomes. For example, say there are 100 samples in the final test set, and the new model improves performance on three samples that were previously labeled as errors but introduces two new errors on other samples. The overall performance is better, but is the change good enough to be released?
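The tradeoff in this example is easy to quantify, even though the decision itself still needs a human. A sketch comparing the error sets of two model versions (sample ids are made up):

```python
def compare_model_errors(old_errors: set, new_errors: set) -> dict:
    # Summarize what a candidate release fixes vs. what it breaks.
    fixed = old_errors - new_errors
    introduced = new_errors - old_errors
    return {"fixed": len(fixed), "introduced": len(introduced),
            "net_gain": len(fixed) - len(introduced)}

# The numbers from the text: 3 errors fixed, 2 introduced, net gain of 1.
# Whether the 2 new errors are acceptable is still a judgment call.
report = compare_model_errors({"s1", "s2", "s3"}, {"s4", "s5"})
```

Such a report is a useful artifact to attach to a release review: it turns "overall metric went up" into a concrete list of wins and regressions.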
Real-life scenarios are full of similar examples, and this means that many releases require a human in the loop. It is similar to testing user experience changes, where an employee should check if the changes bring sufficient benefits. And evaluating the tradeoffs between improvements and regressions is usually more complicated than just checking if the UI animation works as expected. The human-in-the-loop method may vary depending on the system. In some cases, an ML engineer is responsible; in others, it can be a product manager or an external domain expert, or the job can even be delegated to crowd-sourcing platforms providing a large pool of users with aggregated feedback.
There are simpler cases where some kind of AutoML allows for the model (not the whole system—just the model) to be released automatically without additional reviewing. Imagine you are building an advanced text editor with a powerful feature: an autocomplete that mimics the author’s style (oh, we wish we had one while writing this book!). This software needs to run a custom model (probably on top of a large foundational model) for each user. Gathering new users’ writings and regularly updating the model without human involvement seems like a good practice here; otherwise, it is not scalable. Such scenarios need an even more paranoid level of testing to cover as many pessimistic paths as possible and disable model updates once there is a chance of a malfunction.
Okay, let’s assume testing is not a problem. Say you have found a bug in the preprocessing code; you are confident fixing it is a good solution that won’t make things worse, and an advanced testing toolset can help you make sure that is really the case. For regular software, it would mean that you can just fix the bug, run the tests, and deploy a new version; usually, it does not take too long. But it will not work this way for ML systems because you need to retrain the model. We will not use extra-large-scale examples here, but even based on our experience, models that were trained for weeks are not uncommon, so releasing the fix the next day is not possible at all.
In practice, it means the release cycle for ML systems is usually longer than for regular software; still, it may vary a lot—from multiple times a day to once a year with multiple variations in between. Make sure to take this into account when designing your system. A long release cycle implies that the system should be more reliable and tested extensively, while a short release cycle allows it to be more agile and allows for more experimentation.
Multicomponent systems can have different release cycles for different components. Imagine a simple search engine that contains four components: an index containing the documents to be searched across, a lightweight filter for preliminary selection, a heavy model for the final ranking, and an API layer to expose the search results. Documents may be added to the index all the time (with no release required), the index codebase is rarely touched, and the filter and API layers are pure non-ML software that is easier to test and build, so they can be released more frequently. The ranking model, in turn, is trained and released biweekly with additional verification. This allows us to be more agile with the non-ML components and more stable with the ML component.
There is a family of release-related techniques, including blue-green deployment (see figure 13.3) and canary deployment (see figure 13.4). They may slightly differ, but the core idea behind them is to have two or more systems in production at the same time. New users are sent to a new version once it is deployed while the old version is still functioning. In the blue-green deployment, changes are discrete (all users are switched to either the blue or green version), while in canary deployment, the rollout is granular, and the new version is used for a small portion of the existing users. It allows us to test the new version in production and roll back more easily if something goes wrong. It is not specific to ML systems, but it can be applied for them as well. Another technique ML systems can use is decomposition (e.g., some components like a model can be released with canary deployment; in contrast, other components like API layers can be released in a more traditional way).
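A common way to implement the granular rollout in a canary deployment is deterministic per-user bucketing, so the same user consistently sees the same version while the fraction ramps up. A sketch; the hashing scheme is one possible choice, not a standard:

```python
import hashlib

def routed_to_canary(user_id: str, canary_fraction: float) -> bool:
    # Hash the user id into one of 10,000 stable buckets; a user is on
    # the canary if their bucket falls below the current rollout
    # fraction. Raising the fraction only ever adds users, never
    # reshuffles them.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket / 10_000 < canary_fraction
```

Because routing depends only on the user id and the fraction, rollbacks are trivial: set the fraction to zero and everyone is back on the stable version.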
Canary deployment should not be confused with the A/B tests we discussed in chapter 12. While technically they may look like similar concepts (multiple instances of a system are live, and traffic is sent toward them in a proper split), their intent is different. A/B tests are used to evaluate the performance of different versions of the system, while canary deployment is used to test the new version of the system before switching to it completely. In A/B tests, we want to compare the performance of the system across different versions, and the time of the test is decided based on statistical significance; it is fine to end up keeping either option A or option B. With canary deployment, we want to make sure the new version is good enough to be used by all users, and we aim to switch to it completely as soon as possible.
While large enterprises usually tend to have stricter policies and longer release cycles, this is not always the case: aiming to reduce the time gap between iterations using both technical and organizational approaches is a noble goal. It is also an important aspect of the DevOps culture, and ML systems are not an exception here. If you’re interested in this topic, we recommend checking out the book The Phoenix Project. It is not about ML systems, and it’s not even a very technical book (more like a “business fable”), but it is a great read about the DevOps culture and how it can be applied in the real world.
Startups and mature big-tech companies are usually more agile and have shorter release cycles, but that can vary as well. Arseny once worked for a startup where deploying late on Friday night was a common practice (sometimes it led to outages that had to be solved by engineers who were already enjoying a solid pint of beer). In the other, more established startup, the release cycle was very flexible; every engineer could deploy their component at any time, but a simple guardrail warned before deployment if the time was imperfect (e.g., Friday after lunch).
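A guardrail like that can be tiny. Here is a sketch; the risky-time rules are illustrative, not a recommendation for any particular schedule:

```python
from datetime import datetime

def deploy_warning(now: datetime):
    # Warn (don't block) when the deploy time looks risky; the exact
    # rules below are placeholders for whatever a team agrees on.
    if now.weekday() == 4 and now.hour >= 13:  # Friday after lunch
        return "It's Friday afternoon: are you sure you want to deploy?"
    if now.hour < 9 or now.hour >= 19:
        return "Off-hours deploy: on-call coverage may be thin."
    return None
```

The important design choice is that this is a warning, not a hard block: engineers keep the ability to ship an urgent fix, but the default path nudges them away from risky timing.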
Those who are blessed to schedule releases should be aware of all the dependencies: how the system influences other systems and how others can affect it. The biggest outages usually happen either on the infrastructure level or between the systems or components that are not owned by the same team. Having proper communication with other teams to avoid such problems is a must-have skill for any senior engineer. Unfortunately, this skill and the understanding of its importance often come at a certain cost (usually after a big outage).
Arseny’s biggest outage was related to a logger configuration (not something you expect to care much about while building an ML system). Some ML-related load happened in threads, and when aiming to track the behavior, he overengineered a complicated logger to keep requests’ IDs across threads. It worked fine in one environment and was later deployed to another, where engineers had way less control. It was a dark moment when the defect revealed itself: the problem could only happen after 1,000 requests of a certain type that had never occurred in previous environments. It took a while to understand the root cause and roll back the new version, so the incident became a good lesson that led to more checks in the release process.
It is never enough to build the system and integrate it directly with other components that need its output in terms of product logic. Any system requires additional connections for healthy operations, both tech and non-tech related. Some are required to smooth the maintenance and operations (from both the engineering and product perspective); others are caused by implicit nonfunctional requirements related to the system (such as legal or privacy concerns). Let’s name a few.
CI is usually the first element of the whole infrastructure to be set up. It helps identify and resolve integration problems while facilitating smoother and faster development processes. Typical tasks for CI are running tests (unit or integration) and building artifacts that will be used downstream (e.g., for further deployment). However, there may be other needs covered on the CI level, such as security testing, performance testing, cost analysis (“Does this release require us to spin out more cloud servers?”), deployment to test environments, linting the code style, reporting key metrics, and many more.
Two main things that are not part of the system’s data but need to be stored are logs and metrics. Usually, there is a common approach to how logs and metrics are stored, aggregated, and monitored in a company, and you just need to follow the common way. We’ll elaborate a bit more on this topic in the next chapter.
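For instance, predictions are often logged as structured one-line JSON records so the company-wide pipeline can aggregate them and derive metrics later. A sketch with illustrative field names:

```python
import json
import time

def prediction_log_line(request_id: str, model_version: str,
                        latency_ms: float, ok: bool) -> str:
    # One JSON object per prediction: trivial to ship to any log
    # aggregation stack, and the model version in every record makes
    # release-related regressions easy to slice out.
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "latency_ms": latency_ms,
        "ok": ok,
    })

line = prediction_log_line("req-123", "1.4.2", 35.7, True)
```

Including the model version in every log line pays off the first time you need to answer "did latency regress after the last model release?"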
The system is likely to malfunction at some point, and that should not come as a surprise. Thus, the system should be connected to the alerting and incident management platform used in the company, so the person on call is aware of a potential incident and can react appropriately.
But how do they react, exactly? Here you may need to prepare specific cookbooks describing what expected failure modes are and how to approach them. Also, there may be an additional toolset to help with firefighting, like an admin panel for configuration, system-specific dashboards, and so on, which we will cover in chapter 16. Again, usually there is a company standard shared between ML and non-ML systems, so it is very likely that you will only need to adapt the software toolset already in use without reinventing the wheel.
Designing a system requires considering the whole life cycle of the system, not just the happy path, and being a little paranoid helps.
Other than purely technical aspects of operations, there are some non-tech aspects that should be considered. They are often related to customer success or compliance, and they are not always obvious. What should we do if the user wants all their personal data to be purged as allowed by the General Data Protection Regulation (recall chapter 11 and imagine how deeply their data can propagate)? Is there a regulation forcing the model to be explainable, and what is the best way to follow it without sacrificing the model’s performance? What if a high-level executive or a startup investor faces a bug in the system and gets mad? How do we debug the system’s behavior in hard-to-reproduce scenarios (e.g., a defect can be only reproduced by an aforementioned executive’s user account)? All these questions should be answered before the system is released, and the answers should be at least briefly reflected at the design stage. Otherwise, the changes may be too expensive to implement later. Often it requires building additional components, like some user impersonation mechanism or an admin panel for data management or model explainability, and it may take a lot of project time and require help from other teams (e.g., Legal or Compliance to understand the regulations or Web Development to build the required dashboard).
From our experience, all these additional connections and considerations usually take more time than the core system itself, and the bigger the company, the more effort it takes. Given current trends, it’s not likely to improve in the near future as even more regulations related to ML and privacy are being applied globally.
Your system may have a legit reason to fail. An example of such a reason could be an external dependency: you pull out a chunk of data from a third-party API, and at some point, it’s just not available. That is one of the situations where you may want to have a fallback solution.
A fallback is a backup plan or an alternative solution that can be used when the primary plan or solution fails or is not available. We use it to ensure that the system can still function and make decisions even if the primary ML model has failed for whatever reason.
This can be particularly important in systems used for critical tasks or in industries where even the shortest downtime can lead to significant consequences. For example, a fallback can be crucial for a model used to predict equipment failures in a manufacturing setting, ensuring that production can continue even if the primary model experiences problems.
Another reason to use a fallback is to provide an alternative solution when the primary system is unable to provide a satisfactory answer or a confident prediction or when the primary model’s output is outside the acceptable range.
There are quite a few different approaches that can be used for implementing a fallback. One common approach is to use a secondary ML model, which can be trained on a different set of data or using a different algorithm. It might be a simpler baseline solution as we reviewed in chapter 8 or a dual-model setup. In this setup, the first model is built using only stable features, while the second model is used to correct the output of the main model using a larger feature set. The models can be used together, with the output of one model chosen over the other based on the input data or a predetermined rule. Alternatively, input feature drift monitoring (see chapter 14) can be set up for the “core” model to detect crucial shifts.
Another option is to use a rule-based system as a fallback, which can provide a stable and predictable response when the model is unavailable or is performing poorly. It is also possible to use a combination of these approaches, such as using a rule-based system to handle simple cases and the ML model for more complex scenarios (however, this alone introduces additional complexity and breakpoints).
As with baselines, a simple constant can be our fallback as well. Finally, sometimes a fallback solution is to reply with an explicit error message.
A fallback solution should always have a plan in place for activating the fallback and switching between the ML model and a fallback system. It can be either automatic (triggered by a monitoring event), manual, or hybrid, depending on the use case.
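The approaches above (a secondary model, a rule-based backup, a constant, plus an automatic trigger) can be combined in a small sketch; all names and the plausibility check are ours, standing in for domain-specific logic:

```python
def predict_with_fallback(features, primary, fallback,
                          is_plausible=lambda y: 0.0 <= y <= 1.0):
    # Try the primary model first; switch to the fallback when the
    # primary fails outright or returns an implausible value. The
    # [0, 1] range is an illustrative placeholder for a real check.
    try:
        prediction = primary(features)
        if is_plausible(prediction):
            return prediction, "primary"
    except Exception:
        pass  # swallow the failure and use the backup path
    return fallback(features), "fallback"
```

Returning the source of the prediction along with the value makes it easy to monitor how often the fallback fires, which is itself a useful health signal.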
One custom type of fallback is an override. It is a way to manually override the model’s output when it is signaling a bad prediction. One example may be dropping the model’s output and using a constant instead when the model’s prediction is beyond the acceptable range or when the model’s confidence is too low. Another reason to use an override is related to a release cycle. For example, a customer complains about the model failing in a very specific scenario. Ideally, we need to ensure this scenario is represented in the training data, retrain the model, run all the checks, and deploy it. But, as we discussed earlier, it may take a while. So we can override the model’s output for this particular scenario using a rule-based approach, keep the customer happy, and address it properly in the following release.
The downside of overrides is that they are not transparent and can be easily forgotten. For this reason, it is important to have a way to track them and a plan for addressing them properly; otherwise, they may turn into technical debt. The positive side effect of having many overrides is that collections of overrides can be used to improve the model via multisource weak supervision, a technique in which unlabeled data is labeled with “labeling functions” (heuristics that are not perfect but are cheap to implement). The labeling functions produce a noisy dataset that becomes a foundation for model training. More details on this technique can be found in papers by Alexander Ratner et al., “Data Programming: Creating Large Training Sets, Quickly” () and “Training Complex Models with Multi-Task Weak Supervision” (). The concept of multisource weak supervision has gained recognition in the industry thanks to the popular library named Snorkel ().
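To keep overrides from being forgotten, it helps to store them with an owner and a reason so they can be revisited. A minimal sketch; the structure and names are ours:

```python
class OverrideStore:
    # Track every manual override with enough metadata to revisit it
    # later, instead of letting it rot into technical debt.
    def __init__(self):
        self._rules = {}

    def add(self, scenario_key, forced_output, reason, owner):
        self._rules[scenario_key] = {
            "output": forced_output, "reason": reason, "owner": owner,
        }

    def apply(self, scenario_key, model_output):
        # Forced output wins when a rule exists; otherwise the model's
        # own prediction passes through untouched.
        rule = self._rules.get(scenario_key)
        return rule["output"] if rule else model_output

    def audit(self):
        # Everything listed here should eventually be fixed in the
        # model itself and then removed.
        return dict(self._rules)
```

A periodic review of `audit()` output is a cheap way to make sure overrides get retired, and the accumulated rules double as candidate labeling functions for weak supervision.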
Arseny’s colleague once implemented a more elegant solution that helped address the limits of a long release cycle. He worked on a named entity recognition problem, and sometimes the model was not able to recognize some entities that were important to customers. However, training a new model after the problem had been found was not an option because of the long training time. So he implemented a solution based on the knowledge base: before running the model, the input text was checked against the knowledge base, and if the possible entity was found there, it was used as a hint for the model. It allowed the team to fix the problem without retraining. Adding a sample to the knowledge base could be done in a minute by a nontechnical person, so the problem could be addressed promptly. The solution is described in more detail in a blog post (). This case was somewhat similar to overrides, but it augmented the model’s inputs, not outputs.
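The hint mechanism can be sketched like this; the substring matching is a deliberate simplification of whatever lookup the real system used:

```python
def recognize_with_hints(text, ner_model, knowledge_base):
    # Look up known entities in the input before calling the model and
    # pass any matches as hints, so a nontechnical person can "teach"
    # the system a new entity without retraining. The naive substring
    # match here stands in for a proper lookup.
    hints = [entity for entity in knowledge_base
             if entity.lower() in text.lower()]
    return ner_model(text, hints=hints)
```

The key property is the division of labor: the knowledge base is updated in minutes by anyone, while the model itself is retrained on its usual slow cycle.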
Approaches to integration for both Supermegaretail and PhotoStock Inc. aim at creating user-friendly, fast, and efficient mechanisms for either inventory predictions or search results.
Supermegaretail’s integration strategy is tailored to offer a seamless, dynamic, and highly responsive prediction system that helps manage inventory.
The integration strategy for PhotoStock Inc. is focused on providing the most relevant search results regardless of the complexity of search queries while maintaining prompt responses.