Книга: Machine Learning System Design

16 Ownership and maintenance

This chapter covers

  • The importance of proper system maintenance
  • Accountability as one of the key factors in building and maintaining a healthy machine learning system
  • The “bus factor” and the tradeoff between teams’ efficiency and redundancy
  • The fundamental importance of properly arranged documentation
  • The deceptive appeal of complexity
The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task with full humility and avoids clever tricks like the plague.
— Edsger Dijkstra

Throughout the last 15 chapters, we have tried to keep this book organized as an extended, in-depth checklist that you can refer to at any stage in the design of your machine learning (ML) system. But these seemingly obvious recommendations are harder to follow than you may think.

Building an ML system from scratch, and especially operating an existing solution, is a process so complex and demanding that we’re guaranteed to stumble, blunder, hit a few bumps, and compromise some of the principles on our road to bringing value to a business. Being imperfect humans, we admit that sometimes we allow ourselves to ignore some of our own tips.

Indeed, sometimes you may see or end up with a system that does not follow the principles we advocate in this book, and the reasons may vary widely: some details could have been missed, the company may have changed priorities and constraints, the system may really need a refresh because its original assumptions are no longer valid, and so on.

Usually, it’s even more mundane: the people who originally designed and built the system tend to change jobs. More than that, they may leave the company before they finish building it; this is the nature of things. With that in mind, ownership and maintenance are not something to think about only later; they are a cornerstone of the whole system, to be ingrained from the very beginning. In this process, however, there is a risk of becoming too attached to your creation, and this should be avoided. The system should not be viewed as part of its author; you should detach yourself and your ego from your design to deliver the best possible solution.

To put that in three bullet points, it is essential to know the following at any moment in time:

  • Who is accountable for each part of the system
  • Who can step in for them if they become unavailable
  • Where the knowledge about the system is written down

To some extent, the design document helps answer these questions (although only partially and not always directly). That means only one thing: we must focus on these aspects to avoid woeful consequences in the future. After all, the most important thing is to build and maintain a product that will meet your demands and bring value.

In this chapter, we cover the essential basics of making the system long-lasting and robust to human-induced changes.

16.1 Accountability

In earlier chapters of this book, we mentioned the importance of involving a variety of stakeholders in the information-gathering process, as they provide critical input into the preparation stage of ML system development.

As the project grows, is enriched with new inputs, and evolves from a simple constant baseline to a series of tightly interconnected complex models, new participants will gradually join it. Some will join only for a short period and leave once their participation is no longer necessary (e.g., data labelers are vital at the dataset preparation stage but drop off by the time of deployment), some will become permanent contributors until the system’s release, and a handful of people will eventually become what can be called the “core team” (i.e., your closest people on the project). These colleagues of yours (or representatives of external vendors, if that’s the case) will be the ones accountable for the end result and the stability of your system. And it is you who should keep in touch with them on a regular basis, or at least know them personally.

Knowing the areas of responsibility and the people assigned to these roles is crucial for the successful launch and future of any project or system. It may sound absurd, but it’s even more important for team members who are actually responsible for specific components to be aware of that. You would be surprised to know how many times person A was convinced that person B was accountable for a certain part of a project, while person B, in turn, was sure of the opposite. A classic illustration of such a case can be seen in figure 16.1.

Figure 16.1 Explicit versus implicit approaches to involving people in the project

Being redundantly explicit is much more valuable than being implicit in the hope that everyone is on the same page; based on our experience, this approach pays off as soon as there are three people on the project, let alone more team members. One of our favorite quotes on the subject comes from the Nobel Prize-winning physicist Richard Feynman. In his book Surely You’re Joking, Mr. Feynman!, he wrote:

One of the first interesting experiences I had in this project at Princeton was meeting great men. I had never met very many great men before. But there was an evaluation committee that had to try to help us along, and help us ultimately decide which way we were going to separate the uranium. This committee had men like Compton and Tolman and Smyth and Urey and Rabi and Oppenheimer on it. I would sit in because I understood the theory of how our process of separating isotopes worked, and so they’d ask me questions and talk about it. In these discussions one man would make a point. Then Compton, for example, would explain a different point of view. He would say it should be this way, and he was perfectly right. Another guy would say, well, maybe, but there’s this other possibility we have to consider against it.
So everybody is disagreeing, all around the table. I am surprised and disturbed that Compton doesn’t repeat and emphasize his point. Finally, at the end, Tolman, who’s the chairman, would say, “Well, having heard all these arguments, I guess it’s true that Compton’s argument is the best of all, and now we have to go ahead.”
It was such a shock to me to see that a committee of men could present a whole lot of ideas, each one thinking of a new facet, while remembering what the other fella said, so that, at the end, the decision is made as to which idea was the best—summing it all up—without having to say it three times. These were very great men indeed.

If you don’t have a bunch of geniuses working on a national priority project, you had better not rely on their internal understanding: say (or write!) things three times, share them with everyone, and put them where they are hard to miss. Meeting minutes, follow-ups, updates, and syncs essentially exist for this sake.

However, we have seen plenty of people abusing these principles, spending too much time on meaningless syncs and turning the very concept of meetings into an industry-wide boogeyman. These rituals, artifacts, or whatever you call them don’t require much time and shouldn’t be done just because it’s written in the textbooks; rather, they serve the specific goal of maintaining focus. Both neglecting them and overusing them can seriously affect performance and sustainability in the long run. We are convinced that documentation, in particular, is severely undervalued.

Areas of accountability should be written down for explicit and unambiguous understanding across the team. There may be various levels of formality—from the startup-friendly “Alice is fully responsible for making system X work” to a complex multipage hierarchy. A fairly balanced approach involves the simple yet powerful RACI matrix, which splits the people involved into four groups:

  • Responsible: those who actually do the work to complete the task
  • Accountable: the single person who ultimately answers for the result and signs off on the work
  • Consulted: those whose opinions are sought, typically subject-matter experts
  • Informed: those who are kept up to date on progress

For better or worse, there is always room for gaps in accountability; thus, it makes sense to describe an explicit escalation path: what can be done if a situation doesn’t match the matrix or another accountability-related artifact?
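To make the idea concrete, here is a minimal sketch (with invented names and areas) of how a RACI matrix can be kept as structured data next to the code, together with an automated check for the most common gap: an area with no single accountable person.

```python
# A RACI matrix as plain data. Names and areas are invented for
# illustration; in practice this could live in the repo as YAML or JSON.
RACI = {
    "data pipeline": {
        "responsible": ["Alice"],
        "accountable": ["Alice"],
        "consulted": ["Bob"],
        "informed": ["Carol", "Dave"],
    },
    "ranking model": {
        "responsible": ["Bob", "Carol"],
        "accountable": [],  # a gap: nobody is accountable
        "consulted": ["Alice"],
        "informed": ["Dave"],
    },
}

def raci_gaps(matrix):
    """Return areas violating the 'exactly one accountable person' rule."""
    return [
        area
        for area, roles in matrix.items()
        if len(roles.get("accountable", [])) != 1
    ]

print(raci_gaps(RACI))  # -> ['ranking model']
```

Running such a check automatically (for example, in CI) keeps the matrix honest as people join and leave the project.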

One final, and probably the least favorite, part of accountability is the on-call rotation. Being on call means that a specific person is responsible for responding to any critical incidents during their shift. They must be ready to react quickly in case of an emergency, usually within 30 to 60 minutes. The rotation part implies that this duty is shared among team members and changes on a regular basis. For example, one person might be on call for a week and then pass the duty to the next person for the following week (see figure 16.2).

Figure 16.2 On-call shifts are a necessary measure against team burnout.
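A weekly round-robin rotation like the one in figure 16.2 is easy to generate programmatically. A possible sketch, with invented names:

```python
from datetime import date, timedelta

def weekly_oncall(team, start, weeks):
    """Build a round-robin schedule: one (week_start, engineer) pair per week."""
    return [
        (start + timedelta(weeks=i), team[i % len(team)])
        for i in range(weeks)
    ]

schedule = weekly_oncall(["Alice", "Bob", "Carol"], date(2024, 1, 1), 4)
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
# 2024-01-01 Alice
# 2024-01-08 Bob
# 2024-01-15 Carol
# 2024-01-22 Alice
```

Publishing such a schedule well in advance lets people plan their lives around their shifts instead of being surprised by them.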

An on-call schedule usually emerges as the company grows. Those who started their careers in smaller companies, or have experience with them, may recall that there is no need for a rotation in those environments: everyone is on call all the time, with incidents handled by whoever is available at the moment. At some point, however, this approach becomes unmanageable, as responsibility blurs between team members and incidents end up being managed by everyone and no one at the same time. As soon as you start to pick up signals of such behavior, it’s time to formalize an on-call rotation. Later, there might be a need for a dedicated on-call team, structuring the rotation into tiers like L1/L2/L3 and so on.

But even if your on-call schedule is up and running, there may be other reasons the whole team gets overloaded. Arseny worked for a company in its early evolution stage, where he faced system incidents that couldn’t be solved at all simply because he had no clue which systems were even affected. This was a frustrating experience until the engineers who had built and maintained those systems finally wrote solid cookbooks with recommendations for typical problems. Some tech debt was cleared, and on-call shifts transformed from endless nightmares into a regular duty: not the most comfortable one, but at least tolerable. When L1 incidents became owned by a dedicated team, the time the ML team spent on call was reduced even further.

Obviously, being pinged at 3:00 a.m. on a Saturday is not ideal. This creates an additional incentive to build a robust, well-monitored system that minimizes problems, and to arrange the schedule so that people know the on-call roster and shift times well in advance. Any engineer who doesn’t see the value in proper logging and observability should be put on call for a long enough period to change their mind.

We believe that a person responsible for system design and implementation is also responsible for the system’s maintenance and support. They know the details better than anyone else, and they can forecast corner cases and suggest shortcuts to fix them. It doesn’t mean they have to be the only person on call, but they should prepare the system for the on-call rotation and provide the on-call team with the necessary tools and documentation.

Typical tools include

  • Monitoring dashboards covering the key system and business metrics
  • Centralized logs detailed enough to trace an individual request
  • Alerts for the most important failure modes
  • Scripts or instructions for restarting, rolling back, and scaling the system

Access to production data is a bit trickier. It’s not always necessary for fixing a problem, and we have witnessed cases where privacy policies limit access. However, it’s usually good to have access to the production data to investigate a problem and find the root cause; otherwise, logs and metrics should be detailed enough to help with that.

There should always be a runbook containing notes on how to fix the most common problems and a list of people to escalate to if needed (e.g., there may be an outage on the vendor’s side, and the on-call engineer has to reach the CTO, who can communicate with the vendor). Many problems occur with some cadence; they are reflections of either tech debt or usage patterns. For example, imagine having a big customer with a spiky usage pattern that causes the system to struggle with the load. For such a scenario, an on-call engineer may need a recipe for spinning up additional instances of the system and scaling it back down when the load returns to normal.

Production problems call for two kinds of follow-up: solving and learning. To avoid facing the same mistake repeatedly, make sure to set up a process of learning from failures, arranged through retrospectives and postmortems. We recommend following the principle of proportional response: the set of postincident actions should be sized in proportion to the failure’s effect, adjusted for the chance of a similar failure in the future. Some incidents are only worth a 10-minute discussion and a paragraph added to the runbook. Arseny once triggered an outage so large that the CTO had to start an initiative named “Race for Reliable Releases,” which involved every engineering team in the company and improved reliability across all systems, not only the ML system Arseny had successfully broken.

16.2 Bus factor

The bus factor is a measure of the risk of a project being disrupted by the loss of a single team member. The term bus factor comes from the hypothetical scenario of a team member being hit by a bus, which would suddenly and unexpectedly remove them from the project.

The CAP theorem states that any distributed data store can only offer two of the following three characteristics:

  • Consistency: every read receives the most recent write or an error
  • Availability: every request receives a nonerror response, without the guarantee that it contains the most recent write
  • Partition tolerance: the system continues to operate despite messages being dropped or delayed by the network between nodes

When a network partition failure occurs, a decision must be made between canceling the operation to enhance consistency at the cost of availability or proceeding with the operation to maintain availability while potentially compromising consistency.

We consider team structure to face a very similar tradeoff, but with different criteria: efficiency and redundancy. In computer science, redundancy means having extra or duplicate resources available to support the primary system: a backup or reserve that can step in if the primary system fails. The reserve resources are redundant in that they are not used while everything is working correctly.

How can this redundancy/efficiency problem affect the team structure? From what we have seen, the company/team/project scheme usually evolves from being efficient to becoming redundant. However, neither being too efficient nor too redundant is beneficial.

16.2.1 Why is being too efficient not beneficial?

If you are too efficient and cover multiple areas with only a few people, you are at risk. Every single person on the team is irreplaceable, not only thanks to their experience but, unfortunately, also because they work at full capacity. If anything happens to even one of your team members, the company (or a project) is in trouble. The only way to handle arising problems with no drop in overall efficiency is through acts of heroism, which are not in any way scalable and thus are considered an antipattern (see figure 16.3).

Figure 16.3 The downside of ultraefficiency is extremely high vulnerability to external factors.

16.2.2 Why is being too redundant not beneficial?

As opposed to efficiency, the main advantage of redundancy is extra capacity that provides reliability, security, room for improvement, and a margin to outlast a crisis. However, too much redundancy creates sloppiness, reduces trust between team members, and repels top performers, as it impairs the overall vibe and feeling of doing a meaningful and impactful job.

The two examples here are obvious corner cases that should be avoided. As soon as you feel that the team is approaching its capacity limit, it is worth thinking about expanding; this will be the right solution in the long run, despite the increased costs at the current moment.

The same rule works the other way around: keep just enough capacity to avoid working to exhaustion on the one hand, while being well-armed for potential crunches at peak load times on the other.

16.2.3 When and how to use the bus factor

We need to keep the right balance between efficiency and redundancy, and the first step to controlling something is the ability to measure it. One very simple and well-known metric is the bus factor.

The bus factor is calculated by counting the number of team members who would need to be lost before the project could no longer continue. For example, a project with a bus factor of 1 would be in serious trouble if it lost even one (specific) team member.

Obviously, a bus factor of 1 is less than desirable, while a bus factor of 10,000 might be too much (though this can happen if you need many people with relatively similar scope, e.g., in customer support). The final number depends on many variables: project importance, budget constraints, deadlines, expected turnover, and the project owner’s anxiety or confidence. But as soon as you have a list of people accountable for different parts of the system or project, you have everything you need to calculate the bus factor: the areas and the people accountable for them. The next step is to find out how many other people know each area well enough. As a rule of thumb, the person accountable for an area knows everybody else who can (to some extent) replace them. With that in mind, you can calculate the bus factor, assess potential risks, and make the necessary decisions on hiring, moving people, or collaboration activities to cross-pollinate knowledge and address any fragility in the project.
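Under a simple model where the project stalls as soon as any single area loses everyone who knows it, the bus factor is just the coverage of the least-covered area. A sketch, with invented names and areas:

```python
def bus_factor(knowledge):
    """Smallest number of people whose loss leaves some area uncovered.

    `knowledge` maps each area to the set of people who know it well
    enough to maintain it. Removing everyone behind the least-covered
    area is the cheapest way to stall the project, so the bus factor
    equals that area's coverage.
    """
    return min(len(people) for people in knowledge.values())

knowledge = {
    "feature pipeline": {"Alice", "Bob"},
    "ranking model": {"Carol"},  # a single point of failure
    "serving API": {"Alice", "Dave", "Eve"},
}
print(bus_factor(knowledge))  # -> 1
```

This is a deliberate simplification that ignores partial knowledge and overlapping duties, but even such a crude number is enough to flag single points of failure.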

Of course, we can’t label people in a discrete manner as “they know how the system or its component works” versus “they don’t.” Typically, there are shades of knowledge within a system, and even when it is not possible to keep many engineers informed about everything, it is possible to spread this knowledge partially. This usually manifests through design/code/documentation review, pair programming sessions, and open hours where key members share their expertise.

This reasoning describes the balance between two criteria. Real decision-making, though, may and will require more criteria to take into account (e.g., not just team size but also its seniority balance and exposure to particular technologies).

For those interested in a simple yet practical framework for viewing the tradeoff between costs, system performance, and human capacity, we recommend getting familiar with the Rasmussen model for failures, nicely explained in this post. Unlike the preceding text, which focused on team structure, the post reflects an SRE’s perspective on system tradeoffs: while people and hardware are very different, there are similar patterns in finding an optimal tradeoff.

16.3 Documentation

We have already mentioned the importance of documentation. Unfortunately, documentation is usually highly underestimated and does not always receive the level of care and attention it deserves. Ironically, the word “documentation” often stands next to such words as “later,” “not now,” and “not urgent.” And when documentation is finally needed, it is usually either too late to write it or what exists is hopelessly outdated.

The significance of documentation cannot be overstated. It serves as a means of knowledge transfer and onboarding new team members, helping to smooth the learning curve for newcomers and ensuring the system’s longevity. Comprehensive documentation allows someone outside of the responsible team to understand and run the system effectively.

Documentation also helps to achieve higher reliability by enabling smoother transitions when team members leave (bus factor) or new members join. Moreover, documentation facilitates collaboration across different departments and stakeholders, fostering a shared understanding of the system (accountability) and keeping each engineer more autonomous and thus more productive.

While it may require additional time and effort upfront, documentation ultimately saves resources in the long run. It reduces the reliance on tribal knowledge, prevents the repetition of work, and minimizes the risk of costly errors and delays due to a lack of information. Documenting processes, procedures, configurations, and best practices empowers the team to work more efficiently and provides a foundation for continuous improvement. It should be prioritized from the beginning and not considered an afterthought. By acknowledging the importance of documentation and dedicating the necessary resources, teams can build resilient systems that can adapt to changes and continue to deliver value over time.

To some extent, our whole book is dedicated to why and how to write one particular sort of documentation, which captures important information about the system’s design, architecture, implementation details, and operational procedures. But a design document is a very specific piece of documentation: it does not aim to be a holistic text covering all aspects of the system but rather an overview that helps sync the team of system builders and stakeholders. That said, there is always space for other documents, including

  • Runbooks and on-call cookbooks for handling typical incidents
  • Onboarding guides for new team members
  • API references and integration guides
  • Postmortems and retrospective notes

Investing time and effort in creating and maintaining documentation from the outset helps teams mitigate risks associated with knowledge gaps and dependencies on specific individuals. In the course of the book, you have seen two design documents that we hope have given you a good overview of the described systems.

16.4 Complexity

Make everything as simple as possible, but not simpler.
— Attributed to Albert Einstein

According to the second law of thermodynamics, the entropy of an isolated system left to spontaneous evolution cannot decrease with time. While entropy was originally a physics term, it was later adopted by information theory with almost the same meaning: entropy quantifies the amount of uncertainty or randomness.
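For reference, the information-theoretic definition: for a discrete random variable X taking value i with probability p_i, the entropy is

```latex
H(X) = -\sum_{i} p_i \log_2 p_i
```

the average number of bits needed to describe an outcome, maximal when all outcomes are equally likely.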

Metaphorically speaking, software and ML systems follow the very same law: over time, entropy can only increase. At some point, it becomes too hard to handle, which is why old systems are often decomposed into sets of smaller systems (remember the fundamental theorem of software engineering from section 13.1) so they can be maintained by teams of decently intelligent people, not generational talents. Adding a new level of abstraction “hides” the entropy but doesn’t really reduce its level across the whole stack.

There are many pieces of collective consciousness stating similar ideas: the YAGNI (“You aren’t gonna need it”) approach from the extreme programming culture, the KISS (“Keep it simple, stupid”) principle that originated in the US Navy in the middle of the 20th century, and even Occam’s razor, dating as far back as the medieval period. All these ideas pursue the same goal: limit complexity.

At the same time, software engineering culture often lands on the opposite side of the spectrum, as expressed in acronymic proverbs like DRY (“Don’t repeat yourself”). At first glance, these are contradictory: following DRY suggests introducing more layers of abstraction, and abstraction is exactly one of the main (or, rather, the only) sources of complexity.

Imagine writing a simple code snippet that computes a well-known metric across multiple datasets. A typical implementation would contain maybe three functions: one reads the data, the next calculates the metric, and the final one orchestrates the execution, running both functions in a loop and saving the result.

Now, let’s imagine the two ends of the spectrum: too-simplistic code on the one hand and overcomplicated code on the other, both doing the same thing. The former could be written by a person who doesn’t know much about loops and functions; they would just follow the instructions literally: read file A, calculate the metric for A, save the metric for A, read file B, and so on. The overcomplicated version would contain metaprogramming and other complex patterns. While completely different in nature, both would make the code less readable.
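The balanced middle of that spectrum could look like the following sketch, where the file format (a one-column CSV) and the metric (the mean) are arbitrary assumptions of ours:

```python
import csv
import statistics

def read_values(path):
    """Read a one-column CSV file of numeric values."""
    with open(path) as f:
        return [float(row[0]) for row in csv.reader(f) if row]

def compute_metric(values):
    """The 'well-known metric' is arbitrarily taken to be the mean."""
    return statistics.mean(values)

def run(paths):
    """Orchestrate: read each dataset, compute the metric, collect results."""
    return {path: compute_metric(read_values(path)) for path in paths}
```

Three small functions, no copy-paste per file, and no metaprogramming: enough structure to stay readable, not enough to obscure what actually happens.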

This is a trivial example you can find in almost every software engineering textbook that teaches how to write clean code. But the same principle is applicable on a larger scale. Imagine solving the following problem: your company wants to help the customer support team prioritize the most urgent cases to help the most unhappy customers first.

A typical solution for this problem these days would be to use a foundation large language model with a tailored prompt to classify messages into several urgency buckets. Before the large language model revolution, a proportional baseline would have been a simplistic model (like a logistic regression on top of a bag of words) that covers the majority of urgent messages. It would also need some engineering effort (e.g., building data pipelines to fetch the training set and labeling it with the customer support team, connecting the model with the customer service software to propagate labels and visualize them in the UI, etc.).

A deviation into unreasonable simplicity would be to build a baseline with several ifs and regular expressions. This solution will definitely work for a while, but relatively soon your colleagues will be as unhappy as a customer who was charged twice because of a glitch, asked the company to roll back the transaction, and, for some unclear reason, had the ticket classified as “low priority” by your model.
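A hypothetical version of such an oversimplistic baseline, showing exactly the failure mode described above (all patterns and phrasings are invented):

```python
import re

# Hand-written rules mapping patterns to urgency buckets. Anything
# the rules don't anticipate silently falls through to "low".
RULES = [
    (re.compile(r"refund|charged twice|double payment", re.I), "high"),
    (re.compile(r"crash|cannot log in|can't log in", re.I), "high"),
    (re.compile(r"slow|lagg?ing", re.I), "medium"),
]

def classify(message):
    for pattern, bucket in RULES:
        if pattern.search(message):
            return bucket
    return "low"

print(classify("I was charged twice for my subscription"))       # -> high
print(classify("A glitch rolled back my payment, please help"))  # -> low (!)
```

The second message is clearly urgent, but because nobody wrote a rule for that phrasing, it quietly lands in the lowest bucket.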

An overcomplicated solution can include many components united by the same word: irrelevant. You can bring in multiple complicated natural language understanding models trained from scratch, back them with some research to make them work in a few-shot scenario, and polish them with a calibration layer. The complexity can also be skewed toward engineering rather than ML: instead of a couple of simple APIs, an engineer can use all the buzzwords they know to make the system as fault-tolerant as spaceship firmware and as scalable as Google’s search engine, ignoring the fact that it only needs to process dozens of messages per day.

All kinds of systems, including nonsoftware ones, suffer from complexity. An additional layer of complexity in software is its constant evolution. When you’re building a house, you won’t be adding extra rooms on top of it a month after finishing the roof; adding new features to a software system, though, is common practice. The complexity of data-focused systems is further increased by constant data drift. ML systems are even more complex: some models are nonlinear black boxes by design, and this doesn’t make things easier.

At the very end of the book, we’d like to once again refer to chapters 2 and 3, which cover problem understanding and preliminary research. Some complexity is unavoidable, but a big share of excessive complexity is caused by misses at these stages: poor understanding of the problem, surface-level research, irrelevant goals, and obvious risks left uncovered from the very beginning. These misses can lead to a snowball of poor decisions and ever-increasing complexity.

When you need to cross a river, a reliable paddle boat can be enough. You could even equip it with a motor to speed up your crossings, but there is no way to evolve this boat into a vessel transporting thousands of people across the ocean (see figure 16.4). At the same time, those who start assembling a cruise ship when the traveling distance doesn’t exceed a couple of miles will never reach the goal. Many ML systems are doomed for reasons related to this metaphor: their designers either try to evolve a primitive system with bells and whistles to meet rather challenging goals or start building a spacecraft with no good reasoning and get fired before the first phase of construction is even ready.

Figure 16.4 Just like a basic model will end up underperforming and failing to meet the project goals, an overkill solution will devour valuable resources of a business.

16.5 Maintenance and ownership: Supermegaretail and PhotoStock Inc.

Throughout the book, we have concluded each chapter with a case study illustrating the chapter’s relevant design-document section for Supermegaretail and PhotoStock Inc. That makes little sense here, as it would require making up a couple dozen fake first and last names. Instead, we limit ourselves to general considerations about how maintenance would go in both examples.

Supermegaretail is a large corporation with thousands of employees, a high level of bureaucratization, and complex work processes, and its key business units are predominantly nontechnical. An important part of a project’s success therefore lies in successfully coordinating all stages with stakeholders at each level of the hierarchy. Thus, the basis for effective postrelease operation of the system would be a detailed RACI matrix covering the company’s employees involved in the project, as well as lists of sign-offs from the delivery team and the executive and business teams.

On the other hand, PhotoStock Inc. is a fairly small, purely tech-based company, and the specifics of a custom search engine working with millions of images imply a continuous load on the system with regular peaks. The search engine is both an ML-heavy and an infrastructure-heavy project, and as highlighted before, it requires extensive cross-team collaboration. Given that there are four core components (API handlers, index, ranking models, internal tools), we would want to make sure that

  • Each of the four components has an explicitly accountable owner
  • Every component is known well enough by at least one other engineer, keeping the bus factor above 1
  • Runbooks and documentation exist for each component so incidents can be handled by engineers outside the owning team

With that in mind, two levels of on-call rotation would be the optimal solution:

  • An L1 rotation that handles general incidents first, armed with runbooks, monitoring dashboards, and explicit escalation paths
  • An L2 rotation of component owners who are brought in for problems requiring deep expertise in a specific component

As we said back in chapter 4, the design document is a living thing, and the maintenance section is to be added and filled in last, not only because, at this stage, you have a complete vision of your ML system but also because, by this point, you have built connections with representatives of other teams and units and understand which people will be included in the working group of your project.

Summary

  • Ownership and maintenance are not an afterthought; they should be ingrained in the system from the very beginning.
  • Areas of accountability must be explicit and written down; a RACI matrix is a simple and powerful way to achieve that.
  • The bus factor measures how vulnerable a project is to losing specific team members; keep a sensible balance between team efficiency and redundancy.
  • Documentation is chronically undervalued; it enables knowledge transfer, onboarding, reliability, and engineer autonomy.
  • Complexity in software and ML systems only grows with time; keep the system as simple as possible while avoiding both oversimplification and overengineering.
