1 Essentials of machine learning system design

This chapter covers

What machine learning (ML) system design is, why it is so difficult to define, and where you may first encounter it
Who we believe will benefit most from reading this book, what information we’re about to give you, and how it will be structured
What principles of ML system design can be helpful and the best time to apply them

Machine learning (ML) system design is a relatively new term that often puzzles people in the industry. Many find it hard to pin down the scope of responsibilities behind this term, not to mention finding a proper name for the corresponding role or position. The job may be done with varying efficiency by ML engineers, software engineers, or even data scientists, depending on the scope of their role.

While all of the positions are valid, we believe that to become a seasoned expert in ML system design, you have to encapsulate expertise from each of the backgrounds. Note that while some of the things we discuss in this book are specific to ML systems, others will be familiar to those readers who have already built non-ML software systems (you will find this information in chapters 2, 13, and 16). This is because ML system design, although a new phenomenon, is still based on the classic fundamentals of software development.

But first we need to discover what ML system design is as a whole. In this opening chapter, we will suggest our take on the definition of ML system design and support it with examples from personal experience, both our own and that of our colleagues. We will also describe the perfect persona for the position and share cases showing why a coherent and consistent approach to designing an ML system saves a lot of time in the long run and helps deliver short-term business wins, which is crucial for gaining the trust of colleagues in this new working method from the early stages.

1.1 ML system design: What are you?

ML system design might sound familiar if you have ever interviewed for ML engineer/manager positions at deep tech or big tech companies. The first term commonly stands for startups or R&D units within large corporations that either work with or develop cutting-edge technology; the second refers to the largest and most dominant tech companies in the world, which are often known for their high bar in talent acquisition and advanced engineering culture. Both of us have extensive deep tech experience, so while planning to write this book, we were convinced the definition was clear enough to everyone, and there was no reason to dwell on it.

However, after reaching out to a variety of people for their opinions on the outline, we saw that the term itself caused discord in opinions and interpretations. Perhaps this is due to the fact that the industry has long had a definite list of job titles, which gives candidates a relatively clear understanding of what set of functions and responsibilities they are applying for. The positions of software engineer, research engineer, ML engineer, and so on each entail a certain classic set of functions enshrined in textbooks and eloquently stated in job descriptions.

So, is there a job title associated directly with the term ML system design? Currently, there’s no position completely tied to the scope we’ll be describing in this book, but if you meet a person fulfilling this scope, their position will most certainly be data scientist.

In our attempts to understand the nature of this connection, we reached out to various people working in data scientist positions and eventually realized that the role implies a rather vast and vague list of responsibilities. Indeed, you can find 10 different people working in 10 different companies as a data scientist, ask them what they do, and end up hearing about 10 completely different things:

Create pivot tables in Excel.
Set up a 10 PB distributed cluster.
Build a real-time computer vision system.
Deploy numerous chatbots.
Visualize data in Tableau/Metabase/Looker/PowerBI.
Write SQL scripts.
Run A/B tests.
Create recommender systems.
Handle communication with stakeholders.
Answer questions from top management.

As you can see, a short, crisp-sounding title carries a rather mottled set of functions, having grown into a unifying “jack-of-all-trades” term for anything that goes beyond the commonly accepted scope of work behind the roles of data engineer, ML engineer, and research engineer.

While contemplating this, we found that in the case of ML system design (or rather what later received such a name), the situation is quite the opposite: there is a phenomenon without a common name but with a clear set of functions and responsibilities, and what needs to be done is to organize them and bring them to a coherent structure of interrelated functions.

In the following chapters, we will be giving our own perspective on ML system design and even suggesting unconventional ideas and solutions, but before we dive in, we’d like to suggest our own definition:

Machine learning system design is a complex, multistep process of designing, implementing, and maintaining ML-based systems that involves a combination of techniques and skills from various fields and roles, including ML, software engineering, project management, product management, and leadership.

Figure 1.1 illustrates this definition.

Figure 1.1 The variety of skills one should possess to succeed in ML system design

The reason we’ve highlighted maintaining in italics is that we believe that ML system design does not end at the release of an ML system. Apart from providing accurate predictions and ensuring efficient decision-making, your system must be scalable and flexible enough to be easily adjusted to changing business environments or any other factors, both internal and external. Thus, right after you go live, maintenance and fine-tuning your ML system will secure its efficiency in the long run, which can be crucial, especially when working under strict budget or capacity limitations.

But it is not only the term machine learning system design itself that has been questioned by those who have seen this book’s synopsis or walked through the table of contents. We received a host of questions on various aspects of the book; the following are those we found the most notable:

“Data scientist, machine learning engineer, and software engineer are different roles; why are you fusing them together?”
“It confuses me a little that a book about ML systems covers things like data gathering and reporting, as this is exactly what separates classical machine learning from data science.”
“I was surprised there was no mention of MLOps in the outline, which is the common industry term for many of the components you’re describing (reproducibility, testing, pipelines, etc.).”

For us, these questions became an additional indicator of the confusion between ML and data science, as well as between ML engineers and data scientists in the general public. We have our own perspective on that, but first, let us try to clarify our statements.

Coming from deep tech companies, we got used to calling people who do ML machine learning engineers, but the difference between ML engineer and software engineer is getting slimmer, especially since some prominent people call machine learning “software 2.0.” At the same time, data scientist is a job title mostly tied to people who do product analytics and work with metrics, insights, etc. (Please note we are speaking about our experience in deep tech companies, but since these companies employ thousands of highly qualified pros and gradually set standards for the whole industry, we tend to take this approach as a benchmark.)

When people interview for an ML engineer position, they mostly walk through the software engineer hiring loop topped with additional sections, with ML system design being one of the most important. It is used to draw out signals about a candidate’s expertise, maturity, and ability to overview complex systems and decompose them into blocks of interdependent tasks. This is not an easy exercise, as a candidate has only 40–45 minutes to present their design of a system randomly picked by the interviewer.

Eliezer Yudkowsky, a modern AI writer and philosopher, wrote, “The most dangerous habit of thought taught in schools is that even if you don’t really understand something, you should parrot it back anyway.” This is very applicable to the tech interview flow in some companies: the interviewer provides a puzzle and expects a particular answer to be parroted back. After the interviewee is hired and becomes an interviewer themselves, the bad practice gets reinforced, and the company continues hiring people with perfectly memorized knowledge fragmentarily drawn from various fields. There is no guarantee these people understand the whole picture, and this is what we came across while conducting interviews ourselves.

We interviewed and hired ML engineers for various companies. Some were at the start of their career, some were seasoned experts, and some were solid software engineers switching to ML. However, there was a specific commonality among those who didn’t get through the interview: while working on the ML system design section, they were concentrating too much on details, never getting to the bigger picture.

For us, these failures indicated an expectations mismatch; as young hiring managers, we were convinced that a person who knew all the algorithms, tools, and patterns would be a good fit for the role by default. But later we saw that sometimes people just couldn’t combine their pieces of knowledge into an integrated vision.

In addition, building systems in a real environment is overwhelmingly different from discussing them during interviews. One can learn dozens of popular ML system design questions (“How would you design a job recommendation system for a LinkedIn-like website?”) and still be puzzled when a similar problem occurs in real work.

But let’s ignore the interview part for a while. ML experts get hired for a reason: companies need them to build, maintain, operate, and improve systems—and not just for the sake of writing some code or closing Jira tickets. Businesses need reliable ML systems to reach objectives and solve problems.

Building ML systems requires a wide scope of skills. To put it briefly, a person in charge must be able to answer three questions:

What are we building?
What is the purpose of the system?
How should it be built?

In practice, it requires a combination of skills from multiple roles: a bit of a product manager to understand the main goal and communicate it to peers and stakeholders, a fair share of ML researcher to empower the system, and, of course, a solid software engineering background to make the product usable, maintainable, and reliable. An ML system design expert should be able to think globally and dive deep enough locally if needed.

There are few people who can combine all these skills at the proper level. However, a lot of ML systems are being built these days, and someone has to design them. From our experience, it is common for an ML system to be designed by either a bright ML expert (because it’s ML) or an experienced software engineer (because it’s a system). They do the job but often struggle in the areas where they don’t shine.

To sum it up, the confusion around ML system design is more typical for candidates with less expertise on the one hand and hiring managers or recruiters who are looking for that jack of all trades on the other. However, if we look at it from the point of view of an executive officer or an expert, a much broader picture appears. They know that you hire these specialists to build, maintain, and improve ML systems, and their end performance working on ML systems becomes the ultimate benchmark of their career growth.

We believe that it’s the fusion of data scientist and software engineer with experience in academic ML that constitutes an expert in ML system design. People who end up designing ML systems may come from various backgrounds—software, practical ML, academic ML, data research—and we hope our hands-on experience aided by small bits of theory will help them close the gaps, systemize their knowledge in the areas they’re familiar with, and feel more confident in the areas where they’re lacking precious experience.

1.1.1 Why ML system design is so important

While you have MLOps as a set of tools to use for building and maintaining your ML system, you can consider ML system design a blueprint that you can rely on and refer to at any moment. It gives you scalability and flexibility, because a proper understanding of building blocks and their connections helps to identify bottlenecks and address other problems smoothly. Most importantly, though, it provides a framework that will weld your whole system together.

Some projects are simple enough that they don’t require such a thorough approach. Let’s take construction as an example. You could probably build a shed without an initial blueprint. But when your ambitions spread further to the level of a house or a skyscraper, you can’t get away with not using a prearranged, detailed plan. ML system design is an architectural approach to engineering ML systems that incorporates the experiences of hundreds of experts who have worked in dozens of companies on a multitude of projects.

1.1.2 Roots of ML system design

Building complicated software systems has always been a challenge, and organizations had to crystallize the process somehow. People used a general principle for managing complexity through abstraction: build low-level blocks with complexity encapsulated into them, treat them as magic black boxes, use them to build higher-level blocks, and so on.

This process worked, but it had a weak spot: someone had to decide the structure of all these blocks (what are the highest-level components, what’s their structure inside, and so on to the lowest level of implementation). The most responsible decisions were made by software architects—experienced engineers who worked with many systems.

This kind of approach is usually associated with the Waterfall methodology and Big Design Upfront paradigm. In other words, it assumes software projects start slowly and are deeply analyzed and documented before the first line of a real system code is written. This approach was and continues to be reliable but inertial and bureaucratic. In a world of rapid changes, the project could lose its initial sense before finishing.

Opponents of such slow but steady approaches are often enthusiasts of something called the agile software development paradigm. The authors of the Manifesto for Agile Software Development stated four main values:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

In other words, these people rightly point out that many software systems can’t be effective while trying to plan and document everything. Of course, sometimes such bureaucracy makes sense, for example, when building software that controls medical devices or airplanes. But most software engineers work on other types of applications: office software, entertainment, websites, and mobile apps. That’s how the software architect’s role became associated with something slow and old school, the opposite of swift hackers changing the world rapidly without a software specification approved by the whole hierarchy of architects, managers, and other experts. This agile approach was popularized by the Silicon Valley hacker culture and thousands of successful startups. Even big companies like Meta try to keep such a culture; the company’s early internal motto, from its Facebook days, was “Move fast and break things.”

Let’s summarize this little historical overview: at some point, the industry faced a spectrum of software engineering processes, from a heavily regulated one led by software architects to the chaotic, anarchist “screw the hierarchy” hacker-style way of building things. And, as often happens, things got mixed. More traditional companies tend to become more agile, and even the most anarchist startups mature, introducing processes and separate roles.

This mixture leads to a consensus that dominates tech companies these days: instead of delegating all the decisions to dedicated people like software architects, we will keep it the responsibility of regular software engineers; let them both design systems and write code for these systems. But this level of freedom didn’t wipe out the initial need for decisions: someone still has to have a final word on how things are structured. Someone has to be responsible for the system design. Every engineer may be involved here and there, but seeing the whole picture is critical.

Skills in implementing low-level pieces of a designed system are not the same as skills in designing a proper system. That’s why deep tech companies tend to have separate interview sections to check a candidate’s skills in writing effective code (aka algorithms section) and designing systems: it’s expected that engineers will wear both hats. The split between those two can be different: usually, junior engineers are silent readers of design documents, and senior engineers are authors or active contributors.

In a nutshell, there is a consensus: a solid software engineer should be able to operate on different levels of abstraction, from low-level implementation to high-level architecture decisions.

Everything we have said so far about the definition of system design is fair for any software; we didn’t mention anything ML-related. However, not everyone who can successfully design a software system will succeed in designing an ML system, which is a very specific subset of systems. While designing an ML system, the person in charge should keep in mind many aspects that are not relevant to regular software. In this book, we’ll focus on these aspects; readers interested in more general system design questions can look into other literature.

1.2 How this book is structured

There are several books covering system design, but literature on ML system design is scarce. We decided to contribute to this field and bridge the gap between supply and demand. Our goal is to share our knowledge and experience to help you convert the many things you know into a holistic system.

This book is structured as a comprehensive practical guideline on how to build complex, properly functioning ML systems in various domains, regardless of the size of the company you work for. This guideline includes

The overall landscape with an overview of general structural principles and all the components that make up such systems, as well as the pitfalls you may get trapped in
Low-level checklists of the tools that might come in handy at each step, with a brief reminder of why they are important

The book structure tends to resemble that of a checklist or manual, with an infusion of campfire stories from our own experience. It can be read at once or used at any moment while working on a specific aspect of an ML system.

Each chapter is a high-level checklist mandatory for every ML system. Note that while not all the items must be fulfilled, each of them must be remembered and considered.

In addition, each chapter answers the question regarding why and when the given item is important. It also includes a description of the landscape (what techniques and tools are suitable for the item). The description is systematized (not just a list of 100 buzzwords), although not necessarily exhaustive, as we believe that an experienced reader will be able to compare the example case with something from their background and draw their own conclusions. At the same time, we try not to slip into a typical textbook or course on classic ML or deep learning.

We come from quite different (and therefore, very complementary) backgrounds: both of us have been involved in ML projects, with over 20 years of combined “mileage” in a variety of roles, companies, and environments, from pre-seed startups to multibillion-dollar international corporations. Sometimes we worked long hours as individual contributors. Other times, our work primarily involved rapid team growth and coaching talented and aspiring young engineers. We have witnessed and have been part of successes and failures, big acquisitions, and massive job cuts. And, of course, we’ve discussed a lot of the successes and failures of ML projects with our friends.

But no matter how different our backgrounds are, there’s one thing we strongly agree on: ML projects almost never fail because their participants can’t use algorithms properly. There may be multiple reasons for a failure: a misdirected or completely unnecessary task, sloppy data handling, an unscalable solution with no growth potential—this list could go on and on.

There is a pattern so common that we’ll have to repeat some stories in different parts of the book: a deep expert in a narrow area focuses heavily on their area of expertise, perhaps picks up some adjacent areas, but still doesn’t get the big picture. As a result, some important nuances are missed, which leads to project failure, missed deadlines, and blown budgets.

While books on ML usually provide the “right” answers, our main objective is quite the opposite. What we’d like to do is to teach you how to ask the right questions. These might be the questions you ask yourself, your teammates, users, stakeholders—you name it. Each one of us, as tech industry professionals, accumulates tons of valuable information but can’t always connect the dots. This is where timely questions help structure all the knowledge around us.

We’ve split the book into four main parts so that its structure is in line with the life cycle of any system—research, creation, improvement, and maintenance.

The first two parts are based on the early stages of machine learning system design. Throughout part 1, we’ll focus on the overall awareness and understanding of the problem your system needs to solve and define the steps needed before system development has started. This phase rarely involves writing code and mostly focuses on small prototypes or proofs of concept. Part 2 delves into the technical details of the early-stage work. This stage requires a lot of reading and communicating, which is crucial for understanding a problem, defining a landscape for possible solutions, and aligning expectations with other project participants. If we compare an ML system to a human body, it’s about forming a skeleton.

The third part is focused on intermediate steps. In this stage of a system life cycle, the schedule of responsible engineers is usually flipped. There is much less research and communication and more hands-on work implementing and improving the system. Here we focus on questions such as how to make the system powerful in multiple dimensions: solid, accurate, and reliable. Continuing our human body metaphor, the system grows its muscles.

The final part is all about integration and growth. For an inexperienced observer, the system may seem ready to go, but this impression is tricky. There are multiple (engineering, mostly) aspects that need to be taken into account before the system goes live successfully. In the software world, a system failure is rarely a disaster like in civil engineering, but it’s still an unwanted scenario. So, at this stage, you will learn how to make your system reliable, maintainable, and future-proof. If you’re not tired of human body metaphors, this is where the system gets a mind and gains wisdom because untamed strength can lead to nothing but trouble.

Overall, the opening chapters will contain more general information, which is nonetheless crucial for framing the problem and sets the core fundamentals for building a well-functioning ML system. The further you go into the book, though, the more complex and in-depth the material becomes, providing you with practical examples and exercises. Starting from the very next chapter, we will introduce two fictional cases, radically different from one another, that we will carry through the whole book. Both will require an ML system to solve their problems, and both will evolve as you continue exploring.

In every part of the book, we always prefer intuition over comprehensiveness. There are many aspects to building ML systems, and each one deserves a book of its own. However, we don’t plan on writing a separate book on data gathering and preparation, another one on feature engineering, and another one on metrics. Instead, we describe the tip of the iceberg and review the landscape structure while supporting our thoughts and points with links to noteworthy papers so readers can both familiarize themselves with top-level examples and add their own specific knowledge to the provided framework. We also do not aim to explain details related to particular libraries or engines. We will mention notable examples in certain chapters, but they are only for illustrative purposes in favor of higher-level abstractions.

Real systems are always more complicated than examples we see in blog posts, conference talks, and, of course, interviews. For all of these scenarios, people talk about high-level abstractions, but in reality, the devil is in the details. That’s why we believe getting some intuition on problem-solving is so important: a successful ML system designer should be able not only to recognize some recipe from the cookbook and reproduce it but also to adapt it to company-specific details that can sometimes flip the table.

We hope this book will be useful for

People preparing for an interview for an ML engineer/manager position
Software engineers, engineering managers, and ML practitioners working with an existing complex system who want to either understand or improve it
People who plan to design their own ML system or have designed one already and want to be sure they didn’t forget anything critical

Due to the philosophy described here, the book is not beginner-friendly. We expect our readers to be familiar with ML basics (e.g., you can understand an ML textbook for undergraduate students) and fluent in applied programming (e.g., you’ve faced some real programming challenges outside the studying sandbox). Otherwise, this book is better read after studying basic material.

1.3 When principles of ML system design can be helpful

As we said earlier, applying these principles is critical to building a system complex enough to have multiple failure modes. Ignoring them leads to high chances of delivering something with feet of clay: a system that may work right now but is not sustainable enough to survive a challenge from the dynamic environment of reality. The challenge can be purely technical (what if we face 10 times more data?), product-related (how do we adapt to changed user scenarios?), business-driven (what if the system is to be integrated into a third-party software stack after an acquisition?), legal (what if the government puts forward a new regulation on personal data management?), or anything else. Recent years have only proved we can’t foresee every possible risk.

Improving the system is even more important. As we’ll describe in more detail in the upcoming chapters, building systems from scratch is a relatively rare event. People outside the industry may think software engineers spend most of their time writing code, while in reality, as we all know, way more time is dedicated to reading code. The same goes for systems: much more effort is usually spent improving and maintaining existing systems (which requires an in-depth understanding of system internals), not building them from scratch.

The difference between improving and maintaining is somewhat blurry. For the sake of clarity here, we define improvements as adding new functionality or changing existing functionality significantly and maintenance as keeping existing functionality working in a constantly changing environment (new customers, new datasets, infrastructure evolution, etc.).

Some principles included in the book are mostly focused on ML system improvement. They help identify weak spots and growth points of a system and even new applications sometimes.

Finally, some principles are more oriented toward ML system maintenance. The sad truth is that very often systems are maintained by teams who didn’t participate in building them. So the responsibility cuts both ways: the building team should keep some principles in mind to simplify the lives of their successors, and the maintenance team should understand the principles to be able to grasp the whole system logic in a timely manner and find proper workarounds to keep the system alive over a long period of time.

In our experience, close to 100% of ML projects that didn’t have a well-written design document have failed, whereas the vast majority of those systems that were thoroughly planned found success. The design document doesn’t have to be a complex, multipage piece of documentation; several pages of condensed information are often enough. Either way, it plays two major roles. Not only does it set proper priorities within a project, but it also helps explain whether you actually need this project in the first place and pulls your gaze away from the core idea (you might be too focused on the project itself) so you can see the whole picture. Please see chapter 4 for details.

After working for multiple businesses, we can firmly say that once there’s structured documentation describing all aspects of your system functionality, any activity, from onboarding newly hired employees to applying core changes, goes many times faster. Instead of searching for the one and only loremaster who keeps all the knowledge to themselves (and still can’t guarantee precision), you can consult a specific document in the library.

Campfire story from Arseny

A long time ago, I worked for a ride-hailing company. One of its ambitious projects was to build a system for ride fare estimates. The regular pricing model was exactly like the one old-school cabs used for charging passengers: fare = X * time + Y * distance. The company needed to estimate the fare before the actual ride happened to inform both the driver and the passenger.
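The pricing rule above can be sketched in a few lines of Python. This is only an illustration: the rate values and trip numbers are hypothetical, and because the fare had to be quoted before the ride happened, both inputs are assumed to be model predictions rather than measured values.

```python
def estimate_fare(time_min: float, distance_km: float,
                  rate_per_min: float, rate_per_km: float) -> float:
    """Linear fare from the story: fare = X * time + Y * distance."""
    return rate_per_min * time_min + rate_per_km * distance_km


# Hypothetical rates (X, Y) and a hypothetical predicted trip.
fare = estimate_fare(time_min=20.0, distance_km=8.0,
                     rate_per_min=0.5, rate_per_km=1.25)
print(fare)  # 20.0
```

Since the rule is linear, any error in the predicted time or distance carries over proportionally into the quoted fare, so the quality of the estimate depends entirely on the quality of those two predictions.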

The project seemed clear and straightforward from the very start. All we needed to do was to fit a simple model that used geo features from the map service and wrap it as a microservice. It seemed so simple I didn’t even think about writing a design document.

How the system initially looked in Arseny’s imagination: a simple step-by-step algorithm

In reality, there were multiple pitfalls (we will cover most of them in the following chapters):

Geo features weren’t enough for precise estimation, and more complicated features required advanced infrastructure (aka feature store, although back in the day, this was not a popular wording or pattern). This will be covered in chapter 11.
As the model became more sophisticated, its predictions became less reliable (a certain number of results would turn out to be outliers—values that were either too big or too small).
Errors were not uniformly distributed, so the model was biased. We touch on this topic in chapter 9.
The executives wanted to override fare estimations sometimes with some promo activities or heuristics-based shortcuts. This subject is discussed in chapter 13.
Too much time was spent building a model that didn’t really solve the exact problem. We touch on this topic in chapter 2.
The whole problem was prone to distribution drift and thus required smart monitoring. We cover more of the subject in chapter 14.
The infrastructure was not ready for the scenario, which led to unacceptable latency in peak hours. This topic is covered in chapter 15.
Some other teams were not aware that the system was being developed, and it led to API mismatches. This topic is discussed in chapter 16.
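To make the monitoring pitfall from the list above a bit more concrete, here is a rough sketch of one common way to flag distribution drift in a single numeric feature: the population stability index (PSI). It is an illustrative technique picked for this example, not the approach used in the project described in the story; the function name, the number of bins, and the floor for empty bins are all assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10) -> float:
    """Population stability index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0
    # if the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range go into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins (assumed value).
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A value of zero means the binned distributions match, and larger values indicate a bigger shift. What counts as an alert-worthy value is a judgment call that depends on the feature and on the system being monitored.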

How the system looked after several iterations

The system was not deployed after all—before all the problems were fixed, the market situation changed significantly, and the need for the initial system faded. While the original idea for the system was great (some competitors used similar ideas), my colleagues and I failed to implement it in a proper way: some key aspects, both tech and product-related, were totally missed and were discovered only in the late stages of the project when the price of changes skyrocketed. At the same time, if some aspects had been taken into account at earlier stages, addressing them would have been trivial. If only I, my boss, or my teammates had read a book like this, we could have avoided this failure.

Still, for every few stories of failures, there’s a story of success. The next story has less drama and might seem boring, but it’s worth sharing for the sake of balance. Back in the day, Valeriy used to work at the Russian tech giant Yandex, when it acquired a startup providing real-time recommendations. When mergers like this happen, it takes time to fine-tune cooperation between the existing and new units, onboard new staff, sync business processes, etc. In this case, however, he was amazed at how smoothly and seamlessly a new business was integrated into a massive corporation. The reason behind it was a well-built design document that made this transition possible.

To summarize, we strongly believe that arranging a design document, preceded by asking your business the right guiding questions and setting up proper goals, is the key to success for your ML system—or a reason to cancel the project at the earliest stage, which is also a positive outcome, considering how much time, effort, and money you can save by dropping an unwanted activity. We’ll dedicate at least three chapters to this stage of the project, as this is the most crucial part you’ll have to deal with.

Summary

While it’s a relatively new term, ML system design is based on the classic fundamentals of software development, incorporating the existing knowledge from related disciplines. In this book, we will try to reorganize this knowledge base into a set of working algorithms.
Whereas MLOps can be considered a set of tools for building and maintaining your ML system, think of ML system design as a framework that will weld the whole system together.
To succeed in designing ML systems, it is crucial to be equally experienced in such disciplines as ML, software engineering, project management, product management, and leadership.
Before designing an ML system, you should know what you are building, what the purpose of the system is, and how it should be built.
The pillars of a successfully designed ML system are a consistent approach, a well-planned roadmap, and a list of preliminary actions that will organize your work and save time in the long term.