In chapter 2, we discovered that identifying a problem is the key element to developing a successful machine learning (ML) system. The better and more precisely you describe the problem, the higher the probability of building a product that will efficiently meet business goals.
Now we will delve into several key aspects that mark the next important stage of designing a comprehensive and efficient ML system—the solution space. This chapter will tell you more about finding solutions that helped solve similar problems in the past, the always tough choice between building your own components and buying third-party products, a proper approach to decomposing the problem, and picking the optimal degree of innovation, depending on the main objectives of your future system.
If I have seen further, it is by standing on the shoulders of Giants.—Isaac Newton
Imagine you work for a taxi service like Uber or Lyft, and there is a well-known fraud pattern: a legitimate driver starts working for the company but later passes their account to a person who can’t be a driver (they may even have no active driver’s license at all). Your goal is to perform person reidentification: take the driver’s photo from the document they uploaded when signing up, prompt the driver to take a selfie from their car, and verify it’s the same person as displayed on the driver’s license. At the same time, there are very reasonable nonfunctional requirements: for the sake of privacy, you would prefer to avoid uploading the driver’s photo from their device to your servers. One more aspect is that verification should be fast enough and resistant to various adversarial attacks (fraudsters can be so tricky!).
Let’s summarize this case based on this information:

- Face verification—match a selfie taken in the car against the photo from the uploaded driver’s license.
- On-device inference—for privacy reasons, the driver’s photo should not be uploaded to your servers, so the model must run on the phone with limited resources.
- Robustness—verification should be fast and resistant to various adversarial attacks, which implies some form of liveness detection.
All of these problems are commonly solved in the industry, but they are rarely dealt with in a single solution. The following are some examples. Big surveillance systems (like those used for airport security) do a lot of face recognition, but they are rarely limited in computing power, and their inference does not have to be squeezed into a phone. Many consumer entertainment apps, on the other hand, run inference on mobile phones, and their developers are very proficient in running models with limited resources. Finally, liveness detection is usually applied to biometric systems used for authentication (FaceID on the iPhone is the most common example).
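To make the verification step concrete, here is a minimal sketch of the final comparison, assuming face embeddings have already been computed by some on-device model (the embedding model and the 0.6 threshold are hypothetical placeholders):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(license_embedding, selfie_embedding, threshold=0.6):
    """Decide whether two face embeddings belong to the same person.

    The threshold is made up for illustration; in a real system it is
    tuned on a validation set to trade off false accepts against false
    rejects. Liveness detection would be a separate check before this one.
    """
    return cosine_similarity(license_embedding, selfie_embedding) >= threshold
```

Note that this sketch says nothing about the hard parts—training the embedding model, squeezing it onto a phone, and resisting spoofing—which is exactly why these problems are usually treated separately.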
Nothing beats experience, so if you’re lucky enough to have successfully coped with all three problems, go right ahead. If not, we recommend you dedicate time to looking through use cases in various ML domains, because breadth of mind is your best friend here. You usually can’t work with tens of production ML systems during a single year of your career, but studying this number of use cases is achievable and can compensate for the lack of experience.
While designing a system, it is useful to recall similar systems and use them as a reference. You’re not obliged to copy certain patterns directly, but they can serve as an inspiration. We also advise that you not neglect failure stories, as they can become a hint of what to avoid in your case. This approach somewhat overlaps with the antigoals concept that we will touch on in chapter 4.
As often happens in the software world, there are at least two aspects of similarity: the domain aspect and the technical aspect (as shown in figure 3.1).
The former is about finding systems that are as close as possible in terms of a business problem; with the latter, we should recall systems with close technical requirements (e.g., platform, latency, data model, volume, etc.).
We also encourage you to ask yourself why certain decisions have been made in the system designs and solutions you find relevant. Such exercises are valuable for developing and eventually applying your own intuition when designing a complicated ML system, including the ability to resolve dilemmas such as whether to build from scratch or look for ready-to-go offers.
Imagine you work for Slack, a team messenger with support for audio and video conversations. It has a feature: close-to-real-time speech recognition of audio conversations.
But Slack was initially designed as a text-first messenger, and probably the share of users who utilize it for voice conversations is considerably smaller. Text captions are used even less often, as this feature is not enabled by default, and its application is somewhat limited. At the same time, the requirements for speech recognition accuracy are high: such a feature will be useless or even harmful if the quality does not meet expectations.
The need for noncore functionality with proper quality may encourage your feature team to scour the market in search of ready-made solutions. We can’t ignore Slack’s scale: before the peak of the pandemic, in 2019, it had 12 million daily active users. The currently claimed numbers have declined to 10 million, but that is still impressive. At this scale, using third-party tech provided by a vendor may cost too much, and kicking off the development of an internal, ideally tailored solution may be the optimal scenario. Which way would you choose?
There is a big dilemma related to complicated tech systems, including ML systems; it’s often called “build or buy.” When the problem is familiar, there is a good chance of finding a vendor already selling a solution as a service. Let’s capture the main angles of how to look at this dilemma.
Is the problem related to the core part of the business? It is a common practice to focus on key competitive advantages and use third-party services for commodities like infrastructure. Fifteen years ago, most companies had dedicated system administrators who managed massive servers in data centers; these days, most companies rent virtual machines from a cloud provider. That’s an example of using a third-party service for a critical piece of infrastructure that is, nevertheless, not crucial for winning the market. There is an exception for companies where server infrastructure matters a lot (e.g., high-frequency trading, adtech, or cloud gaming); in those businesses, this area is subject to significant investments in R&D.
Many companies use third-party services for ML-related problems like machine translation, speech recognition, antifraud, and many more. Validating drivers’ selfies with their license photos is a popular example of something to be delegated to a vendor.
Another aspect of the dilemma is economic. Say there is a vendor for this problem, and its service is good enough in terms of metrics, but the reasonable-price criteria are not met. Maybe your company is great at hiring talents in low-cost living areas (with a respective salary range), and thus building a system from scratch is cheaper compared to using a third-party solution. If a vendor provides reasonable pricing for a California-based VC-backed startup, it doesn’t mean the very same price is still reasonable for a company bootstrapped in Eastern Europe or Asia.
You can switch to an open source solution, but the choice between that and a purchased option may not be obvious. You can’t say the cost of an open source solution is zero, as its maintenance is often associated with hidden costs related to infrastructural work and potential problem-solving. On the other hand, using a purchased solution allows the delegation of many of these problems to the vendor, which means you will need a preliminary estimate of potential spending before sticking to a certain option.
There is also an aspect that is often not disclosed publicly but is still very relevant to this dilemma: careerism. Not every decision is made in the interests of the business, and the bigger the company, the more common the pattern. Consequently, some employees may be interested in pushing the idea of building, not buying, to deliver a big-impact project and thus justify their way to promotion or add a fancy achievement to their resume. Of course, we do not support this way of solving the build-or-buy dilemma, but since these cases are not a rare thing in the business, we can’t but mention them.
Overall, the build-or-buy dilemma boils down to several key factors that form the context in which you’re working. Buying a ready-to-go solution means saving time on development (which may be a factor if you’re a startup and release deadlines are tight and strict) and avoiding recruiting extra specialists who may be indispensable at the production stage but will be hard to find work for after the software is released. It also means that you get a tested, time-proven platform. However, you’re tied to a vendor’s schedule when it comes to patches or new releases. Building your own solution guarantees you’re in control of the feature set, scalability, and release calendar and can fix critical bugs on the go without depending on the vendor. But having more control comes with a higher price in other aspects: you will need in-house support, and you will definitely require a solid team of experienced developers.
We recommend going for “buy” if

- The problem is not part of your core business, and a commodity solution is good enough.
- Release deadlines are tight, and the time saved on development outweighs the loss of flexibility.
- You want to avoid recruiting extra specialists who would be hard to find work for after the release.
- A tested, time-proven platform matters more to you than control over patches and new releases.
Be sure to look for a stable platform with a proven reputation in the market. We recommend going for “build” if

- The problem is close to the core of your business and a key competitive advantage.
- You need control over the feature set, scalability, and the release calendar, including the ability to fix critical bugs without depending on a vendor.
- You have (or can hire) a solid team of experienced developers and can provide in-house support.
There is also an extremely important budgetary component, which you can’t neglect, but at the same time, it cannot be attributed to any of the previous lists. That’s because the budget can affect your decision in either direction. You want to choose the buy option if developing your own solution may lead to overspending. On the other hand, the build option is your pick if none of the off-the-shelf solutions fit within your budget. Whatever your case is, budget is a crucial element that always needs to be considered.
Let’s get back to the opening example with Slack. One of the ways to resolve the dilemma would be to start out with a vendor, make sure the functionality is appreciated by customers, highlight main usage scenarios, and kickstart an internal solution based on the gathered information.
Reminder: we have no idea how this feature was actually implemented. That’s just how we would approach it.
The ratio of build versus buy decisions tends to shift over time. For example, at least 9 out of 10 natural language problems that would have required a very custom solution in the 2010s are solvable by a simple large language model (LLM) API call in the 2020s, making building such models from scratch far less attractive.
Another dilemma may arise on the lower level of consideration, and that is open source tech versus enterprise-grade proprietary paid tech. At some point, you need to decide what database is used for storage or what inference server is preferable. It’s important to have extensive knowledge of nonfunctional requirements (like required uptime, latency, load tolerance, etc.) to answer this question. For an initial approximation, the logic is as follows: when you’re sure there is no need for urgent help from experts, the safe choice would be to use an open source solution. An opposite case would be when building a high-load, mission-critical system; it often makes sense to stand on the shoulders of a giant, such as a specific vendor. There are mixed scenarios as well—it is possible to buy enterprise-level support for open source solutions, and sometimes it can be a proper middle way.
It’s worth noting that the principles listed here are not ML specific—in fact, almost the same reasoning is applicable when we’re designing “regular” ML-free software.
One of the most useful tools in a software engineer’s toolbox is the “divide and conquer” approach, which is very applicable to ML, both at the level of low-level algorithm implementation and at the level of high-level system design. It’s the first thing you can apply when facing a complicated problem that seems unsolvable at its existing scale.
A canonical example of problem decomposition is a search engine design. A user can query any wild set of words, including those that were never queried before (around 15% of Google search queries are new), and get a relevant result in a few hundred milliseconds.
At a high level, a search engine effectively does one thing: it provides relevant results from a database quickly. Let’s focus on two properties here: relevance and speed. Would it be easier to fetch a somewhat relevant result quickly? We think so: just drop the sophisticated ranking algorithms and replace them with a simple “a document contains some of the queried words” heuristic. Scanning the whole database with such predicates is very doable. Would it be easier to find relevant results in a small subset of documents—thousands, not billions? Of course: on a small scale, we can apply sophisticated ML algorithms and big models, even if they are slow at inference.
We bet you’ve already guessed what we are leading to—it’s time to combine those steps and make a two-stage system. The first stage is fast candidate filtering, and the second stage is a more sophisticated ranking across the identified candidates. Such an approach has been used in many search engines for decades.
This example can be developed further: instead of one iteration of candidate filtering, there may be a cascade of them. So, based on the query language and user location, documents in other languages can be filtered out even before the candidate filtering, reducing the number of documents that need to be processed downstream, as seen in figure 3.2.
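The cascade described above can be sketched with toy in-memory documents (the document fields and heuristics are simplified stand-ins; a production ranking stage would be an ML model):

```python
def filter_by_language(docs, lang):
    # Stage 0: cheap metadata filter, e.g., drop documents in other languages.
    return [d for d in docs if d["lang"] == lang]

def filter_candidates(query_words, docs, limit=1000):
    # Stage 1: fast candidate filtering -- keep documents that contain
    # at least one of the queried words.
    qw = set(query_words)
    return [d for d in docs if qw & set(d["words"])][:limit]

def rank(query_words, candidates):
    # Stage 2: "sophisticated" ranking, applied only to the small candidate
    # set. Word overlap stands in for a heavy ML relevance model.
    qw = set(query_words)
    return sorted(candidates,
                  key=lambda d: len(qw & set(d["words"])),
                  reverse=True)

def search(query_words, docs, lang="en"):
    # Each stage is more expensive per document but sees fewer documents.
    docs = filter_by_language(docs, lang)
    candidates = filter_candidates(query_words, docs)
    return rank(query_words, candidates)
```

The key property is that the expensive stage never sees the full corpus, only what survived the cheap stages.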
Similar multistep pipelines are very popular in the computer vision domain: a deep learning model is applied first, with postprocessing responsible for the final answer. Another bucket of applications is related to texts and other semistructured data: one step extracts structured data, and these structs are processed downstream with more constrained models.
We know six reasons for decomposition:
We know this list may be incomplete, but these are the six most obvious reasons we’ve stumbled upon throughout our careers.
Sometimes pipelines are not designed in that sequential manner from the very beginning, and the idea of adding a step may appear later as part of further improvement. It may not be the best pattern, though: stacking up pieces one by one trying to cover the problems of a recently revealed previous step leads to a non-robust design, which is error-prone and hard to maintain because it doesn’t follow a single idea. On the other hand, it is absolutely acceptable to leave dummy stubs in the initial design and even first implementations (“later there will be model-based candidate fetching, but for now we use random samples as proof of concept”).
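A dummy stub of that kind can be as simple as the following sketch (names are illustrative): the interface is fixed while the implementation is a placeholder.

```python
import random

def fetch_candidates(query, documents, k=100):
    """Proof-of-concept candidate fetching: random sampling.

    The contract (query in, up to k candidate documents out) is what the
    rest of the pipeline depends on, so this stub can later be swapped
    for a model-based fetcher without redesigning the system.
    """
    return random.sample(documents, min(k, len(documents)))
```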
The design principles of ML systems are influenced by recent trends in the field. In the past, it was common to build pipelines featuring many small, sequential components. However, with the rise of deep learning models, the trend shifted toward an end-to-end single-model approach. End-to-end models can potentially capture more complex relationships in the data, as they are not limited by the assumptions and limitations of manual design; they also require less domain knowledge and reduce the accumulation of errors between steps.
Speech processing is a good example of how an end-to-end approach changed the design. Before end-to-end, text-to-speech (TTS) models typically included two main components: one processed text input and converted it into linguistic atoms such as phonemes, stress marks, and intonation patterns, and another synthesized human speech with a predefined set of rules or a statistical model to map the linguistic information to the sound waves.
End-to-end TTS models, on the other hand, do not rely on explicit linguistic information as an intermediate representation. Instead, they directly map text input to an audio waveform using a single neural network model.
While end-to-end models were successful, they were not capable of storing knowledge on their own and often had to rely on external databases in many applications.
Recently, LLMs such as GPT-4 have achieved impressive zero-shot performance, meaning they can answer questions directly without any additional input or task-specific training. However, these LLMs are computationally expensive and prone to hallucination (i.e., presenting false information as true; see “Survey of Hallucination in Natural Language Generation” for wider context), and their knowledge is implicit and not directly accessible for modification.
There is ongoing research into ways of combining the benefits of LLMs with the ability to use maintainable external sources of information. For example, the Bing AI and ChatGPT plugins use additional online sources in a way similar to how people use search engines, and Galactica by Meta AI was among the first to introduce the concept of a working memory token, which allows the model to generate a snippet of Python code that can be executed by an interpreter to provide a precise answer. These ideas are developed even further in Toolformer, a model specifically trained to use various third-party APIs. Similar ideas are reflected in the quickly growing open source framework LangChain. While these approaches are not yet widely used in production systems, they have the potential to change the way ML systems are decomposed.
Depending on their complexity and degree of novelty, ML systems may imply various levels of innovation. Some competitive areas require huge investments in research; in other domains, you can use a very basic ML solution. Let's find out how to define the level of innovation you need for your system.
Ask any stakeholder of any ML system this straightforward question: how good (aka accurate) should the final product be? The most common answers are usually “perfect,” “100%,” and “as good as possible.” But let's try to figure out what lies behind these straightforward yet ambiguous answers.
The answer “as good as possible” implicitly means “as soon as we meet other constraints.” The most obvious constraints are time and budget. Would they want a perfect ML system in 10 years? Most likely not. Is the “acceptable good” system shipped by the end of next quarter better? Most likely, yes.
We will elaborate on precisely understanding the difference between “good enough” and “perfect” later, in chapter 5. But at this earliest stage, when the design process has just started, the exact metric is not important yet; it’s a rough understanding that is critical.
With the experience we’ve gained creating, maintaining, and improving ML systems of various scales and objectives, we’ve identified three buckets of required perfection that all systems can be distributed among. Terms may vary, but to our mind, these are the most fitting:
A minimum viable system can be a very spartan solution with duct tape as the key bonding element. Aligned expectations from such a system would be “it mostly works,” and an observer will be able to detect various failure modes. Such systems are considered baselines and prototypes; no innovation is expected.
Human-level performance adds a certain bar. Many successful existing ML systems don’t even match human-level performance yet are valuable for companies. Thus, we can say that reaching this kind of performance requires a fair amount of research and innovation.
Finally, there is the best-in-class bucket. Some systems are hardly useful when they don’t beat a significant share of competitors—this is often the case in super-competitive domains like trading or adtech or global products like search engines. A tiny shift in accuracy may cost millions in profits or losses, and in such cases, ML systems are designed with the idea of reaching the best result possible.
Why do we even talk about innovation here? The bridge between the problem space and the solution space strongly depends on the level of innovation we assume from the very beginning. With the “minimum viable system” bucket, we have exactly zero innovation—we just use the simplest and fastest solution we know and move forward. On the other side of the spectrum, we get endless innovation, where a system is never ready, and the team is always looking for new improvements to implement in the next release.
Distributing problems among these three buckets would be a very powerful technique, but there’s one important factor we can’t ignore: the level of required innovation is not static. In many cases—especially in startups—things are built as minimally as possible, to be upgraded later. And it makes sense: the company first evaluates whether the functionality is required by customers (or internal users) and then addresses customer feedback to improve the system. If a shipped feature is unique to the market, even its minimalistic implementation brings so much value that competitors immediately get on to improving their own products. It moves the initial system from the first bucket to the second bucket, or even closer to the state-of-the-art league. Many startups face problems with such a transition, and cases of designing a system that can evolve from prototype to world-class gem (which is the art of engineering) are extremely rare. The lite version of such art is designing a system that can be rebuilt while keeping as many existing building blocks as possible, and that’s a fairly high bar to aim for.
Knowing the level of innovation you need and some high-level structure of the system, you can look for implementation ideas on a lower level. When this chapter was being prepared, there were five popular sources of information to dive into.
arXiv is a website distributing academic papers, mostly in science, technology, engineering, and mathematics disciplines. Math and computer science, including its subdisciplines, make up a solid share of the over 2 million papers published there.
arXiv is a good place to get familiar with academic perspectives on your problem. Other than just reading everything related to your keywords, we encourage you to use the citations and links mechanism: once you find a relevant paper, it’s likely you may be interested in getting familiar with older papers it mentions and newer papers citing it. arXiv is an ecosystem of its own kind—there are browser extensions and additional websites that can assist your search. A good start is to look for overview papers (often containing “survey” in their titles): usually they feature properly distilled wisdom on the topic.
arXiv on its own may seem a little too raw as a source of knowledge: it’s barely possible to read all new papers, and its search mechanism is somewhat primitive from a modern perspective. There are multiple popular tools on top of arXiv that simplify exploration; currently, we recommend modern search engines built on top of paper abstracts, although it is very possible that there will be another fancy tool by the time the book is published.
As it is easy to guess, Papers with Code is a compilation of academic ML-related papers that are accompanied by implementations in code form. Papers are grouped by topic and ranked by performance when possible.
You can find the closest problem from the academic world and see the top N papers solving this problem, their metrics, some meta information (e.g., does this approach require additional data?), and—what’s very important—links to public implementations. This website is a real game changer for those who prefer repositories to formal academic writing.
Since we’ve mentioned code implementations, we can’t avoid GitHub. The most popular platform for open source software, GitHub has repositories for any occasion. The downside derives from its scale: if you’re there to find something uncommon, you are effectively looking for a needle in a haystack.
GitHub is not specialized in the ML domain, but at the same time, most open source ML projects are hosted there.
The Hugging Face model hub is a major platform sharing numerous models and datasets. At the time of writing, the hub contained more than 560,000 publicly available ML models. Categorization and tags work quite precisely, with a huge portion of the models offering small interactive web-based demos to display their capabilities.
Hugging Face as a company started with a focus on natural language processing (NLP), and the hub has been the main platform for sharing NLP-oriented models. We recommend going there for research-related models if an ML problem you’re solving includes text processing.
Kaggle is the most popular platform for competitive ML. Organizations use the platform to host challenges and lure the world’s best ML practitioners to fight for monetary prizes and, of course, glory. During competitions, participants share their ideas and code snippets related to a given challenge. At the end of a competition, winners and leaders usually reveal their secrets. Along with competitions, Kaggle serves as a hosting site for multiple datasets, so there is a good chance of finding a public dataset related to your problem.
Kaggle is the most exceptional piece on this list for several reasons. If a competition is organized poorly, the problem is somewhat ill-posed: instead of solving the real problem, competitors may try to look for shortcuts like data leakage. Also, final solutions are usually not applicable in practice: the models are gargantuan because latency limits can be off the table. Finally, the code snippets are rarely clean: contestants aim for rapid iterations, not for long-time maintenance.
Yet with all the mentioned disadvantages, Kaggle forums can be a source of great overviews for your problem, including both academic papers and hacker-style code that may become academic mainstream later. It’s also worth mentioning that there are websites aggregating the best Kaggle solutions.
We would like to highlight the fact that the current stage still doesn’t require choosing solutions based on this research. It should give you more details on the landscape to make your decision-making process more reliable.
Let’s reiterate the points mentioned in the previous section using a single detailed example. Imagine you’ve joined a stock photo company. The business is effectively a marketplace: photographers join the platform and upload their shots, and customers who are looking for specific images for illustrative purposes (editors, designers, ad professionals) purchase rights for these photos. The marketplace makes money through commissions from sales. The company is highly interested in creating an effective search system on its website.
From one perspective, the photo stock is huge, featuring millions of images. When customers look for photos, they are often interested in something specific, which is hard to find with simple categorization or other naive taxonomy. So you’ve been hired to build a modern search tool that will be able to find the most relevant shots upon text queries from customers. How should you understand the landscape for the problem?
The build-or-buy question arises first. Let’s assume you’re guessing that companies of your scale usually design their own solutions, but that’s not always the case, so some reconnaissance is due. You can easily discover that many vendors—both huge enterprises and young startups—provide search engines as a service. When you prospect them, it’s very likely that many solutions will turn out to be irrelevant: your company needs a search engine for images based on text queries, which is not the most popular paradigm. There may be several tech providers offering something relevant, though, so let’s keep them in mind.
First, let’s consider the similar problems other companies solve:
Search engines are one of the most popular applications of the information retrieval discipline. Its practitioners were among the early adopters of many ML methods but didn’t limit themselves to ML-only approaches. Familiarizing yourself with the discipline (or refreshing your memory) at a high level, starting from Wikipedia, is suitable for those who don’t feel confident in the domain. After learning more about information retrieval, you can dive deeper by reading about image retrieval.
While reading documents on building search engines, you will definitely see a pattern for decomposition: as in many search engines, not every document should be ranked in your scenario. From the very start, a user can specify requirements: for example, the photo should be provided as a raw file (as opposed to a compressed JPEG), be at least 5,000 pixels wide, and cost no more than $50. Such conditions can narrow the search candidates from millions to tens of thousands very quickly, before we even touch on image and query semantics. This optimization would be very valuable and may become a cornerstone of your future design.
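A sketch of this metadata prefiltering step (the field names are made up for illustration):

```python
def prefilter(photos, min_width=5000, max_price=50, require_raw=True):
    # Narrow the candidate set with cheap metadata checks before any
    # semantic (and therefore expensive) ranking is attempted.
    return [
        p for p in photos
        if p["width"] >= min_width
        and p["price"] <= max_price
        and (not require_raw or p["is_raw"])
    ]
```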
Another thing you could find is the fact that under the hood, most search engines effectively do one thing. They calculate a relevancy score for the pair of user queries and potentially related items (documents) and rank items based on this score (see figure 3.4):
relevancy = f(encode_text(query), encode_image(item))
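As a sketch of this formula, assume the encoders have already produced embedding vectors and let cosine similarity stand in for f (a real system would train both encoders, and possibly f itself, on clickthrough or purchase data):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def rank_by_relevancy(query_embedding, items):
    # relevancy = f(encode_text(query), encode_image(item)), with cosine
    # similarity as a stand-in for f and precomputed image embeddings
    # standing in for encode_image.
    return sorted(items,
                  key=lambda item: cosine(query_embedding, item["emb"]),
                  reverse=True)
```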
In the case of our scenario, this brings up multiple open questions: How should the text query be encoded? How should images be encoded? What should the scoring function f be? Each of these questions is wide and deserves a separate book (or at least multiple chapters), so, to be succinct, we only suggest keeping those questions in mind when you dive deeper into the sources of information we mentioned before, from arXiv to Kaggle.
The next question is the degree of innovation you’re looking for. There are several thoughts here:

- The company already has some basic search based on categorization and naive taxonomy, so a bare minimum viable system would add nothing new.
- The budget is limited, so aiming for a state-of-the-art solution from day one is not an option.
The first point clearly shows that the very basic minimum viable product is not applicable here because you already have one. At the same time, limited budgets suggest you can’t aim for a state-of-the-art solution at first. Thus, you need to design a solid system with a limited budget and the option to improve it further.
In this chapter, we have covered the most important elements of your preparation for writing a design document. You now know how to decompose the problem, what external and internal factors will influence your approach to the build-or-buy dilemma, what online sources are most helpful, and how to decide on the degree of innovativeness your solution should carry.
All this knowledge will be the basis not only for writing a design document for your system but also for using it to understand whether you need an ML system in the first place. The last point might sound intriguing and even controversial, so we will try to elaborate on it (as well as many others) in the next chapter.