Book: Machine Learning System Design

2 Is there a problem?

This chapter covers

  • Problem space and solution space: which comes first?
  • Defining a problem as the most important step
  • Defining risks and limitations
  • Costs of a mistake

To succeed in machine learning (ML) system design, you need expertise in multiple fields, including project management, ML and deep learning, leadership, product management, and software engineering. However, stripped down to the bones, even the most complex and sophisticated ML solutions share the same framework and fundamentals as any other product.

The sheer variety and amount of knowledge gained in recent years give you unprecedented freedom to choose exactly the approach you want for your ML system, but no matter how refined your instruments of choice are, they’re no more than implementation mediums.

What are the business goals? How big is the budget? How flexible are the deadlines? Will the potential output cover and exceed overall costs? These are among the crucial questions that you need to ask yourself before scoping your ML project.

But before you start addressing these questions, there is a paramount action that lays the foundation for successful ML system design: finding and articulating the problem your solution will solve (or help solve). This point seems trivial, especially to skilled engineers, but based on our own experience in the area, skipping this step in your preliminary work is deceptively dangerous. It goes even further when we realize that some problems cannot be solved at a proper level, due to either the current state of available technologies or the aleatoric uncertainty of an ill-posed problem. While in the first case the problem can be a candidate for a future solution (e.g., today’s level of text generation would have seemed totally unachievable to an ML engineer in the early 2010s), in the second case the problem should not be tackled at all (e.g., one cannot build an algorithm that beats casino roulette).

In this chapter, we will cover the importance of knowing the problem before developing a solution; we will highlight risks and limitations you may face while defining a problem; and we will touch on what consequences can follow a mistakenly defined problem.

2.1 Problem space vs. solution space

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.
—Abraham Maslow, American psychologist

Imagine a boss coming to an engineer with an exciting new idea for a mind-blowing feature (we’ve all been there). For the sake of illustration, let’s make the example more specific. Steve works as an ML engineer at a growing SaaS company. Steve’s boss, Linda, just got back from a meeting with Jack, VP of sales, about a problem his team has been dealing with—too many customer leads and too few managers to handle them. Jack wonders if the ML team could come up with an AI solution that would automatically rank customer leads from best to worst based on potential profit for the company. This would help the sales team pick potential cash cows first and deal with the remaining leads as time allows. On paper, the feature looks stunning. It seems like a no-brainer!

Steve, a young but meticulous specialist, immediately has numerous questions regarding this project. What’s the due date for delivery? How big is the dataset of existing leads to build an ML model around? What’s the maximum time allowed to score a lead? What accuracy do we expect? What information do we have about each lead? How fast should the system be? What exactly does a “promising lead” imply? Which sales system do we integrate our solution with? After some back-and-forth Q&A, Steve knows the following:

Steve gets back to his desk and starts scoping the project. “Okay, this looks easy. We can frame it as a ranking or classification problem, craft some features, train a model, expose an API, integrate, and deploy—that should be it.” However, two things still bother him:

Three hours later, his browser is full of tabs with few-shot classification techniques and documentation on the CRM API. He wants to give his colleagues a precise estimate for project delivery, but he’ll have a hard time doing that because of one crucial mistake that may cost a lot at this early stage: while thinking and asking questions, he focused on the solution space, not the problem space.
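Steve’s initial framing can be sketched in a few lines. The snippet below is a hypothetical illustration, not a solution from the book: the features, labels, and data are all invented, and it assumes scikit-learn is available. It shows only how naturally the task maps onto binary classification—which is exactly why the solution space is so seductive before the problem space is understood.

```python
# Hypothetical sketch of "lead scoring as binary classification".
# All features, labels, and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Toy historical leads: [company_size, num_emails, days_since_contact]
X = rng.normal(size=(200, 3))
# Pretend conversion correlates with the first two features.
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Score new leads: probability of conversion, usable as a ranking key.
new_leads = rng.normal(size=(5, 3))
scores = model.predict_proba(new_leads)[:, 1]
ranked = np.argsort(-scores)  # indices of leads, best first
print(ranked)
```

Nothing in this sketch answers what a “promising lead” means to the sales team or why ranking would help them—those answers live in the problem space.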

To Steve’s understanding, the information he received was more than enough to come up with a suitable solution, while in reality, it was just the tip of the iceberg. The remaining context could only be discovered by asking numerous specifying questions of multiple people involved in the project.

What are the problem space and solution space? These are two exploration paradigms that cover different perspectives of a problem. While both are crucial, the former should always precede the latter (figure 2.1).

Figure 2.1 An experienced engineer always handles the problem space first with specifying questions.

The problem space is usually explored with “what?” and “why?” questions, often with chains of them. There is even a popular technique named “Five Whys” that recommends stacking your “why?” questions on top of each other to dig down to the very origin of the problem you’re analyzing. Typical questions often look like this:

After exploration, you are expected to have an understanding of what you should build and why.

The “what?” part, in its turn, is about understanding the customer and functional attributes (figure 2.2)—for example, “A tool that annotates customer leads with a score showing how likely it is that the deal will happen; it should assign the scores before sales managers plan their work at a Monday weekly meeting.”

Figure 2.2 The questions you must ask before starting your project and the crucial difference between them

In some companies, asking these questions is a job done solely by product managers. However, it’s not very productive for engineers to exclude themselves from problem space analysis, as a proper understanding of problems affects the final result immensely.

The solution space is somewhat the opposite. It’s less about the problem and customer needs and more about the implementation. Here, we talk about frameworks and interfaces, discuss how things work under the hood, and consider technical risks. However, implementation should never be done before we reach a consistent understanding of a problem.

Reaching a solid understanding before thinking of a technical implementation allows you to consider various workarounds, some of which may significantly reduce the project scope. Maybe there is a third-party plugin for CRM that is designed to solve this problem. Maybe the cost of errors for the ML part of such a problem is not really that important despite Jack’s first answer (stakeholders often start with the statement they need accuracy close to 100%!). Maybe the data shows that 95% of empty leads can be filtered out with simple rule-based heuristics. All of these assumptions lie outside the story, but if proven, each of them is an essential part of the overall context. It is unveiling this context that will give you insight into the problem.
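The last assumption—that simple heuristics might shrink the ML scope—can be made concrete with a short sketch. The field names and rules below are invented for illustration; the point is only that a rule-based pre-filter may remove most of the work before any model is trained.

```python
# Hypothetical sketch: filtering obviously empty leads with rule-based
# heuristics before any ML model is involved. Field names are invented.
def is_empty_lead(lead: dict) -> bool:
    """A lead with no contact info and no recorded activity
    is not worth scoring at all."""
    has_contact = bool(lead.get("email") or lead.get("phone"))
    has_activity = lead.get("page_views", 0) > 0
    return not (has_contact or has_activity)

leads = [
    {"email": "cto@example.com", "page_views": 12},
    {"email": "", "phone": "", "page_views": 0},   # empty lead
    {"phone": "+1-555-0100", "page_views": 0},
]

# Only leads that survive the heuristic would go to an ML model.
to_score = [lead for lead in leads if not is_empty_lead(lead)]
print(len(to_score))
```

If data showed that such rules discard 95% of leads, the ML component would only need to rank the remainder—a much smaller problem.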

There are three reasons why we began the chapter with Steve’s story. First, it’s common and will most probably resonate with you in one way or another. Second, it is applicable to any scenario, be it building a new system, modifying an existing solution, or passing an interview at a tech company.

Third, and most important, the scale and effect of consequences that derive from this kind of approach can be damaging to varying degrees:

All these cases require understanding the problem first.

2.2 Finding the problem

Organizations which design systems (in the broad sense used here) are constrained to produce designs which are copies of the communication structures of these organizations.
—Melvin E. Conway

Some old-school enterprise companies still maintain a culture that encourages ordinary engineers to focus on low-level implementation (just coding) and leave the design (including problem understanding and decomposition) to architects and system analysts. From our experience, due to increasing flexibility requirements, this culture is disappearing rapidly, giving way to more horizontal structures, with more problem understanding delegated to individual contributors.

This means engineers don’t have to be solid experts in the domain (it can be too complicated for a person without a proper background). The reason is simple: it’s hard to learn the nuances of building a stock exchange or manufacturing quality control between meetings, code reviews, and training new state-of-the-art neural networks. But having a broad understanding is a must before starting an ML system design.

We encourage you to write down a problem statement using an inverted pyramid scheme, with a high-level understanding at its base and nuances at the top. It is a common and effective top-down approach that will help you gather as much general information as possible, determine what data is most valuable to your project, and then, using point-by-point leading questions, delve into the specifics of the problem (figure 2.3).

Figure 2.3 The inverted pyramid scheme is an approach we recommend for gathering data required for a successful project launch.

At the very top level, you can formulate a helicopter-view understanding of the problem. That’s the level understandable to any C-level officer of the organization, where people don’t care too much about ML algorithms or software architecture—for example:

Having such a statement at the start gives many opportunities for the next exploration steps. Just try to question every word in a given sentence to make sure you can explain it to a 10-year-old child. Who are fraudsters? How do they attack? What report gave the initial insight about excessive prices? What bothers our customers the most? Where is the most time wasted? How do we measure user engagement? How are recommendations related to this metric? Ask yourself or your colleagues questions until you’re ready to build the next, broader block of the pyramid that expands the initial one.

This next pyramid block requires more specific, well-thought-out questions. One of the successful techniques is looking for the origin of the previous-level answers. How do we decide this behavior was fraudulent? What kind of manual tuning do our customers have to perform? How are user engagement and recommendation engine performance currently correlated?

An even more powerful technique involves looking for inconsistencies in answers; people tend to group objects based on their similarity and distinguish objects based on their differences. There may be users who look alike: some are considered spammers and should be banned, while others are still legit, even though their behavior may look identical to a person outside the problem domain. For an uninformed observer, the same added margin for similar goods may or may not be acceptable, but what are the criteria? An engineer doesn’t have to find all the splitting criteria in a problem statement (they’re not decision trees), but this is a good field in which to catch crucial signals and generate insights. This can be summarized by the following statement: trying to understand what people want is important; trying to understand what they need is critical.

Be sure to involve all the interested parties in the process. It’s not only your boss or product manager who cares about the project; you’re likely to have multiple stakeholders (it is crucial to understand which of the stakeholders is responsible for budgets and will be the point of approval for a given component of the system). It is often recommended to chat with experts at different levels to capture both strategic and tactical perspectives. A high-level executive knows much about the goal of a given initiative. On the other hand, the individual contributors who currently compensate for the absence of the system being designed know tricks and details that may substantially affect the design.

Once you feel confident enough to explain the problem in simple terms, it’s time to wrap it up. We recommend writing down your problem understanding. Usually, it’s several paragraphs of text, but this text will eventually become the cornerstone of your design document. Don’t polish it too much for now; it’s just your first (although very important) step.

The importance of this step may vary, depending on the organization or environment. Sometimes the problem is easy to understand but very hard to solve—this is common in established, competitive markets. At the other end of the spectrum are startups that disrupt existing markets; there, the initial understanding of the disruption is rarely correct. One of the authors worked at a company where up to 50% of his time on a project was spent on defining goals and relevant context. Once the context was clear, the ML engineering part of the project was smooth and straightforward.

Once the problem statement is explicit enough, it’s time to think about what we, as ML engineers, can do with it.

2.2.1 How we can approximate a solution through an ML system

Inexperienced or just hasty engineers often first try to drag the problem directly into a Procrustean bed of well-known ML algorithm families like supervised or unsupervised learning or a classification or regression problem. We don’t think it’s the best way to start.

For an external observer, an ML model is like a magic oracle: a universal machine that can answer any properly formulated question. Your job as an ML engineer is to approximate its behavior—to build this oracle using ML algorithms—but before mimicking it, we need to find the right question and teach users to ask it. In less metaphorical terms, this is where we reframe a business problem into a software/ML problem.

Some questions may seem very straightforward:

Even with the metaphor of a magical oracle, we had to add multiple caveats that affect the potential answer. We’ll pay attention to similar details and remarks here and there throughout the book, but the key point is this: there may be no single simple answer to the problem, and your ML system design must account for that in advance.

In our pricing example, there may be a spectrum of goals, from maximizing profit right here right now to growing the company in the long run. A good ML system would be able to adapt to a specific point in this spectrum. In the following chapters, we will discuss the tech aspects of doing so.
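One simple way to express such a spectrum of goals is a scoring function parameterized by the business objective. The sketch below is a hypothetical illustration; the function name, weight, and component scores are all invented.

```python
# Hypothetical sketch: one pricing score parameterized by a business goal.
# alpha is a dial between short-term profit and long-term growth.
def price_score(immediate_profit: float, long_term_value: float,
                alpha: float) -> float:
    """alpha=1.0 -> maximize profit right now;
    alpha=0.0 -> optimize purely for long-term growth."""
    return alpha * immediate_profit + (1 - alpha) * long_term_value

print(price_score(10.0, 2.0, alpha=1.0))  # pure short-term view
print(price_score(10.0, 2.0, alpha=0.0))  # pure long-term view
```

A system designed around such a dial can move along the spectrum without being rebuilt when the business goal shifts.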

Many ML practitioners, including Andrew Ng, a renowned AI expert, professor at Stanford University, and founder of Landing AI, suggest using the heuristic of a human expert: let’s build a system that answers in the same manner as an expert in the area would. It works for many domains (health care is a great example) and sets an early bar for how solvable a problem is with AI approaches. Unfortunately, it comes with disadvantages as well: there are problems where machines perform better than people. Such problems usually arise in domains where data is represented as a log of events (often of human behavior), not something carefully labeled. It’s easy to find such cases in the ad tech and finance industries. So human-level performance may be a fair bar to reach, but it’s not always the right one.
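The human-expert bar can be made concrete with a toy comparison. The labels below are invented for illustration; the point is that the expert’s own accuracy against ground truth becomes the reference, and a model may legitimately land above it.

```python
# Hypothetical sketch: human-expert labels as a performance bar.
# All labels and predictions are invented toy data.
def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

ground_truth  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
expert_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # the expert errs too
model_preds   = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

human_bar = accuracy(expert_labels, ground_truth)  # the bar to reach
model_acc = accuracy(model_preds, ground_truth)
print(human_bar, model_acc)
```

In event-log domains like ad tech, ground truth comes from observed outcomes rather than expert labels, which is exactly why the model can exceed the human bar.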

And only after the question is clear does it make sense to dig into the way of approximating it algorithmically and draft a model capable of doing so. It doesn’t have to be a single model: a pipeline of various models or algorithms is often a legitimate tradeoff. We will cover problem decomposition as part of the preliminary research covered in the next chapter.

2.3 Risks, limitations, and possible consequences

Imagine you’ve built a fraud detection system: it scores user activity and prevents malicious events by suspending risky accounts. It’s a precious thing—zero fraudsters have come through since its launch, and the customer success team is happy. But recently, the marketing team launched a big ad campaign, and your perfect fraud detector banned a fair share of new users based on their traffic source (it’s unknown and therefore somewhat suspicious, according to your algorithms). The negative effect on marketing could well have outweighed the benefit of detecting fraud.

You may find this example obvious and not worth attention. However, the reality is ruthless: situations like this often happen in companies where teams are misaligned, and that was one of the risks you should have kept in mind while designing the system. You shouldn’t think, “Our team is professional; a failure like that just can’t happen here.” So explicit thinking about risks is the way to go, as there’s a high chance of potential risks spreading beyond the project team or a single department.

With great power comes great responsibility—this popular adage is very applicable to ML software. ML is no doubt powerful. But besides the power, it has one more important and dangerous attribute: opacity to most observers, especially when the model under the hood is complicated. Thus, professional system designers should be aware of potential risks and existing limitations.

Software development classics suggest the idea of functional and nonfunctional requirements. In short, functional requirements are about the functionality of a new feature or system, its value, and its user flow, while nonfunctional requirements are about aspects like performance, security, portability, and so on. In other words, functional requirements determine what we should design, and nonfunctional requirements shape the understanding of how it should work under the hood. So when we talk about potential risks and limitations, we effectively gather nonfunctional requirements.

The cornerstone of any defensive strategy is a risk model. Simply put, it’s an answer to the “What are we protecting from?” question. What are the worst scenarios possible, and what should we avoid? Answers like “incorrect model prediction” are not informative at all. A detailed understanding aligned with all possible stakeholders is absolutely required.

Understanding the risks and limitations will affect many future decisions, and we will cover this later in chapters dedicated to datasets, metrics, reporting, and fallback. Before we do, though, we’d like to give a couple of examples displaying how considering (or ignoring) valuable data can affect your goal setting.

2.4 Costs of a mistake

When talking about the costs of a mistake, we’d like to quote Steve McConnell, who precisely defines the difference between robustness and correctness in his book Code Complete (2nd ed., Microsoft Press, 2004) using examples of building an X-ray machine and a video game:

As the video game and X-ray examples show us, the style of error processing that is most appropriate depends on the kind of software the error occurs in. These examples also illustrate that error processing generally favors more correctness or more robustness. Developers tend to use these terms informally, but, strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; returning no result is better than returning an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that leads to results that are inaccurate sometimes.
Safety-critical applications tend to favor correctness over robustness. It is better to return no result than to return a wrong result. The radiation machine is a good example of this principle. Consumer applications tend to favor robustness to correctness. Any result whatsoever is usually better than the software shutting down. The word processor I’m using occasionally displays a fraction of a line of text at the bottom of the screen. If it detects that condition, do I want the word processor to shut down?

This concept is even more applicable to ML systems, as they tend to be obscure for both developers and end users. A set of if and while statements is easier to keep in mind compared to enormous sequences of matrix multiplication in modern deep neural networks.

Imagine you’re building an entertainment app like an AR mask for Snap or TikTok. In the worst case, the added effect will look ugly for a frame—not a big risk, so robustness is a proper approach here. The opposite case is an ML solution for medical or transport needs. Would you prefer a self-driving car that just moves forward when it’s not sure if there is a pedestrian nearby? Definitely not: that’s why you want to opt for correctness here.
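The two error-handling styles can be contrasted directly in code. This is a hypothetical sketch: the failing model stub, function names, and fallback value are all invented for illustration.

```python
# Hypothetical sketch: the same prediction step wrapped in two
# error-handling policies. The model stub simulates a failure.
def predict_confidence(frame) -> float:
    raise RuntimeError("sensor glitch")  # simulate a failing model

def robust_predict(frame, fallback: float = 0.0) -> float:
    """Robustness: keep operating, even if the result may be inaccurate.
    Fine for an AR mask -- one ugly frame is harmless."""
    try:
        return predict_confidence(frame)
    except RuntimeError:
        return fallback  # keep rendering with a default effect

def correct_predict(frame) -> float:
    """Correctness: no result is better than a wrong one.
    Required for a self-driving car -- stop rather than guess."""
    try:
        return predict_confidence(frame)
    except RuntimeError:
        raise SystemExit("halting: refusing to act on unknown input")
```

The prediction logic is identical; only the policy around failures encodes which end of the correctness-robustness scale the system sits on.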

We’ll talk more about this tradeoff and its practical aspects in the third part of the book. At this point, we should only mention that understanding the costs of mistakes is one of the critical points in gathering predesign information. This is effectively a quantitative development of the risk concept: we first define what can go wrong and what we want to avoid, and later try to assign numerical attributes to those scenarios. The numerical aspect may vary greatly depending on the problem and doesn’t have to be precise at this point, but it’s essential for shaping the landscape.

From our experience, people tend to think more about positive scenarios, while in reality, negative outcomes require more attention. The logic is simple: a system usually has one (or a few) success scenarios and many failure modes. Of course, the probability of each failure mode is usually far lower than the probability of a good outcome, but that may no longer hold once we compare expected values. Imagine a trading system that makes a few cents in 99% of deals and loses the whole capital with a probability of 0.1% or, to be more dramatic, a medical diagnostic system that saves 3 minutes per patient for highly paid doctors but misses a serious yet curable disease for every 1,000th patient.
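The trading example above reduces to a few lines of arithmetic. The gain per deal and the capital figure below are invented; only the probabilities mirror the example. The point is that a tiny failure probability can still dominate the expected value.

```python
# Hypothetical sketch: comparing outcomes by expected value, not by
# probability alone. Gain and capital figures are invented.
p_win, gain = 0.99, 0.05        # a few cents in 99% of deals
p_ruin, capital = 0.001, 100.0  # total loss with probability 0.1%

expected_value = p_win * gain - p_ruin * capital
print(expected_value)  # negative: the rare failure mode dominates
```

A system evaluated only on its win rate would look excellent here, which is exactly why negative scenarios deserve the extra attention.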

Some mistakes, though, can be harmless or even positive. Back in 2018, Arseny worked at a company making an AR application—a virtual try-on for footwear. The app allowed the user to see how a pair of shoes looked on their feet before purchasing it. One of the first versions of the app contained an underfitted model responsible for foot detection and tracking. As a result, shoes were often rendered not only on human feet but also on top of pet paws and even toys. Many of the early users found it hilarious, so the cost of such a mistake was not significant. As time went on, the effect disappeared once model performance was improved for more conventional user scenarios.

While estimating the cost of a mistake, you should also remember there may be second-order consequences. For example, your antifraud system might ban too many legitimate users today, and tomorrow they may spread negative word of mouth about your app (“Never use it; they banned me for nothing”), which may bury your growth potential. Or your recommendation system provides unrelated suggestions, and you later end up training a new model on logs of the rare clicks on those poor recommendations, thus falling into a negative feedback loop.

Another classic example of the cost of a mistake is credit risk scoring, a common task found in almost any bank. Before being accepted or rejected, a borrower’s application is usually processed by an ML-based system that outputs a risk score. The risk score can be binary (1/0, determined by a specified threshold) or continuous between 0 and 1.

Obviously, the cost of giving a loan to a client who defaults is not the same as the cost of denying a loan of the same amount to a customer who would have repaid it. How many successfully repaid loans does the bank need to outweigh one borrower who goes bankrupt? Should we count the number of loans issued or the amount of money lent? Do we expect this ratio to stay constant over time? Answering these questions and taking the information into account greatly increases the chance the project will be considered successful.
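These questions can be turned into a toy expected-profit calculation for picking a decision threshold. Everything below is an invented assumption for illustration: the cost ratio (one default assumed to cost as much as the margin on 20 repaid loans), the applicant scores, and the brute-force threshold search.

```python
# Hypothetical sketch: choosing a risk-score threshold under
# asymmetric costs. The cost ratio and data are invented.
COST_DEFAULT = 20.0  # relative cost of approving a loan that defaults
GAIN_REPAID = 1.0    # relative gain of approving a loan that is repaid

def expected_profit(threshold, scored_applicants):
    """scored_applicants: list of (risk_score, actually_defaulted)."""
    profit = 0.0
    for risk, defaulted in scored_applicants:
        if risk < threshold:  # approve the loan
            profit += -COST_DEFAULT if defaulted else GAIN_REPAID
    return profit

applicants = [(0.05, False), (0.10, False), (0.40, False),
              (0.45, True), (0.90, True)]

# Brute-force search over candidate thresholds.
best = max((t / 100 for t in range(101)),
           key=lambda t: expected_profit(t, applicants))
print(best, expected_profit(best, applicants))
```

With a symmetric cost assumption, the chosen threshold would sit much higher; making the asymmetry explicit is what turns "accuracy" into a business-aligned objective.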

What does it mean for a person designing an ML system? Identifying the risk landscape helps us understand what kind of problems are to be avoided. Some errors are almost harmless, some can greatly affect business, and some can be life-threatening. A proper understanding of the costs of a mistake with regard to the system being designed is critical for the next steps, as it shapes requirements for reliability and data gathering, suggests better metrics, and may affect other aspects of design.

Summary
