Book: Machine Learning System Design
Back: 7 Validation schemas
Next: Part 3 Intermediate steps

8 Baseline solution

This chapter covers

  • What a baseline is
  • Constant baselines
  • Model baselines and feature baselines
  • A variety of deep learning baselines
  • Baseline comparison
Everything should be made as simple as possible, but not simpler.
— Albert Einstein

When we start to think about the building blocks of our future machine learning (ML) system, its essential part, the core component, seems to be a model built using ML techniques. In some sense, this is so true that we may even think, “This is it: this is the primary point where I should spend most of my time, energy, and creative power.”

But in reality, this may turn out to be a trap that the majority of ML projects fall into and get bogged down in without ever reaching production. An ML model is not necessarily the most important thing in the context of an ML system and its design document. Although the temptation is great, you should always keep in mind that it is extremely easy to spend a lot of time, team effort, and, more importantly, money on building a cool, modern, and sophisticated AI model that doesn’t ever bring any value to users and your company. A mediocre model in production is usually better than a great model on paper.

One of the first versions of this book’s title was Machine Learning System Design That Works, and it corresponds to the primary goal of any ML project, which is to build a system that will work; only then, when it brings profit, will we start to iteratively improve it while gradually increasing its complexity (if needed). In this chapter, we will discuss the baseline solution, the first step in bringing our system to life. We will cover why baselines are needed, as well as the purpose of building them. We will go all the way from constant baselines to sophisticated specialized models and also through various feature baselines.

8.1 Baseline: What are you?

A baseline is the simplest possible (but working!) version of a model, feature set, or anything else in your system. It’s the minimum viable product (MVP) in the world of ML systems that brings value from the start without yet diving into complexity. Let’s elaborate a bit more on the MVP analogy by outlining key goals that may equally apply to both:

These three points form the grand basis of similarities between a baseline and an MVP. However, there are three more purely baseline-specific goals:

So what are the advantages of a well-chosen baseline? Simplicity automatically brings a lot of pros with it: a simple model is robust, not prone to unexpected behavior or overfitting (due to fewer degrees of freedom), easy to build and maintain, and undemanding of computing resources. Consequently, baselines are easy to scale. As an additional bonus, from non-ML colleagues’ perspective, simple models are easier to interpret, making it easier to understand what is going on under the hood. This can help increase trust in our ML product, which is critical when the stakes are high. However, simplicity is not a goal in itself but a valuable property.

If we think of our ML system as a Lego model, a baseline is an opportunity to assemble all the other blocks as fast as possible. Still, we encourage you to make your system as modular (i.e., “orthogonal”) as possible by design. This will make later updates easier, including the transition to more complex models and features (the initial design doesn’t dictate how fast you can update the system in the future). With a baseline, the initial system should be simple and agile, not trivial or restricted.

Still, despite all the advantages baselines provide without requiring much, they are not used as often as they deserve. The bitter truth is that, unfortunately, complexity sells better. There is a brilliant article by Eugene Yan that we strongly recommend reading, “Simplicity Is an Advantage but Sadly Complexity Sells Better,” which highlights the main reasons why many choose complexity over simplicity:

This leads to complexity bias, where we give undue credit to and favor complex ideas and systems over simpler solutions.

Of course, baselines are not a silver bullet, and there are reasonable cases when baselines are not necessary or are even irrelevant:

Still, we believe that even if complexity at an early stage can be justified in certain cases, it can’t be the go-to solution by default: it incentivizes people to make things unnecessarily complicated; it encourages the “not invented here” mindset, where people prefer to build from scratch and avoid reusing existing components even though reuse saves time and effort; and it wastes time and resources while often leading to poorer outcomes.

That is why we believe a baseline solution is the first thing to do, with incremental improvements where and when needed.

8.2 Constant baselines

A good metaphor for a baseline is building a bridge: sometimes you don’t need a team of bridge construction engineers, huge budgets, plans, or years of work. Sometimes you just need a securely placed log. A baseline is that very log: a temporary, primitive, easy-to-build solution that allows you to connect components and solve the given task at a minimal scale (see figure 8.1).

The idea we want to convey before going into detail is simple: build a lean, operable ML system first, and improve it later. Think of the complexity of possible solutions as a continuum. Choose an appropriate initial point in this range based on the effort–accuracy tradeoff, and move ahead. Don’t spend too much time on modeling unless it’s necessary.

figure
Figure 8.1 Before building a complicated model, start with a primitive baseline, which may well be the most appropriate foundation for your future ML system.

Keeping that analogy in mind, let’s start the discussion with the most spartan solutions, the ones that look like a log bridge. When we initiate a search for a suitable baseline, we often ask ourselves, “What is the most straightforward ML model that could solve the problem?” or “What is the right ML model to start from?” but frequently these turn out to be the wrong questions to ask. We believe the right one sounds like, “Do we need ML at all to solve this problem?”

Sometimes we either don’t even need ML for the problem or at least should not reinvent the wheel on our own and can instead use a third-party vendor. We already discussed this alternative in chapter 3 (section 3.2).

But let’s say we decided to build our own model. Good modeling starts with no model at all: with trying to hack a defined metric by picking the most trivial and lazy solution from the solution space. It will be the very first approximation of our problem. You can argue that a constant baseline represents a model by itself. With a constant baseline, we approximate all dependencies and interactions by a constant.

To immediately give an idea of what we are talking about, here are a couple of examples that you already know:

In a way, a constant baseline is like the first term of the Taylor series or the mean predictor that serves as the first base estimator in gradient boosting (see figure 8.2). Neither even depends on the variable x, yet both already do something (however rough) related to our problem: no more, no less.

figure
Figure 8.2 A constant baseline is like the first term of the Taylor Series—the simplest approximation that sets the foundation for more complex models.
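To make this concrete, here is a minimal sketch of a constant baseline using scikit-learn’s `DummyRegressor`; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # features are ignored by a constant baseline
y = 10 + X[:, 0] + rng.normal(size=200)

baseline = DummyRegressor(strategy="mean")  # always predicts the training mean
baseline.fit(X, y)
preds = baseline.predict(X)

print(f"Constant prediction: {preds[0]:.2f}")
print(f"MAE of the constant baseline: {mean_absolute_error(y, preds):.2f}")
```

The resulting MAE is the number every subsequent model has to beat, which is exactly the benchmarking role discussed in the next section.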

8.2.1 Why do we need constant baselines?

There are two goals for building such a baseline.

The first one is benchmarking. It is helpful to get a baseline value of the selected metric for a trivial prediction: a simple sanity check that compares your model against rules of thumb. Indeed, it would be sad to spend 2 weeks on hardcore ML modeling only to find that the most straightforward possible baseline, implemented in 5 minutes, beats your model. It sounds ridiculous, but the situation is quite common in real life.

There’s a cool story from Valerii about this. He was lucky enough to work with an engineer who is a wonderful person and a great specialist. Once, she won an ML competition whose goal was to predict some factory time series, using just a constant baseline (or, as she likes to correct him, a stepwise constant). Conducting ML competitions is usually a very straightforward process: participants get a labeled dataset and an unlabeled one, and their goal is to build a model on the labeled dataset whose predictions on the unlabeled dataset come closest to the actual values (available to the organizers only). Now imagine the frustration of the other participants, who had been engineering dozens of features and tuning the parameters of their gradient-boosting models for months.

This case inspired us to look for and start with the simplest models, and we’d like to encourage everyone to do the same (without limiting yourself to them, of course). Starting simple gives you a much more adequate understanding of your metric and target values from the beginning, so you get a vision of what can and cannot be done with the given data.

The second goal of constant baselines is to provide a bulletproof fallback. If your real ML model could not make a prediction during runtime, due to some raised error, because of running into response-time constraints, or because there is no history for calculating features (aka the cold start problem with new users and new items)—or it simply goes crazy (which sometimes happens)—your ML service should return at least something. So, in this case, a constant baseline is all you need.
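A minimal sketch of such a fallback wrapper might look like this; all class and variable names here are hypothetical, invented for illustration:

```python
class PredictorWithFallback:
    """Serve a model's prediction, falling back to a precomputed constant
    when the model fails or has no features to work with (cold start)."""

    def __init__(self, model, fallback_value):
        self.model = model
        self.fallback_value = fallback_value  # e.g., the training-set mean

    def predict(self, features):
        try:
            if features is None:           # cold start: no history for features
                return self.fallback_value
            return self.model.predict([features])[0]
        except Exception:                  # any runtime failure of the real model
            return self.fallback_value

# usage with a deliberately broken "model"
class BrokenModel:
    def predict(self, X):
        raise RuntimeError("model went crazy")

service = PredictorWithFallback(BrokenModel(), fallback_value=4.2)
print(service.predict([1.0, 2.0]))  # -> 4.2 (fallback on error)
print(service.predict(None))        # -> 4.2 (cold start)
```

In a real service, you would also log every fallback hit so you can monitor how often the constant answer is actually served.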

At the same time, we can easily imagine situations where a constant baseline is too primitive and brings no value at all. So the simplest usable baseline should be more complicated, represented as a set of heuristics/regular expressions or as a shallow model. Constant baselines tend to work fine for simple regression/classification problems on tabular or small text data; however, it is impossible to build a chatbot or voice recognition system with a constant baseline.

8.3 Model baselines and feature baselines

If we move further along the complexity scale, our next stop is rule-based models, although it is not always possible to draw a clear line between constant baselines and rule-based ones, as in most cases we can define the latter as a constant on top of some grouping. There is also another well-known and illustrative example of rule-based baselines: starting to solve natural language processing problems with just regular expressions.

A couple of years ago, Arseny worked for a taxi aggregator company, where he was involved in developing a service that would predict the time it would take for the nearest car to get to a client. The problem was apparent: if we overpredicted, the client could decide not to wait and would look for another service; if we underpredicted, the client would wait longer than we promised, which would mean we disappointed them.

Arseny’s colleague, who was a senior engineer at the time, treated it like a standard regression task and started with models like “always predict 5 minutes” or “if borough == ‘manhattan’: return 4.” Long story short: these types of baselines were hard to beat with hardcore ML alchemy, and ironically, the latter was even in production for a while as a fallback.

The “if district == X: return Y” model is an excellent example of a rule-based baseline. We can generate similar models by taking the mean/median/mode of some category or several categories—in our example, the median by location.
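A sketch of this group-median idea, with made-up toy data echoing the taxi example (boroughs and ETA values are invented for illustration):

```python
import pandas as pd

# toy ride data: pickup borough and observed ETA in minutes
rides = pd.DataFrame({
    "borough": ["manhattan", "manhattan", "brooklyn", "brooklyn", "queens"],
    "eta_min": [4, 6, 9, 11, 14],
})

# "if borough == X: return Y" as a lookup table of per-group medians
eta_by_borough = rides.groupby("borough")["eta_min"].median()
global_median = rides["eta_min"].median()  # fallback for unseen boroughs

def predict_eta(borough: str) -> float:
    return eta_by_borough.get(borough, global_median)

print(predict_eta("manhattan"))  # -> 5.0
print(predict_eta("bronx"))      # -> 9.0 (global median fallback)
```

Note how the global median doubles as a constant fallback, so the rule-based baseline degrades gracefully on unseen categories.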

The further we go down the progression, the more complex our model gets and the more connections between objects’ properties and the labels it can find.

A typical sequence of baselines in an ML problem would begin with the following: constant baseline, rule-based baseline, and linear model (see figure 8.3). We need something more sophisticated and specialized only if these baselines are insufficient for our task.

figure
Figure 8.3 A typical sequence of baselines at the early stage of designing your model

For example, when building a recommender system, we start with some constant retrieval, then try collaborative filtering (e.g., alternating least squares), factorization machines, and, finally, deep learning (e.g., a deep structured semantic model) if needed.
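As a sketch, the “constant retrieval” starting point can be as simple as recommending the globally most popular items to everyone (the interaction data here is a toy example):

```python
import pandas as pd

# toy user-item interaction log
interactions = pd.DataFrame({
    "user": [1, 1, 2, 2, 3, 3, 3],
    "item": ["a", "b", "a", "c", "a", "b", "d"],
})

# constant retrieval: everyone gets the globally most popular items
top_items = interactions["item"].value_counts().head(2).index.tolist()
print(top_items)  # -> ['a', 'b']
```

Despite its crudeness, popularity-based retrieval is a surprisingly strong reference point that neural recommenders are routinely compared against.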

Whatever problem you face, be aware of simple approaches in this field. They don’t necessarily perform worse than more sophisticated ones.

A vivid example of that can be found in the paper “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches” by Maurizio Ferrari Dacrema et al., which took the RecSys Best Paper Award in 2019. The paper is notable for unveiling strikingly impressive stats. The research group examined 18 algorithms that had been presented at top-level research conferences in the preceding years. Only seven of them could be reproduced with reasonable effort. Things got even more interesting when it turned out that six of those seven could often be outperformed by relatively simple heuristic methods (e.g., those based on nearest-neighbor or graph-based techniques). The only remaining algorithm clearly outclassed these baselines; however, it could not consistently outperform a well-tuned nonneural linear ranking method.

A nonexhaustive list of examples from the already-mentioned Eugene Yan article includes the following:

Up to now, we have been talking about baselines focusing on models. But what about features? Features are effectively a part of the model and sometimes are even the most important part of it. In classic ML, we have to engineer features manually, and choosing features for a baseline is based on the same principles; we start with a small group of essential features (most likely, those that are easier to calculate). There are two ways of adding new features:

The sequence of baseline features we need to try should look like this: the original minimum set of features, all sorts of interactions and counters, then embeddings, and then something more complicated (see figure 8.4).

figure
Figure 8.4 A sequence of baseline features you need to try: from the simplest to the more complicated

What are the properties of a good bunch of features to start from? The answer is exactly the same as for the models, and we’ll discuss it further.

For a typical problem usually solved with deep learning methods, there can be a simple baseline built with shallow models. As we recall, deep learning is a part of representation learning, which means that instead of handcrafting features, we delegate this work to a neural network. However, for some problems like image or text classification, you can apply naive approaches (rule-based or linear model-based). For example, naive Bayes was a very strong baseline in the natural language processing world before BERT-like architectures emerged. In computer vision, some problems can be solved by using a histogram of pixel colors (or even just the mean/median value!) as features for a linear model. Having said that, for most scenarios, starting with a simple deep learning model, either a foundation model in a few-shot setup or a trained one, can be a better choice because these models have already proven themselves in a wide variety of tasks.
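As an illustration of the histogram idea, here is a sketch that classifies synthetic “dark” vs. “bright” images with color histograms and a linear model; the data is artificial, so the task is deliberately easy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenated per-channel histograms as a simple global image descriptor."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(feats) / image[..., 0].size  # normalize by pixel count

# synthetic stand-in data: "dark" vs. "bright" 32x32 RGB images
rng = np.random.default_rng(0)
dark = rng.integers(0, 100, size=(50, 32, 32, 3))
bright = rng.integers(156, 256, size=(50, 32, 32, 3))
X = np.array([color_histogram(img) for img in np.concatenate([dark, bright])])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
print(f"Train accuracy: {clf.score(X, y):.2f}")  # close to 1.00 on this toy data
```

Twenty-four histogram features and a linear classifier: no backbone, no GPU, and on easy tasks this kind of baseline can be embarrassingly hard to beat.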

Arseny once designed a take-home exercise for candidates, where they were provided with a script solving anomaly detection problems on a simple image dataset. The script contained two baselines—one with a neural network and one with a color histogram—and candidates were instructed to improve either of them to beat some metric. Both baselines already performed on a similar level, and both were implemented specifically poorly, so the candidates had room for improvement. The majority of candidates preferred working on a more complicated deep learning solution, and only the most experienced of them noticed that it was possible to reach the required result with a single line of code changed for a histogram-based baseline.

8.4 Variety of deep learning baselines

When a problem is not trivial and suggests the use of deep learning because of the data structure (which can be applicable to most computer vision or language processing problems), the variety of baselines is slightly different. The most common are reusing pretrained models and training/fine-tuning the simplest models.

Reusing pretrained models is a common practice if a problem is not unique and there is a model that has been trained on a similar task. For example, if we want to train a model that can recognize breeds of pets, we can reuse a model that was trained on the ImageNet dataset. ImageNet contains images of 1,000 classes, and more than 100 of them are dog breeds. So, if your goal is to recognize cats and dogs, you can reuse a model trained on ImageNet without retraining it. This is common practice for many generic problems like speech recognition, object detection, text classification, sentiment analysis, and so on.

A slightly more advanced version of this approach is reusing features (also known as embeddings or representations) from a pretrained model to train a simple shallow model. For example, you can take a model pretrained on the ImageNet dataset and use its representations from the last backbone layer (before the final classification layer) to train a simple linear model that classifies images into a custom set of classes. This approach is especially useful when the dataset is small and the final task is more or less trivial (e.g., classification), so training large models from scratch is not likely to work. It is also known as a specific case of transfer learning. We have seen cases where such a baseline was literally unbeatable, and no fancy models were able to outperform it.
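A sketch of such a “linear probe” follows. To keep it self-contained, the embeddings are simulated with clustered random vectors; in practice they would be the penultimate-layer activations extracted from a real pretrained backbone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for real backbone outputs: in practice, run your images through a
# pretrained network and keep the penultimate-layer activations.
rng = np.random.default_rng(7)
n_per_class, dim = 100, 512                 # 512-d, like a ResNet-18 backbone
class_centers = rng.normal(size=(3, dim))   # 3 custom classes
embeddings = np.concatenate(
    [c + 0.5 * rng.normal(size=(n_per_class, dim)) for c in class_centers]
)
labels = np.repeat([0, 1, 2], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.3, random_state=0, stratify=labels
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # the "linear probe"
print(f"Test accuracy of the linear probe: {probe.score(X_te, y_te):.2f}")
```

The backbone is frozen and only the linear head is trained, so the whole experiment fits on a CPU even with a few hundred labeled examples.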

An even more specific version of reusing pretrained models is using models or third-party APIs capable of zero-shot or few-shot performance (meaning they require no, or only a few, training samples to produce a result). A celebrated example of such an API is the GPT family, but there are many APIs available for different tasks. For example, all major cloud vendors have a long list of AI solutions, such as Amazon Rekognition or Google Cloud Vision AI in the computer vision niche; detailed information about them is out of the scope of this book.

Using a third-party API by a major vendor as a baseline has a nice side effect: it is a bargaining chip in negotiations, such as selling a software product to a big enterprise or proving a startup tech is solid to potential investors. Potential customers may not know what a good metric is for a given problem (recall chapter 5), but comparing your tech to an AWS solution frames the problem in a proper way. Arseny knows at least three startups that have used this approach, bragging about how they beat alternatives such as Amazon, Google, and OpenAI. In all three cases, the companies’ claims were legit, and that’s expected, as major vendors aim to tailor one-size-fits-all solutions, while startups can offer more niche ML systems doing one thing just great.

Finally, if none of these options work, you can try to train a simple model from scratch. However, we recommend avoiding recent state-of-the-art models and using something simpler and time proven. Recent models are often more “capricious” during training, while some older “stars” are better researched, and recipes for their stable training are well known. Popular examples of such models are the ResNet family for vision and the BERT family for natural language processing. Our personal heuristic is to start with models that are at least 2 years old, but it is not a strict rule. It depends on the level of innovation required for the system, as discussed in chapter 3.

It’s worth noting that there are multiple shades of fine-tuning between “training a shallow model on top of a pretrained model” and “training a model from scratch.” Choosing the right degree of fine-tuning is very case-specific and may require some experiments. For example, when training a text classification model based on BERT, you can gradually complicate the scope of training:

This variety leads to the question, “How do we choose a proper baseline?”

8.5 Baseline comparison

We have examined various feature and model baselines, starting with the trivial ones. Now we can answer the central question: how do we decide when to stop adding complexity, and how do we determine a suitable baseline for our system? There are multiple factors we should consider simultaneously; some of them are

The most fundamental is the tradeoff between a model’s accuracy (or another ML metric) and the effort it requires. The first component in the equation is accuracy. When you move from a constant baseline to a rule-based baseline, from a rule-based one to a linear model, or from original features to their aggregations and ratios, you already start to get a feeling for what increase in the metric these small changes give. Does the metric respond noticeably or not? How difficult is it to significantly surpass your constant baseline? Is it reasonable to invest more time in attempting to gain more accuracy?

In some sense, as an ML engineer, you do backpropagation by getting “feedback” from your training loop and updating your understanding of the problem with its data and accuracy distribution across the solution space.

The second component is effort. By “effort,” we mostly mean time and computing resources. No ML project has an infinite budget and, hence, an infinite amount of time. We consider the time required to implement a new model (or feature), train it, debug it, and test it. You should also pay attention to all the attendant complications and pitfalls that may arise on the way, especially infrastructural ones.

Let’s examine a constant baseline. It takes almost no time to implement, and it provides us with the lowest accuracy. So we will map it into the (0, 0)-point in time-accuracy coordinates (as shown in figure 8.5).

figure
Figure 8.5 Simple baselines are easy to build but sacrifice final system metrics (example for a time-series prediction).

Let’s take a look at the linear model. It requires more effort but also most likely provides us with better accuracy. We will probably find the corresponding point to the right of and higher than the previous one, and so on. On the other hand, it is important to understand that as the model improves and evolves (and therefore gains in complexity), the cost–accuracy ratio begins to decrease. A striking example of such a drop in efficiency is gradient boosting, which we mentioned earlier. Based on our estimates and experience, gradient boosting requires more effort than all the simpler models you would use at the earlier stages put together, while giving no dramatic increase in accuracy.
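This progression can be seen even on synthetic data: each step of the sequence constant, rule-based, linear improves the metric, and weighing each gain against the implementation effort is exactly the tradeoff discussed above (a sketch with artificial data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 5, size=n)               # a categorical feature
x = rng.normal(size=n)                           # a numeric feature
y = 2.0 * group + 1.5 * x + rng.normal(size=n)   # true relationship
X = np.column_stack([group, x])

# 1) constant baseline: predict the global mean
const = DummyRegressor(strategy="mean").fit(X, y)

# 2) rule-based baseline: mean of y per group
group_means = {g: y[group == g].mean() for g in np.unique(group)}
rule_preds = np.array([group_means[g] for g in group])

# 3) linear model using both features
linear = LinearRegression().fit(X, y)

for name, preds in [
    ("constant", const.predict(X)),
    ("rule-based", rule_preds),
    ("linear", linear.predict(X)),
]:
    print(f"{name:>10}: MAE = {mean_absolute_error(y, preds):.2f}")
```

Each step costs a few more lines of code; in a real project the question is whether the next step (say, gradient boosting) still buys enough metric for its much larger cost.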

We should estimate how long it would take to try a more complex model each time and how much additional accuracy it could provide. Once we understand that the next step requires too much effort for almost no significant score improvement, we should stop. This “early stopping threshold” differs depending on the concrete domain and problem.

But what if something goes wrong or some additional change in the model is required? What model would be easier to debug or update?

Which one would you prefer to face?

Maintenance, which we will touch upon in more detail in chapter 16, includes the amount of additional work that is necessary for debugging implemented features or a model. We could count maintenance as a part of the extra effort the more complex baseline requires.

Another essential property of a baseline is computation time. How does the computation time of our model and its features affect the response time? Does our baseline meet the service-level agreement? Computation time is a natural limit on the solution space, especially when dealing with real-time systems. But even with no real-time requirements, it also determines how fast we will iterate during more thorough experimentation in the future.
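Measuring the per-request latency of candidate baselines is cheap to do early; here is a sketch (absolute numbers are machine-dependent, and the data is synthetic):

```python
import time
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
X_one = X[:1]  # a single "request"

models = {
    "constant": DummyRegressor().fit(X, y),
    "boosting": GradientBoostingRegressor(n_estimators=100).fit(X, y),
}

latency_ms = {}
for name, model in models.items():
    start = time.perf_counter()
    for _ in range(100):                  # repeat to get a stable estimate
        model.predict(X_one)
    latency_ms[name] = (time.perf_counter() - start) / 100 * 1000
    print(f"{name:>8}: ~{latency_ms[name]:.3f} ms per prediction")
```

If the heavier model already strains the latency budget at this toy scale, that is an early warning about the service-level agreement before any serious investment is made.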

Finally, we have interpretability. This parameter matters when we deal with the very first iteration of any ML system, especially for other teammates. When we deal with, for example, sensitive or medical data, it becomes a safety problem, too, not just a question of trust in our model’s predictions. The general pattern is trivial: the simpler the baseline, the easier it is to explain how it works.

We’ll discuss this topic in detail later in chapter 11.

8.6 Design document: Baselines

Since baselines can be part of your design document, we are going to fill in this section for our fictional companies, Supermegaretail and PhotoStock Inc.

8.6.1 Baselines for Supermegaretail

Let’s start with the forecast system. Here, seasonality will be a huge factor when choosing a prediction model, so we must account for it in our design document.

8.6.2 Baselines for PhotoStock Inc.

Now we switch to the PhotoStock Inc. case, where we are building an advanced search engine set to provide better, more accurate results and eventually increase sales.

Summary
