Book: Machine Learning System Design
Back: 7 Validation schemas
Next: Part 3 Intermediate steps

8 Baseline solution

This chapter covers

  • What a baseline is
  • Constant baselines
  • Model baselines and feature baselines
  • A variety of deep learning baselines
  • Baseline comparison
Everything should be made as simple as possible, but not simpler.
— Albert Einstein

When we start to think about the building blocks of our future machine learning (ML) system, its essential part, the core component, seems to be a model built using ML techniques. In some sense, this is so true that we may even think, “This is it: this is the primary point where I should spend most of my time, energy, and creative power.”

But in reality, this may turn out to be a trap that the majority of ML projects fall into and get bogged down in without ever reaching production. An ML model is not necessarily the most important thing in the context of an ML system and its design document. Although the temptation is great, you should always keep in mind that it is extremely easy to spend a lot of time, team effort, and, more importantly, money on building a cool, modern, and sophisticated AI model that doesn’t ever bring any value to users and your company. A mediocre model in production is usually better than a great model on paper.

One of the first versions of this book’s title was Machine Learning System Design That Works, and it corresponds to the primary goal of any ML project, which is to build a system that will work; only then, when it brings profit, will we start to iteratively improve it while gradually increasing its complexity (if needed). In this chapter, we will discuss the baseline solution, the first step in bringing our system to life. We will cover why baselines are needed, as well as the purpose of building them. We will go all the way from constant baselines to sophisticated specialized models and also through various feature baselines.

8.1 Baseline: What are you?

A baseline is the simplest possible (but working!) version of a model, feature set, or anything else in your system. It’s the minimum viable product (MVP) in the world of ML systems that brings value from the start without yet diving into complexity. Let’s elaborate a bit more on the MVP analogy by outlining key goals that may equally apply to both:

These three points form the grand basis of similarities between a baseline and an MVP. However, there are three more purely baseline-specific goals:

So what are the advantages of a well-chosen baseline? Simplicity automatically brings a lot of pros with it: a simple model is robust, not prone to unexpected behavior or overfitting (due to fewer degrees of freedom), easy to build and maintain, and undemanding of computing resources. Consequently, baselines are easy to scale. As an additional bonus, from non-ML colleagues’ perspective, simple models are easier to interpret, making it easier to understand what is going on under the hood. This can help increase trust in our ML product, which is critical when the stakes are high. However, simplicity is not a goal in itself but a valuable property.

If we think of our ML system as a Lego model, a baseline is an opportunity to assemble all the other blocks as fast as possible. Still, we encourage you to make your system as modular (i.e., “orthogonal”) as possible by design. This will make later updates easier, including the transition to more complex models and features (the initial design doesn’t dictate how fast you can update the system in the future). With a baseline, the initial system should be simple and agile, not trivial or restricted.

Still, despite all the advantages baselines provide without requiring much, they are not used as often as they deserve. The bitter truth is that, unfortunately, complexity sells better. There is a brilliant article by Eugene Yan that we strongly recommend reading, “Simplicity Is an Advantage but Sadly Complexity Sells Better,” which highlights the main reasons why many choose complexity over simplicity:

This leads to complexity bias, where we give undue credit to and favor complex ideas and systems over simpler solutions.

Of course, baselines are not a silver bullet, and there are reasonable cases when baselines are not necessary or are even irrelevant:

Still, we believe that even if complexity at an early stage can be justified in certain cases, it can’t be the go-to solution by default: it incentivizes people to make things unnecessarily complicated; it encourages the “not invented here” mindset, where people prefer to build from scratch and avoid reusing existing components even though reuse saves time and effort; and it wastes time and resources while often leading to poorer outcomes.

That is why we believe a baseline solution is the first thing to do, with incremental improvements where and when needed.

8.2 Constant baselines

A good metaphor for a baseline is building a bridge: sometimes you don’t need a team of bridge construction engineers, huge budgets, plans, or years of work. Sometimes you just need a securely placed log. A baseline is that very log: a temporary, primitive, easy-to-build solution that allows you to connect components and solve the given task at a minimal scale (see figure 8.1).

The idea we want to convey before going into detail is simple: build a lean, operable ML system first, and improve it later. Think of the complexity of possible solutions as a continuum. Choose an appropriate initial point in this range based on the effort–accuracy tradeoff, and move ahead. Don’t spend too much time on modeling unless it’s necessary.

figure
Figure 8.1 Before building a complicated model, start with a primitive baseline, which may well be the most appropriate foundation for your future ML system.

Keeping that analogy in mind, let’s start the discussion with the most spartan solutions, the ones that look like a log bridge. When we initiate a search for a suitable baseline, we often ask ourselves, “What is the most straightforward ML model that could solve the problem?” or “What is the right ML model to start from?” but frequently these turn out to be the wrong questions to ask. We believe the right one sounds like, “Do we need ML at all to solve this problem?”

Sometimes we either don’t even need ML for the problem or at least should not reinvent the wheel on our own and can instead use a third-party vendor. We already discussed this alternative in chapter 3 (section 3.2).

But let’s say we decided to build our own model. Good modeling starts with no model at all: with trying to hack a defined metric by picking the most trivial and lazy solution from the solution space. It will be the very first approximation of our problem. You can argue that a constant baseline represents a model by itself. With a constant baseline, we approximate all dependencies and interactions by a constant.

To immediately give an idea of what we are talking about, here are a couple of examples that you already know:

In a way, a constant baseline is like the first term of the Taylor series or the mean predictor that serves as the first base estimator in gradient boosting (see figure 8.2). Neither even depends on the variable x, yet both already do something (however rough) related to our problem: no more, no less.

figure
Figure 8.2 A constant baseline is like the first term of the Taylor Series—the simplest approximation that sets the foundation for more complex models.
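To make this concrete, here is a minimal sketch of a constant baseline using scikit-learn’s `DummyRegressor`; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # features are ignored by a constant baseline
y = 10 + X[:, 0] + rng.normal(size=200)

baseline = DummyRegressor(strategy="mean")  # always predicts the training mean
baseline.fit(X, y)
preds = baseline.predict(X)

print(f"Constant prediction: {preds[0]:.2f}")
print(f"MAE of the constant baseline: {mean_absolute_error(y, preds):.2f}")
```

The resulting MAE is the number every subsequent model has to beat, which is exactly the benchmarking role discussed in the next section.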

8.2.1 Why do we need constant baselines?

There are two goals for building such a baseline.

The first one is benchmarking. It is helpful to get a baseline value of the selected metric for a trivial prediction: a simple sanity check that compares your model against rules of thumb. Indeed, it would be sad to spend 2 weeks on hardcore ML modeling only to find that the most straightforward possible baseline, implemented in 5 minutes, beats your model. It sounds ridiculous, but the situation is quite common in real life.

There’s a cool story from Valerii about this. He was lucky enough to work with an engineer who is a wonderful person and a great specialist. Once, she won an ML competition whose goal was to predict some factory time series, using just a constant baseline (or, as she likes to correct him, a stepwise constant). Conducting ML competitions is usually a very straightforward process: participants get a labeled dataset and an unlabeled one, and their goal is to build a model on the labeled dataset whose predictions on the unlabeled dataset come closest to the actual values (available to the organizers only). Now imagine the frustration of the other participants, who had been engineering dozens of features and tuning the parameters of their gradient-boosting models for months.

This case inspired us to look for and start with the simplest models, and we’d like to encourage everyone to do the same (without limiting yourself to them, of course). Starting simple gives you a much more adequate understanding of your metric and target values from the beginning, so you get a vision of what can and cannot be done with the given data.

The second goal of constant baselines is to provide a bulletproof fallback. If your real ML model could not make a prediction during runtime, due to some raised error, because of running into response-time constraints, or because there is no history for calculating features (aka the cold start problem with new users and new items)—or it simply goes crazy (which sometimes happens)—your ML service should return at least something. So, in this case, a constant baseline is all you need.
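A minimal sketch of such a fallback wrapper might look like this; all class and variable names here are hypothetical, invented for illustration:

```python
class PredictorWithFallback:
    """Serve a model's prediction, falling back to a precomputed constant
    when the model fails or has no features to work with (cold start)."""

    def __init__(self, model, fallback_value):
        self.model = model
        self.fallback_value = fallback_value  # e.g., the training-set mean

    def predict(self, features):
        try:
            if features is None:           # cold start: no history for features
                return self.fallback_value
            return self.model.predict([features])[0]
        except Exception:                  # any runtime failure of the real model
            return self.fallback_value

# usage with a deliberately broken "model"
class BrokenModel:
    def predict(self, X):
        raise RuntimeError("model went crazy")

service = PredictorWithFallback(BrokenModel(), fallback_value=4.2)
print(service.predict([1.0, 2.0]))  # -> 4.2 (fallback on error)
print(service.predict(None))        # -> 4.2 (cold start)
```

In a real service, you would also log every fallback hit so you can monitor how often the constant answer is actually served.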

At the same time, we can easily imagine situations where a constant baseline is too primitive and brings no value at all. So the simplest usable baseline should be more complicated, represented as a set of heuristics/regular expressions or as a shallow model. Constant baselines tend to work fine for simple regression/classification problems on tabular or small text data; however, it is impossible to build a chatbot or voice recognition system with a constant baseline.

8.3 Model baselines and feature baselines

If we move further along the complexity scale, our next stop is rule-based models, although it is not always possible to draw a clear line between constant baselines and rule-based ones, as in most cases we can define the latter as a constant on top of some grouping. There is also another well-known and illustrative example of rule-based baselines: starting to solve natural language processing problems with just regular expressions.

A couple of years ago, Arseny worked for a taxi aggregator company, where he was involved in developing a service that would predict the time it would take for the nearest car to get to a client. The problem was apparent: if we overpredicted, the client could decide not to wait and would look for another service; if we underpredicted, the client would wait longer than we promised, which would mean we disappointed them.

Arseny’s colleague, who was a senior engineer at the time, treated it like a standard regression task and started with models like “always predict 5 minutes” or “if borough == ‘manhattan’: return 4.” Long story short: these types of baselines were hard to beat with hardcore ML alchemy, and ironically, the latter was even in production for a while as a fallback.

The “if district == X: return Y” model is an excellent example of a rule-based baseline. We can generate similar models by taking the mean/median/mode of some category or several categories—in our example, the median by location.
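A sketch of this group-median idea, with made-up toy data echoing the taxi example (boroughs and ETA values are invented for illustration):

```python
import pandas as pd

# toy ride data: pickup borough and observed ETA in minutes
rides = pd.DataFrame({
    "borough": ["manhattan", "manhattan", "brooklyn", "brooklyn", "queens"],
    "eta_min": [4, 6, 9, 11, 14],
})

# "if borough == X: return Y" as a lookup table of per-group medians
eta_by_borough = rides.groupby("borough")["eta_min"].median()
global_median = rides["eta_min"].median()  # fallback for unseen boroughs

def predict_eta(borough: str) -> float:
    return eta_by_borough.get(borough, global_median)

print(predict_eta("manhattan"))  # -> 5.0
print(predict_eta("bronx"))      # -> 9.0 (global median fallback)
```

Note how the global median doubles as a constant fallback, so the rule-based baseline degrades gracefully on unseen categories.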

The further we go down the progression, the more complex our model gets and the more connections between objects’ properties and the labels it can find.

A typical sequence of baselines in an ML problem would begin with the following: constant baseline, rule-based baseline, and linear model (see figure 8.3). We need something more sophisticated and specialized only if these baselines are insufficient for our task.

figure
Figure 8.3 A typical sequence of baselines at the early stage of designing your model

For example, when building a recommender system, we start with some constant retrieval, then try collaborative filtering (e.g., alternating least squares), factorization machines, and, finally, deep learning (e.g., a deep structured semantic model) if needed.
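As a sketch, the “constant retrieval” starting point can be as simple as recommending the globally most popular items to everyone (the interaction data here is a toy example):

```python
import pandas as pd

# toy user-item interaction log
interactions = pd.DataFrame({
    "user": [1, 1, 2, 2, 3, 3, 3],
    "item": ["a", "b", "a", "c", "a", "b", "d"],
})

# constant retrieval: everyone gets the globally most popular items
top_items = interactions["item"].value_counts().head(2).index.tolist()
print(top_items)  # -> ['a', 'b']
```

Despite its crudeness, popularity-based retrieval is a surprisingly strong reference point that neural recommenders are routinely compared against.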

Whatever problem you face, be aware of simple approaches in this field. They don’t necessarily perform worse than more sophisticated ones.

A vivid example of that can be found in the paper “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches” by Maurizio Ferrari Dacrema et al., which took the RecSys Best Paper Award in 2019. The paper is notable for unveiling strikingly impressive stats. The research group examined 18 algorithms that had been presented at top-level research conferences in the preceding years. Only seven of them could be reproduced with reasonable effort. Things got even more interesting when it turned out that six of those seven could often be outperformed by relatively simple heuristic methods (e.g., those based on nearest-neighbor or graph-based techniques). The only remaining algorithm clearly outclassed these baselines; however, it could not consistently outperform a well-tuned nonneural linear ranking method.

A nonexhaustive list of examples from the already-mentioned Eugene Yan article includes the following:

Up to now, we have been talking about baselines focusing on models. But what about features? Features are effectively a part of the model and sometimes are even the most important part of it. In classic ML, we have to engineer features manually, and choosing features for a baseline is based on the same principles; we start with a small group of essential features (most likely, those that are easier to calculate). There are two ways of adding new features:

The sequence of baseline features we need to try should look like this: the original minimum set of features, all sorts of interactions and counters, then embeddings, and then something more complicated (see figure 8.4).

figure
Figure 8.4 A sequence of baseline features you need to try: from the simplest to the more complicated

What are the properties of a good bunch of features to start from? The answer is exactly the same as for the models, and we’ll discuss it further.

For a typical problem usually solved with deep learning methods, there can be a simple baseline built with shallow models. As we recall, deep learning is a part of representation learning, which means that instead of handcrafting features, we delegate this work to a neural network. However, for some problems like image or text classification, you can apply naive approaches (rule-based or linear model-based). For example, naive Bayes was a very strong baseline in the natural language processing world before BERT-like architectures emerged. In computer vision, some problems can be solved by using a histogram of pixel colors (or even just the mean/median value!) as features for a linear model. Having said that, for most scenarios, starting with a simple deep learning model, either a foundation model in a few-shot setup or a trained one, can be a better choice because these models have already proven themselves in a wide variety of tasks.
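As an illustration of the histogram idea, here is a sketch that classifies synthetic “dark” vs. “bright” images with color histograms and a linear model; the data is artificial, so the task is deliberately easy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenated per-channel histograms as a simple global image descriptor."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(feats) / image[..., 0].size  # normalize by pixel count

# synthetic stand-in data: "dark" vs. "bright" 32x32 RGB images
rng = np.random.default_rng(0)
dark = rng.integers(0, 100, size=(50, 32, 32, 3))
bright = rng.integers(156, 256, size=(50, 32, 32, 3))
X = np.array([color_histogram(img) for img in np.concatenate([dark, bright])])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
print(f"Train accuracy: {clf.score(X, y):.2f}")  # close to 1.00 on this toy data
```

Twenty-four histogram features and a linear classifier: no backbone, no GPU, and on easy tasks this kind of baseline can be embarrassingly hard to beat.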

Arseny once designed a take-home exercise for candidates, where they were provided with a script solving anomaly detection problems on a simple image dataset. The script contained two baselines—one with a neural network and one with a color histogram—and candidates were instructed to improve either of them to beat some metric. Both baselines already performed on a similar level, and both were implemented specifically poorly, so the candidates had room for improvement. The majority of candidates preferred working on a more complicated deep learning solution, and only the most experienced of them noticed that it was possible to reach the required result with a single line of code changed for a histogram-based baseline.

8.4 Variety of deep learning baselines

When a problem is not trivial and suggests the use of deep learning because of the data structure (which can be applicable to most computer vision or language processing problems), the variety of baselines is slightly different. The most common are reusing pretrained models and training/fine-tuning the simplest models.

Reusing pretrained models is a common practice if a problem is not unique and there is a model that has been trained on a similar task. For example, if we want to train a model that can recognize breeds of pets, we can reuse a model that was trained on the ImageNet dataset. ImageNet contains images of 1,000 classes, and more than 100 of them are dog breeds. So, if your goal is to recognize cats and dogs, you can reuse a model trained on ImageNet without retraining it. This is common practice for many generic problems like speech recognition, object detection, text classification, sentiment analysis, and so on.

A slightly more advanced version of this approach is reusing features (also known as embeddings or representations) from a pretrained model to train a simple shallow model. For example, you can take a model pretrained on the ImageNet dataset and use its representations from the last backbone layer (before the final classification layer) to train a simple linear model that classifies images into a custom set of classes. This approach is especially useful when the dataset is small and the final task is more or less trivial (e.g., classification), so training large models from scratch is not likely to work. It is also known as a specific case of transfer learning. We have seen cases where such a baseline was literally unbeatable, and no fancy models were able to outperform it.
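A sketch of such a “linear probe” follows. To keep it self-contained, the embeddings are simulated with clustered random vectors; in practice they would be the penultimate-layer activations extracted from a real pretrained backbone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for real backbone outputs: in practice, run your images through a
# pretrained network and keep the penultimate-layer activations.
rng = np.random.default_rng(7)
n_per_class, dim = 100, 512                 # 512-d, like a ResNet-18 backbone
class_centers = rng.normal(size=(3, dim))   # 3 custom classes
embeddings = np.concatenate(
    [c + 0.5 * rng.normal(size=(n_per_class, dim)) for c in class_centers]
)
labels = np.repeat([0, 1, 2], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.3, random_state=0, stratify=labels
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # the "linear probe"
print(f"Test accuracy of the linear probe: {probe.score(X_te, y_te):.2f}")
```

The backbone is frozen and only the linear head is trained, so the whole experiment fits on a CPU even with a few hundred labeled examples.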

An even more specific version of reusing pretrained models is using models or third-party APIs capable of zero-shot or few-shot performance (meaning they require no, or only a few, training samples to produce a result). A celebrated example of such an API is the GPT family, but there are many APIs available for different tasks. For example, all major cloud vendors have a long list of AI solutions, such as Amazon Rekognition or Google Cloud Vision AI in the computer vision niche; detailed information about them is out of the scope of this book.

Using a third-party API by a major vendor as a baseline has a nice side effect: it is a bargaining chip in negotiations, such as selling a software product to a big enterprise or proving a startup tech is solid to potential investors. Potential customers may not know what a good metric is for a given problem (recall chapter 5), but comparing your tech to an AWS solution frames the problem in a proper way. Arseny knows at least three startups that have used this approach, bragging about how they beat alternatives such as Amazon, Google, and OpenAI. In all three cases, the companies’ claims were legit, and that’s expected, as major vendors aim to tailor one-size-fits-all solutions, while startups can offer more niche ML systems doing one thing just great.

Finally, if none of these options work, you can try to train a simple model from scratch. However, we recommend avoiding recent state-of-the-art models and using something simpler and time proven. Recent models are often more “capricious” during training, while some older “stars” are better researched, and recipes for their stable training are well known. Popular examples of such models are the ResNet family for vision and the BERT family for natural language processing. Our personal heuristic is to start with models that are at least 2 years old, but it is not a strict rule. It depends on the level of innovation required for the system, as discussed in chapter 3.

It’s worth noting that there are multiple shades of fine-tuning between “training a shallow model on top of a pretrained model” and “training a model from scratch.” Choosing the right degree of fine-tuning is very case-specific and may require some experiments. For example, when training a text classification model based on BERT, you can gradually complicate the scope of training:

This variety leads to the question, “How do we choose a proper baseline?”

8.5 Baseline comparison

We have examined various feature and model baselines, starting with the trivial ones. Now we can answer the central question: how do we decide when to stop adding complexity, and how do we determine a suitable baseline for our system? There are multiple factors we should consider simultaneously; some of them are

The most fundamental is the tradeoff between a model’s accuracy (or another ML metric) and the effort it requires. The first component in the equation is accuracy. When you move from a constant baseline to a rule-based baseline, from a rule-based one to a linear model, or from original features to their aggregations and ratios, you already start to get a feeling for what increase in the metric these small changes give. Does the metric respond noticeably or not? How difficult is it to significantly surpass your constant baseline? Is it reasonable to invest more time in attempting to gain more accuracy?

In some sense, as an ML engineer, you do backpropagation by getting “feedback” from your training loop and updating your understanding of the problem with its data and accuracy distribution across the solution space.

The second component is effort. By “effort,” we mostly mean time and computing resources. No ML project has an infinite budget and, hence, an infinite amount of time. We consider the time required to implement a new model (or feature), train it, debug it, and test it. You should also pay attention to all the attendant complications and pitfalls that may arise on the way, especially infrastructural ones.

Let’s examine a constant baseline. It takes almost no time to implement, and it provides us with the lowest accuracy. So we will map it into the (0, 0)-point in time-accuracy coordinates (as shown in figure 8.5).

figure
Figure 8.5 Simple baselines are easy to build but sacrifice final system metrics (example for a time-series prediction).

Let’s take a look at the linear model. It requires more effort but also most likely provides us with better accuracy. We will probably find the corresponding point to the right of and higher than the previous one, and so on. On the other hand, it is important to understand that as the model improves and evolves (and therefore gains in complexity), the cost–accuracy ratio begins to decrease. A striking example of such a drop in efficiency is gradient boosting, which we mentioned earlier. Based on our estimates and experience, gradient boosting requires more effort than all the simpler models you would use at the earlier stages put together, while giving no dramatic increase in accuracy.
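This progression can be seen even on synthetic data: each step of the sequence constant, rule-based, linear improves the metric, and weighing each gain against the implementation effort is exactly the tradeoff discussed above (a sketch with artificial data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 5, size=n)               # a categorical feature
x = rng.normal(size=n)                           # a numeric feature
y = 2.0 * group + 1.5 * x + rng.normal(size=n)   # true relationship
X = np.column_stack([group, x])

# 1) constant baseline: predict the global mean
const = DummyRegressor(strategy="mean").fit(X, y)

# 2) rule-based baseline: mean of y per group
group_means = {g: y[group == g].mean() for g in np.unique(group)}
rule_preds = np.array([group_means[g] for g in group])

# 3) linear model using both features
linear = LinearRegression().fit(X, y)

for name, preds in [
    ("constant", const.predict(X)),
    ("rule-based", rule_preds),
    ("linear", linear.predict(X)),
]:
    print(f"{name:>10}: MAE = {mean_absolute_error(y, preds):.2f}")
```

Each step costs a few more lines of code; in a real project the question is whether the next step (say, gradient boosting) still buys enough metric for its much larger cost.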

We should estimate how long it would take to try a more complex model each time and how much additional accuracy it could provide. Once we understand that the next step requires too much effort for almost no significant score improvement, we should stop. This “early stopping threshold” differs depending on the concrete domain and problem.

But what if something goes wrong or some additional change in the model is required? What model would be easier to debug or update?

Which one would you prefer to face?

Maintenance, which we will touch upon in more detail in chapter 16, includes the amount of additional work that is necessary for debugging implemented features or a model. We could count maintenance as a part of the extra effort the more complex baseline requires.

Another essential property of a baseline is computation time. How does the computation time of our model and its features affect the response time? Does our baseline meet the service-level agreement? Computation time is a natural limit on the solution space, especially when dealing with real-time systems. But even with no real-time requirements, it also determines how fast we will iterate during more thorough experimentation in the future.
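Measuring the per-request latency of candidate baselines is cheap to do early; here is a sketch (absolute numbers are machine-dependent, and the data is synthetic):

```python
import time
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
X_one = X[:1]  # a single "request"

models = {
    "constant": DummyRegressor().fit(X, y),
    "boosting": GradientBoostingRegressor(n_estimators=100).fit(X, y),
}

latency_ms = {}
for name, model in models.items():
    start = time.perf_counter()
    for _ in range(100):                  # repeat to get a stable estimate
        model.predict(X_one)
    latency_ms[name] = (time.perf_counter() - start) / 100 * 1000
    print(f"{name:>8}: ~{latency_ms[name]:.3f} ms per prediction")
```

If the heavier model already strains the latency budget at this toy scale, that is an early warning about the service-level agreement before any serious investment is made.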

Finally, we have interpretability. This parameter matters when we deal with the very first iteration of any ML system, especially for other teammates. When we deal with, for example, sensitive or medical data, it becomes a safety problem, too, not just a question of trust in our model’s predictions. The general pattern is trivial: the simpler the baseline, the easier it is to explain how it works.

We’ll discuss this topic in detail later in chapter 11.

8.6 Design document: Baselines

Since baselines can be part of your design document, we are going to fill in this section for our fictional companies, Supermegaretail and PhotoStock Inc.

8.6.1 Baselines for Supermegaretail

Let’s start with the forecast system. Here, seasonality will be a huge factor when choosing a prediction model, so we must account for it in our design document.

8.6.2 Baselines for PhotoStock Inc.

Now we switch to the PhotoStock Inc. case, where we are building an advanced search engine set to provide better, more accurate results and eventually increase sales.

Summary
