Book: Machine Learning System Design

11 Features and feature engineering

This chapter covers

  • The iterative process of feature engineering
  • Analyzing feature importance
  • Selecting appropriate features for your model
  • Pros and cons of feature stores

It is often said that a mediocre model with great features will outperform a great model with poor features. From our experience, this statement couldn’t be more true. Features are the critical inputs for your system; they drive your algorithms, provide essential patterns for the model, and feed the data it needs to learn and make predictions. Without good features, the model is blind, deaf, and dumb.

While feature engineering plays a less crucial role in systems designed around a deep learning core, no machine learning (ML) practitioner can afford to ignore it. In a sense, fitting fancy multimodal data into a deep learning model, or even crafting a prompt for a large language model, is a specific form of feature engineering; that is why classic feature-related techniques like feature importance analysis are still very relevant.

This chapter explores the art and science of creating effective features. We will cover tools that help determine the most valuable features for the system, the engineering challenges we can face, the factors and tradeoffs we should consider while selecting the right subset of features, and how we can ensure that the selected features are reliable and robust.

11.1 Feature engineering: What are you?

Feature engineering is an iterative process that involves creating and testing new features or transforming existing features to improve the model’s performance. This process requires domain expertise, creativity, and data engineering skills to build new data pipelines for the system. Given its time-consuming and iterative nature, feature engineering often devours a significant portion of resources allocated to modeling.

To secure a fruitful and streamlined modeling process, you should always make sure you assemble an effective feature engineering strategy while designing a system. This plan will become a compass to guide the team through identifying and engineering the most impactful features while minimizing the risk of wasted efforts. By prioritizing iterations in the proper order and charting the course, we can avoid potential pitfalls and ensure our actions add value to the end goal.

Feature engineering in ML is similar to crafting prompt structures in generative models, such as large language models and text-to-image generators. Both features and prompts serve as enhanced inputs that guide the model’s “attention focus” (literally or figuratively) toward the most relevant data aspects.

Speaking of powerful deep learning models: in certain domains, such as audio and image processing, feature engineering used to be a complicated problem. Then the deep learning revolution happened, and its practitioners were delighted because instead of engineering endless, barely reliable features, they could now delegate the work to a deep learning model trained end-to-end. There are even ML practitioners who have never engineered features outside of study projects! This trend may be interpreted as a signal that this chapter can be safely skipped. However, we believe that even deep learning-based pipelines can benefit from feature engineering and related techniques. A great example of that comes from Arseny's experience.

11.1.1 Criteria of good and bad features

Let’s break down some of the feature characteristics as well as the tradeoffs we should keep in mind:

Feature engineering is meant to be a continuous process, as business goals and data distributions change over time. We must constantly evaluate and update our feature set to ensure it remains relevant and effective in solving business problems. When developing features, it is important to track the changes made to each feature, including versioning and mutual dependencies; this keeps the system reproducible and maintainable.

11.1.2 Feature generation 101

With the mentioned criteria and limitations serving as our compass, we are ready to discover common ways of generating new features.

The most obvious way to fetch a new feature is to add a new data source to your data pipeline or use a column that previously was not incorporated into the dataset. This data source can be either internal (e.g., an existing table in the database) or external (e.g., buying data from a third-party provider). On the one hand, these new features are low-hanging fruit with a valuable contribution to the model’s performance. On the other hand, they require most of the data engineering efforts, take a lot of time to manage, and may cause infrastructural problems, as greater complexity always requires more effort for maintenance.

If new sources are not used, there are two alternatives—to transform the existing features or to generate new features based on a combination of two or more of the existing ones.

Transforming numeric features includes scaling, normalization, and applying mathematical functions (e.g., using logarithms to reduce distribution skewness). The type of model dictates whether a transformation is appropriate. For instance, it is common to see no improvement in gradient boosting metrics after applying monotonic transformations to its features, because the core element of the algorithm, a decision tree, is invariant under monotonic transformations of its inputs.
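As a minimal sketch (assuming NumPy and SciPy are available), a log transform can tame a heavily right-skewed feature, such as income-like data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1.0, size=10_000)  # heavily right-skewed

# log1p is safe even if the feature contains zeros
log_income = np.log1p(income)

print(round(skew(income), 2))      # large positive skew
print(round(skew(log_income), 2))  # close to 0 after the transform
```

For a linear model, this transform can help substantially; for a gradient boosting model, as noted above, it usually changes nothing.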

When dealing with time-series data, it’s common to utilize transformations such as lags (shifting the feature’s values backward in time to create new features), aggregates (calculating measures like mean, max, or min over a specific time window), or generating statistical features from past data, such as the standard deviation or variance over different time periods.
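The lag and rolling-window transformations above can be sketched in pandas; the column names and the 3-day window here are illustrative assumptions:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "units": [12, 15, 14, 30, 28, 27, 45, 50],
})

# Lag feature: yesterday's value as a new column
sales["units_lag_1"] = sales["units"].shift(1)

# Rolling aggregates over a 3-day window, shifted by one day
# so the current day's value does not leak into its own feature
sales["units_mean_3d"] = sales["units"].shift(1).rolling(window=3).mean()
sales["units_std_3d"] = sales["units"].shift(1).rolling(window=3).std()

print(sales.tail(3))
```

Note the `shift(1)` before `rolling`: without it, the aggregate would include the current observation, which is a common source of leakage in time-series features.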

Quantile bucketing (or quantization) is a specific case of transformation. It converts continuous features into categorical features by grouping them into discrete buckets based on their values. For example, Uber applies this approach in its DeepETA network (see figure 11.1).

figure
Figure 11.1 An overview of the DeepETA model pipeline: example of combining base feature engineering and a deep learning model

This network employs the transformer architecture to predict the estimated time of arrival, processing a diverse array of tabular data. The data, which includes continuous, categorical, and geospatial features, is all transformed into discrete tokens and subsequently into learnable embeddings suitable for the transformers. You can read more about DeepETA in the paper “DeeprETA: An ETA Post-Processing System at Scale” by Xinyu Hu et al.
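This is not Uber's implementation, but quantile bucketing itself takes only a few lines with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
eta_minutes = pd.Series(rng.exponential(scale=12.0, size=1_000))

# Split the continuous feature into 8 equal-frequency buckets;
# labels=False returns integer bucket ids usable as categorical codes
buckets = pd.qcut(eta_minutes, q=8, labels=False)

print(buckets.value_counts().sort_index())  # each bucket holds ~125 rows
```

Equal-frequency buckets (as opposed to equal-width ones from `pd.cut`) are robust to skewed distributions, which is why they are a popular choice before learning per-bucket embeddings.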

Categorical features often necessitate transformations, which can be accomplished through methods like one-hot encoding, mean target encoding, ordinal encoding (this method ranks categories based on some inherent order), or the hashing trick, which allows handling large-scale categorical data. It is important to note that while being powerful, mean target encoding can easily lead to data leakage if not properly implemented, as it uses information from the target variable to create new features.

For sequential data like text, we can use techniques such as bag-of-words, term frequency-inverse document frequency (TF-IDF), and BM25 to transform the data into a form that can be processed by ML algorithms. It is worth noting that these methods lose information about word order; this disadvantage can be partially addressed by using longer N-grams instead of single words (unigrams). We can also use pretrained language models such as BERT to represent input data in a low-dimensional embedding space, which we can feed to the final model.

Remember that tokens can represent almost any sequential data, not only text. For example, in industries like online retail and media streaming services, we can interpret a user session as a sequence of visited product pages or watched videos. Each visited page gets its own learnable representation (an embedding). Afterward, we can use these embeddings in our recommendation system as a prompt in the “next page prediction” task to get an idea of what product or video the user is looking for.

If we want to use product embeddings in the tabular dataset, one of the common options is to utilize the distances between products. Examples of features here would be

Although these sophisticated features do add to the complexity of the training and inference pipelines, the signals they provide may lead to a major advancement in the model’s performance.

What about merging signals from multiple features into one? When we have multiple features in our dataset, we can combine them to create a feature that is more informative or meaningful for our model. For example, instead of having separate features for the number of clicks and the number of purchases a user has made on an eCommerce site, we can combine them to create a new feature such as “purchase-to-click ratio,” which might be a better indicator of the user’s buying intent.

In the case of a taxi aggregator company, instead of having separate features for the “number of rides” and “total distance traveled,” we could combine them to create a new feature like “average distance per ride,” which might provide more valuable insights into drivers’ and passengers’ behavior.
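Both ratio features take one line of pandas each; the `clip(lower=1)` guard against division by zero is our own assumption rather than a universal convention:

```python
import pandas as pd

users = pd.DataFrame({
    "clicks": [120, 40, 300, 0],
    "purchases": [6, 4, 3, 0],
})

# Combine two raw counters into a single, more informative signal;
# clip the denominator so users with zero clicks don't divide by zero
users["purchase_to_click_ratio"] = users["purchases"] / users["clicks"].clip(lower=1)

print(users)
```

The same pattern yields "average distance per ride" from "total distance traveled" divided by "number of rides".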

We should also consider the relationship between the existing features. For example, absolute product sales for a certain period may provide little information. However, comparing them to sales of other products in the same category or sales in previous periods may reveal valuable patterns or trends. Combining signals from multiple features can create new features that capture more complex relationships in the data and improve the model’s performance.

The technique of combining multiple features is usually referred to as feature interactions or feature cross. This technique is especially important for linear models because such features may unlock the linear separability of data points.
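A classic illustration of that last point: XOR-like data is not linearly separable in its original feature space, but adding the cross x1*x2 makes it separable for a logistic regression (a toy sketch, not a production recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-like data: no straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

plain = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Adding the feature cross x1*x2 unlocks linear separability
X_cross = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
crossed = LogisticRegression(C=1e6, max_iter=1000).fit(X_cross, y).score(X_cross, y)

print(plain, crossed)
```

No linear classifier can exceed 75% accuracy on plain XOR, while the crossed version fits it perfectly, which is exactly why feature crosses matter most for linear models.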

11.1.3 Model predictions as a feature

As we discussed earlier, if a feature depends on another feature, any changes or updates to the latter may require corresponding changes or updates to the former. It creates maintenance/debugging challenges and increases the system’s complexity over time.

Model predictions can be thought of as a specific case of a feature where the output of the model is used as an input to another model or system. This approach is sometimes called model stacking. While using model predictions as features can be powerful and effective, it poses some engineering challenges and risks.

The simplest example of using model predictions as a feature is target encoding. In this approach, a categorical feature is encoded by the mean target value (with a certain degree of regularization) and used as a feature in the model. However, there is a risk of data leakage when the encoding is based on information from the training data that is not available during inference. This can result in overfitting and poor performance on new data if we don't use advanced validation techniques like nested cross-validation (see chapter 7).
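A leakage-aware sketch of smoothed target encoding with out-of-fold statistics; the smoothing constant here is an arbitrary assumption that should be tuned per problem:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c", "a", "c"],
    "target": [1, 0, 1, 1, 1, 0, 1, 0],
})

global_mean = df["target"].mean()
smoothing = 2.0  # regularization strength (assumed; tune per problem)
df["city_te"] = np.nan

# Out-of-fold encoding: each row is encoded using statistics
# computed on the other folds only, so its own target never leaks in
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    stats = df.iloc[train_idx].groupby("city")["target"].agg(["mean", "count"])
    smoothed = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
        stats["count"] + smoothing
    )
    df.loc[df.index[val_idx], "city_te"] = df["city"].iloc[val_idx].map(smoothed)

# Categories unseen in a fold's training part fall back to the global mean
df["city_te"] = df["city_te"].fillna(global_mean)
print(df)
```

At inference time, the encoding is computed once on the full training set; the out-of-fold trick is only needed to produce unbiased training features.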

Another example is using third-party models (e.g., weather forecasts as a feature in a demand prediction model). While weather data can be highly informative, there is a risk that historical forecasts (the forecasts as they were actually made at the time) may not be available for training. In such cases, forecasts with the necessary history are preferable to forecasts with higher precision. Besides, relying on external data sources introduces additional dependencies and risks beyond the ML team's control.

Finally, using third-party or open source models as feature extractors in deep learning systems can pose risks, too. While the generated embeddings can absorb useful patterns in the data, there is a danger of model drift or instability if the external model is updated without proper versioning and vice versa—with no updates, the model may lose its value due to a drift in your data. This can result in unexpected behavior and drastically drop the performance of your ML system.

To mitigate these risks and challenges, it is important to design the feature engineering pipeline carefully and have robust testing and monitoring procedures in place (the former is described in chapters 10 and 13; the latter can be found in chapter 14). This can include using cross-validation and other techniques to prevent leakage, validating external data sources and models, and having processes in place for monitoring and updating features over time.

11.2 Feature importance analysis

Once the initial set of baseline features has been selected for the model, understanding which features affect the model’s predictions the most can provide valuable insights into how the model makes decisions and where further improvements can be made.

ML models can often be seen as black boxes that provide no insight into how they arrive at their predictions. This lack of transparency can be problematic for engineers, stakeholders, or end users who must understand the rationale behind the decisions a given model provides.

In pursuing model transparency, we employ two key concepts: interpretability and explainability. Both are aimed at demystifying the workings of an ML model:

Feature importance analysis serves as a tool for achieving both interpretability and explainability, as it helps pinpoint the features that greatly contribute to the model’s predictions. The results of feature importance analysis are collected as part of the training pipeline artifacts and may play a role in the model verification procedure, which delivers the “to deploy, or not to deploy” verdict to a freshly trained version of the model (you can find more details in chapter 13). A good example here is a system that determines the cost of a trip in a taxi aggregator application, as shown in figure 11.2.

figure
Figure 11.2 An example of a taxi aggregator app’s UI that clarifies why its dynamic pricing algorithm chooses this particular price in this area and during this time of the day

Under the hood, the app works with all the crucial features and analyzes the current live data like traffic density, weather conditions, and so on, when determining the end price. What it also does, however, is provide the rationale behind the suggested price in a convenient and user-friendly form. With this delivery, the user understands why a typically cheap ride they take on a regular basis suddenly goes up in price.

In addition, feature importance analysis can increase trust in our ML system. This is particularly important in high-stakes domains such as medicine and finance. While the General Data Protection Regulation (GDPR) does not strictly enforce explainability, it does suggest a level of transparency in automated decision-making, which could be beneficial or even essential in many ML applications.

Identifying the most important features explains the variables driving the model’s predictions and the reasons behind them. This information can help optimize those features to boost the model’s performance and remove irrelevant or redundant features to improve efficiency. Additionally, it can guide us through debugging by, for example, detecting overfitting or evaluating the usefulness of newly added features.

11.2.1 Classification of methods

Let’s explore methods of feature importance analysis and how they can be applied to improve the transparency and performance of an ML system.

Navigating the terrain of feature importance analysis can be daunting, but having a map of available methods can show us the right direction. These methods can be broadly classified based on their properties, such as type of model, level of model interpretability, and type of utilized features.

Classical ML vs. deep learning

The methods applied for feature importance analysis can differ substantially between classical ML and deep learning models. For classical ML models, where features are often manually selected based on domain knowledge or statistical analysis, determining feature importance is straightforward—we can either directly inspect model weights and decision rules or exclude/modify a separate feature to investigate its contribution to the model’s prediction.

On the other hand, deep learning models, which automatically learn feature representations from data, present unique challenges for importance analysis. Given the complex, nonlinear transformations and the high level of abstraction, there is more involved than simply looking at the model’s parameters to understand feature importance. Instead, we rely on advanced techniques like saliency maps, activation maximization (read more in “Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images” by Aravindh Mahendran et al.), or layer-wise relevance propagation (read more in “Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers” by Alexander Binder et al.) and its successors to make sense of what is happening inside neural networks. Please bear in mind that this list of examples is not exhaustive, and none of them is truly universal, as the problem of explainable deep learning is not solved in general and remains an active research area.

Model specific vs. model agnostic

Model-specific methods use the structure and parameters of the model to estimate feature importance. For example, in tree-based models, we can count how many times a particular feature was used for splitting during training, or sum the total gain it contributed across all splits. Similarly, for linear models we can look at the magnitude and sign of the coefficient assigned to each feature.
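For instance, scikit-learn's gradient boosting exposes gain-based importances directly; here is a sketch on synthetic data where only the first three features carry signal:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False, the 3 informative
# features occupy the first 3 of 10 columns
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Model-specific importance: normalized total impurity gain over all splits
for i, imp in enumerate(model.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```

The importances sum to one, and the bulk of the mass should land on the informative columns; the noise columns receive only residual credit.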

In their turn, model-agnostic methods treat a model as a black box. They often involve perturbing the input data and observing the effect on the model’s output. Examples of model-agnostic methods include

Individual prediction vs. entire model interpretation

Another important distinction is whether the methods are designed for individual predictions or for interpreting the entire model (see figure 11.3). Methods that focus on individual predictions estimate the importance of features for a particular input, delving into why the model has made a particular decision. On the other hand, methods that interpret the entire model estimate the importance of features in a more general sense, elaborating on the overall behavior of the model.

Some examples of methods that focus on individual predictions include local interpretable model-agnostic explanations (LIME), which approximates the decision boundary around a particular input using a simpler, more interpretable model (e.g., linear regression); see the paper “Why Should I Trust You?” by Marco Tulio Ribeiro et al. Another is anchor explanations, which identify a rule that sufficiently “anchors” a decision, making it interpretable by humans; learn more in the paper “Anchors: High-Precision Model-Agnostic Explanations” by Marco Tulio Ribeiro et al.

figure
Figure 11.3 A taxonomy of model explanation methods

Often we use a combination of methods to mitigate the limitations of individual approaches and gain a more complete understanding of the model. Keep in mind, though, that no one-size-fits-all method can provide a definitive answer to all feature importance questions, and the choice of methods should be tailored to the specific problem and context.
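As one concrete model-agnostic technique that interprets the entire model, permutation importance shuffles one column at a time on held-out data and measures the drop in score (a sketch on synthetic data where only the first two features are informative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False, the 2 informative features are the first 2 columns
X, y = make_regression(n_samples=600, n_features=5, n_informative=2,
                       shuffle=False, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and measure the drop in validation score;
# the model is treated strictly as a black box
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```

Because it only needs predict-and-score access, the same code works unchanged for any model class, which is precisely what "model agnostic" means.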

11.2.2 Accuracy–interpretability tradeoff

Highly interpretable models may sacrifice accuracy in favor of transparency, and vice versa—models that achieve high accuracy often do so at the cost of interpretability (see figure 11.4). Modern large language models based on transformers, such as GPT, provide a vivid example. They have revolutionized the field of ML by achieving state-of-the-art performance in a wide range of natural language processing tasks. However, they are also often highly complex, with billions of parameters, making it difficult to understand how they arrive at their decisions.

figure
Figure 11.4 The more sophisticated the model we use (and therefore, usually the more accurate), the less explainable it becomes.

The accuracy–interpretability tradeoff remains a challenging problem as new unexplored architectures arrive. The choice of a method should be tailored to the specific problem and context, considering factors such as the importance of interpretability, the complexity of the model, and the desired level of accuracy.

11.2.3 Feature importance in deep learning

Feature importance analysis for models that work with tabular data is a well-understood problem with clear solutions: we have easily separable features and well-known tools to measure how each of them influences the model, the target variable, or the final metric.

However, in the context of deep learning, especially with data types such as images, audio, or text, feature importance can become less clear and more challenging. Deep learning models, by nature, automatically learn hierarchical representations from the data, often in a highly abstract and nonlinear manner. In these cases, a “feature” can refer to anything from a single pixel in an image to a single word or character in a text or a specific frequency in an audio signal to complex attributes, like the location of an object in an image, the sentiment of a sentence in a text, or a specific sound pattern in a voice record.

Despite that, it is still possible to gain insights into what the model considers important in raw input data and which patterns it pays attention to. Let’s explore a few techniques for feature importance analysis in deep learning:

figure
Figure 11.5 Examples of output saliency maps generated by different methods
figure
Figure 11.6 Visualization of an attention head in the Encoder-Decoder Transformer

As you can see, we start to observe the parallels with classical ML. Although deep learning models present unique challenges in feature importance analysis, there are still methods that can provide insights into how the model makes decisions and which patterns are the essential predictors for the target variable.

11.3 Feature selection

Perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.
— Antoine de Saint-Exupéry

In the previous sections, we’ve learned about the art of feature engineering and how to transform raw data into meaningful features. However, not all features are equally useful; some may be irrelevant, redundant, or too complex for our model to handle effectively.

This is where feature selection comes in. By carefully selecting the most informative features, we can improve our system’s performance and interpretability while reducing its complexity and training time. We will explore the techniques, best practices, and potential pitfalls of feature selection and learn how to choose the right features for our specific ML problem.

11.3.1 Feature generation vs. feature selection

The feature generation and feature selection processes in ML can be compared to gardening. Similar to gardeners who plant various seeds in the soil, we generate a range of features, explore new data sources, experiment with different feature transformations, and brainstorm new ideas that might improve the model’s performance.

However, just as not all plants in the garden will thrive, not all features will benefit the model, and at some point, we will have to prune away dead or unproductive plants (in our case, discard irrelevant or redundant features) to sustain healthy growth. This cycle of nurturing and pruning, of adding and reducing, is a constant in the life of ML systems as we continually refine and improve our feature sets.

The ancient Greek philosopher Heraclitus once said, “Opposition brings concord. Out of discord comes the fairest harmony.” This also holds true in ML, where we achieve optimal performance by keeping the balance between generating new features and carefully selecting the most informative ones.

11.3.2 Goals and possible drawbacks

You may ask, “Okay, but why care so much about feature selection in the first place?” There are certain benefits to it:

For convenience, we have gathered all the benefits of feature selection into a single scheme in figure 11.7.

figure
Figure 11.7 Reasons for feature selection

In real-time applications, the need for speed often takes precedence, even if it means compromising the model’s accuracy to fulfill SLAs. For instance, in speech recognition systems like those used in virtual assistants, users expect instant and accurate transcription of their spoken words into text. Even the most minor delays could disrupt user experience and make the system appear less efficient.

Perfect personalization becomes worthless if it slows prediction down by 300 ms and degrades the user experience. Therefore, lightweight personalization with moderate quality is more appropriate than a model that accumulates all possible inputs from a user but makes them wait.

There are also potential risks and drawbacks of feature selection besides the balancing between computational time and accuracy:

Besides these problems, if done regularly, feature selection adds a computationally intensive stage to the training pipeline that we should also consider, especially when greedy wrapper methods are used.

11.3.3 Feature selection method overview

There are various methods available for feature selection, each with its pros and cons (see figure 11.8). The most common approaches are filter, wrapper, and embedded methods. Let’s take a closer look at each of the three.

figure
Figure 11.8 Families of feature selection methods

Filter methods work by filtering features independently from the model, using simple ranking rules based on the statistical properties of a single feature (univariate methods) or the correlation with other features (multivariate methods). These methods are easily scalable (even for high-dimensional data) and perform quick feature selection before the primary task.

In univariate filter methods, features are ranked by their intrinsic properties, such as variance, granularity, consistency, or correlation with the target. Afterward, we keep the top-N features as our subset and either fit the model or apply more advanced, computationally intensive feature selection methods as a second selection layer.

In multivariate methods, we analyze features in comparison with each other (e.g., by estimating their rank correlation or mutual information). If a pair of features represent similar information, one of them can be omitted without affecting the model’s performance. For example, the feature interaction score (regardless of the way it is measured) can be incorporated into an automatic report. When the score is high, it triggers a warning for potential reduction in the model’s performance before the training begins.
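A minimal multivariate filter can drop one feature from each highly correlated pair; the 0.95 threshold and the toy "height in two units" pair are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, 500)})
# A near-duplicate feature: the same quantity in inches plus tiny noise
df["height_in"] = df["height_cm"] / 2.54 + rng.normal(0, 0.1, 500)
df["weight_kg"] = rng.normal(70, 12, 500)

corr = df.corr().abs()

# Keep only the upper triangle so each pair is examined once,
# then drop one feature from every pair above the threshold
threshold = 0.95
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)  # ['height_in']
```

The same loop can feed an automatic report: a non-empty `to_drop` list is exactly the kind of warning described above, raised before training begins.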

Wrapper methods search for the feature subset that most improves the model's results as measured by a chosen metric. We call them wrappers because the learning algorithm is literally “wrapped” by these methods. They also require designing the right validation schema nested into outer validation (to choose the right validation schema, see chapter 7).

Wrapper algorithms include sequential algorithms and evolutionary algorithms. Examples of sequential algorithms are the sequential forward selection, where features are included one by one starting from the empty set; SBE (sequential backward elimination), where features are excluded one by one; and their hybrids—floating versions when we allow inclusion of an excluded feature, and vice versa. In evolutionary algorithms, we stochastically sample subsets of features for consideration, effectively “jumping” through the feature space. A common example of an evolutionary algorithm is to run a variation of differential evolution in a binary feature mask space where “1” indicates an included feature and “0” denotes an excluded one.

The main disadvantage of these methods is that all of them are computationally intensive and often tend to converge to local optima. Despite that, they provide the most accurate evaluation of how the subset affects the target metric. Use them carefully, especially if your hardware specs are limited.
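Sequential forward selection is available off the shelf in scikit-learn; the choice of four features and a linear model here is purely illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Greedily add the feature that improves the CV score the most,
# starting from the empty set, until 4 features are selected
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

Switching `direction` to `"backward"` gives sequential backward elimination; note that even this toy run refits the model dozens of times, which previews the computational cost discussed above.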

With embedded methods, we use an additional “embedded” model (which may or may not be of the same class as our primary model) and make decisions based on its feature importance. A good example is the Lasso regression, due to the ability of the L1-regularization to turn the coefficients to zero if they are not relevant to the target variable, as shown in figure 11.9.

figure
Figure 11.9 Lasso regression eliminates features one by one by reducing their coefficients to zero as L1 regularization term grows.

Another widely used feature selection algorithm is recursive feature elimination (RFE), which consists of training and removing the worst K features on each step based on the embedded model’s feature importances.

Embedded methods lie somewhere between filter methods and wrapper methods as far as the required computation and selection quality are concerned.

When it comes down to choosing the methods to start with, we prefer rough cutoffs for the initial reduction of the number of features; a Lasso selector or RFE can follow, depending on which one outputs a more meaningful subset. Computationally expensive methods may perform better, but faster methods tend to be good enough for initial feature pruning, especially if you suspect that some features are total garbage.

There are plenty of simple yet useful methods that serve as reasonable feature selection baselines. For example, we can take a feature, shuffle its values, concatenate it to the initial dataset, and train a new model. If the importance of a given feature is less than the importance of this random feature (often called a shadow feature), it is likely irrelevant to the problem. We can label this algorithm as a trivial instance of the wrapper methods.
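The shadow-feature baseline described above can be sketched as follows (thresholding against a single shadow copy is a simplification; methods like Boruta use many shadow copies and statistical tests):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, columns 0-2 are informative, 3-5 are pure noise
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rng = np.random.default_rng(0)
# Shadow feature: a copy of column 0 with its values shuffled,
# destroying any real relationship with the target
shadow = rng.permutation(X[:, 0]).reshape(-1, 1)
X_aug = np.hstack([X, shadow])

model = RandomForestClassifier(random_state=0).fit(X_aug, y)
importances = model.feature_importances_

# Any feature weaker than the shadow is a candidate for removal
shadow_importance = importances[-1]
weak = [i for i in range(X.shape[1]) if importances[i] < shadow_importance]
print(weak)  # likely some of the pure-noise columns 3-5
```

Despite its simplicity, this baseline is hard to beat as a sanity check: a feature that loses to shuffled noise rarely deserves a place in the model.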

11.4 Feature store

Now we have arrived at a powerful design pattern encompassing many techniques mentioned in the chapter into a single entity—a feature store. It enables teams to calculate, store, aggregate, test, document, and monitor features for their ML pipelines in a centralized hub.

Imagine investing weeks of effort into engineering sophisticated features only to stumble upon a conversation with a colleague during a coffee break where you discover that another team has already implemented and tested exactly the same features. Alternatively, a colleague approaches you, seeking an implementation of a specific feature. While you confirm its existence, you realize that your code lacks proper documentation and reusability due to heavy dependencies on other code in your repository. Consequently, your colleague decides that an easier way is to reinvent the wheel and develop their own implementation. This inconsistency may lead to each engineer implementing and calculating each feature on their own, causing the company to waste an unacceptably large amount of resources (see figure 11.10).

figure
Figure 11.10 Not having a feature store in place can lead to inconsistency and excessive spending.

These scenarios illustrate only a tiny fraction of the challenges that can be addressed by using a feature store. By adopting it, we step away from a fragmented approach where each team independently implements and calculates features. Instead, we embrace a unified system that maximizes the reusability of features, as depicted in figure 11.11.

figure
Figure 11.11 A feature store that maximizes the reusability of features implemented in a unified manner

11.4.1 Feature store: Pros and cons

Designing, building, and managing a feature store may present certain challenges, but its benefits could far outweigh the drawbacks. Let’s explore some of the advantages of having a feature store (figure 11.12 illustrates those ML areas that benefit the most from having feature stores):

figure
Figure 11.12 A landscape of ML problems with various needs for a feature store

The disadvantages of having a feature store are straightforward:

Let’s focus in detail on the last disadvantage. Not all ML problems can be optimized through a feature store. A typical beneficial area for having a feature store is mainly tabular data (structured data) with multiple data sources of various granularity and SLA. Most pure deep learning problems (typically those having one or two sources of unstructured data, like text or images) are less suitable for feature stores. However, given that multimodal data usage is becoming more popular these days, the concept of a feature store has become more universal. Imagine a typical online marketplace; years ago, ML systems would be based on tabular data like the history of sales and clicks, whereas now it is a common pattern to include items’ images, descriptions, and user reviews. The simplest way to include such data is to extract embeddings via a pretrained neural network (e.g., CLIP for images or SentenceTransformers for short texts) and treat them as features. This approach closes the loop: such features can be stored in a feature store as well for the same reasons as “classic” features, thus saving processing time and ensuring consistency across the system. As a bonus, storing such features in a centralized storage unlocks additional usage patterns. For example, using a vector database (such as Qdrant or Faiss) for storage allows you to fetch similar items quickly and use them in downstream models.
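A minimal sketch of this embeddings-as-features pattern follows. The hash-based `embed()` function is a stand-in for a real pretrained encoder (CLIP for images, SentenceTransformers for text), and a plain dictionary stands in for the feature store; both are assumptions made only to keep the sketch self-contained:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real pretrained encoder (e.g., SentenceTransformers).
    # A deterministic hash-seeded vector keeps this sketch runnable.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=dim)

# Toy "feature store" table keyed by item id: embeddings are written
# once and reused by every downstream model.
feature_store: dict[str, np.ndarray] = {}

catalog = {"item_1": "red cotton t-shirt", "item_2": "blue denim jacket"}
for item_id, description in catalog.items():
    feature_store[item_id] = embed(description)

# Downstream usage: cosine similarity between stored item embeddings,
# the kind of lookup a vector database would serve at scale.
a, b = feature_store["item_1"], feature_store["item_2"]
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print("similarity:", similarity)
```

With a real encoder, the write path would typically run as a batch job over new items, while the similarity lookup would be delegated to a vector database such as Qdrant or Faiss.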

The best way to start is to analyze the existing extract, transform, and load pipelines of all teams. The following are questions you should be ready to ask:

If we conclude that numerous teams would benefit from having a feature store, it’s a sign to invest efforts into designing and building a centralized feature store.

Here we would like to stress once again that building your own custom feature store is a huge and expensive project. That’s one of those cases when you should consider reusing an open source solution or a product from a third-party vendor. Some popular options are Feast, Tecton, Databricks Feature Store, and AWS SageMaker Feature Store.

11.4.2 Desired properties of a feature store

In this section, we will touch on useful patterns and properties in designing feature stores and highlight important problems you must address. Not all of them will necessarily apply to your feature store in particular, but our goal is to make sure each is well covered in the book.

Read-write skew

Writing and reading are the two essential sides of any feature management system, so during the design stage, we need to know the expected load in terms of read and write operations (and the amount of data involved). What read latency is critical for us? How often should we recalculate existing features? Sometimes calculating a feature at runtime is faster than, or comparable to, fetching one from storage.

Writing is commonly done in batches. We prefer not to recompute a whole dataset when we can update it incrementally, although full recomputation is a widely used antipattern. Moreover, recomputing the features for the last few days helps us overwrite temporary corruption or unavailability on the data warehouse side that may have occurred just before our daily feature update. It is worth noting that “commonly” does not mean “uniformly”: features can be appended or updated on different schedules. For example, some are computed by a long daily job, and some are lightweight enough to be streamed in near real time.

The critical aspect of reading features is usually latency. We must ensure that the infrastructure we are building for our feature store meets our nonfunctional requirements. Sometimes we can combine precomputed features with real-time features (those that require the most recent events) during read operations, as shown in figure 11.13.
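The combined read path can be sketched as follows; the stores, user ids, and feature names are toy stand-ins for what would be an online key-value store plus a streaming source in production:

```python
from datetime import datetime, timezone

# Precomputed (batch) features, refreshed by a nightly job.
batch_features = {
    "user_42": {"purchases_30d": 7, "avg_basket_30d": 31.5},
}

# Recent raw events that the nightly job has not seen yet.
recent_events = {"user_42": [{"type": "click", "ts": datetime.now(timezone.utc)}]}

def realtime_features(user_id: str) -> dict:
    # Lightweight features computed on the fly from the latest events.
    events = recent_events.get(user_id, [])
    return {"clicks_since_last_batch": sum(e["type"] == "click" for e in events)}

def read_features(user_id: str) -> dict:
    # The consumer sees one flat feature vector, regardless of whether a
    # feature came from the batch store or was computed just now.
    return {**batch_features.get(user_id, {}), **realtime_features(user_id)}

print(read_features("user_42"))
```

The key design point is that the consumer’s interface stays the same whichever path a feature took to arrive.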

figure
Figure 11.13 Online features and batch features are written in a feature store in different ways but are consumed in the same manner.

Precalculation

The DRY principle in programming stands for “don’t repeat yourself.” This principle leads us to the fundamental heuristic behind any optimization: if it’s possible to avoid computing something, it should be avoided.

In particular, one of the most straightforward patterns is calculating features in advance rather than at the moment we ask the feature store to assemble a dataset. For example, a good time to update features is right after our database finishes processing orders for the previous day.

A closely related optimization technique is to split the calculation into multiple steps:

  1. We preaggregate raw data (e.g., clicks, prices, revenue) into item-day sums.
  2. We aggregate these sums into desired windows (e.g., 7/14/30/60 days).

This approach helps us reuse features calculated yesterday or N months ago (instead of running almost the same computation every day) and merge the computation of similar features that share part of their lineage or have overlapping aggregation windows, as shown in figure 11.14.
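The two-step aggregation above can be sketched with pandas; the data, column names, and window sizes are illustrative:

```python
import pandas as pd

# Raw events: one row per sale.
raw = pd.DataFrame({
    "item_id": ["a", "a", "b", "a", "b"],
    "day": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01",
                           "2024-01-02", "2024-01-02"]),
    "revenue": [10.0, 5.0, 7.0, 3.0, 4.0],
})

# Step 1: preaggregate raw events into item-day sums (computed once,
# stored, and reused by every window below).
item_day = raw.groupby(["item_id", "day"], as_index=False)["revenue"].sum()

# Step 2: roll the item-day sums into the desired windows, reusing the
# same level-1 table instead of rescanning the raw events each time.
windows = {}
for days in (7, 30):
    windows[days] = (item_day.set_index("day")
                             .groupby("item_id")["revenue"]
                             .rolling(f"{days}D").sum()
                             .rename(f"revenue_{days}d"))

print(windows[7].reset_index())
```

Tomorrow’s update only needs to append one new item-day row per item and re-roll the windows, rather than re-reading the full event history.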

figure
Figure 11.14 A hierarchy of aggregations used in a feature store

Feature versioning

A good rule of thumb is to treat each update of a feature either as a new feature (not the best solution: it will generate many near-duplicate features after each tiny optimization and clutter your code) or as a new version of the old feature. But why is this important at all?

Suppose some engineer implements a new version of a feature, and your system treats it as the same feature as before, with no differentiation. You’ll be lucky if the calculation is exactly the same but, say, done faster. But if the calculation principle changes (even a bit), or worse, if the engineer changes the data source for the same feature, it leads to inconsistency in the precalculated feature. The values of the feature before and after the update can differ significantly, and you don’t want to mix values from the old and new calculation methods. A well-designed feature store will automatically overwrite the updated feature or, even better, write it to a new table while backfilling all previous values over the available history.

Each dataset should capture in its metadata not only which features it contains for a given range of timestamps but also the versions of those features at the time of calculation. This allows us to easily roll back to an old version of a dataset and completely reproduce the results of an old model that was developed, for example, 2 months ago. The pattern is similar to freezing library versions for an application.
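A minimal sketch of such metadata, using hypothetical dataclasses (the real schema will depend on your store):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int   # bumped on every change to the calculation logic
    source: str    # where the feature is computed from

@dataclass
class DatasetMeta:
    start: str
    end: str
    features: list[FeatureVersion] = field(default_factory=list)

# The dataset "freezes" the exact feature versions used to build it,
# much like pinning library versions for an application.
meta = DatasetMeta(
    start="2024-01-01", end="2024-03-01",
    features=[
        FeatureVersion("purchases_30d", version=2, source="orders"),
        FeatureVersion("avg_basket_30d", version=1, source="orders"),
    ],
)

# Two months later, reproducing the old model means requesting
# exactly these (name, version) pairs from the feature store.
pinned = {(f.name, f.version) for f in meta.features}
print(pinned)
```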

Feature dependencies or feature hierarchy

Not all features are easy to compute from the raw data in our data warehouse. Computing them directly may require expensive queries and, again, does not reuse the results of previous computations. This leads us to the concept of feature dependencies, or feature hierarchy, where each feature depends on other features and/or data sources.

A pattern we discussed earlier, preaggregation, can be considered a parent feature for the final features. We highlight these parent features (level 1 features) along with their child features (level 2 features, etc.) in figure 11.15.

figure
Figure 11.15 A graph of feature dependencies

The way we obtain each feature from its data sources is called its lineage, which is effectively a directed acyclic graph (we discussed them in chapter 6). We track the lineage of each feature to know the order in which to run feature calculations and to determine whether child features need updating after a change in one of their parents’ implementations (which triggers a wave of feature version updates) or after any other parent version update, corruption, or deletion.

Lineage tracking also helps engineers and analysts rapidly explore the source of each feature, thus simplifying debugging and improving their understanding of the origin of outliers or other surprising behavior.
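With Python’s standard `graphlib`, the mechanics can be sketched as follows; the feature names form a toy DAG chosen for illustration:

```python
from graphlib import TopologicalSorter

# Feature lineage: each feature lists the features/sources it depends on.
lineage = {
    "orders_raw": [],
    "item_day_revenue": ["orders_raw"],         # level-1 preaggregation
    "revenue_7d": ["item_day_revenue"],         # level-2 windows
    "revenue_30d": ["item_day_revenue"],
    "revenue_ratio_7_30": ["revenue_7d", "revenue_30d"],
}

# The order to run feature calculations in: a topological sort of the DAG.
order = list(TopologicalSorter(lineage).static_order())
print("computation order:", order)

def descendants(node: str) -> set[str]:
    # Everything downstream of `node` -- these features need a version
    # bump and recomputation if `node`'s implementation changes.
    out: set[str] = set()
    frontier = [node]
    while frontier:
        cur = frontier.pop()
        for feat, deps in lineage.items():
            if cur in deps and feat not in out:
                out.add(feat)
                frontier.append(feat)
    return out

print("changing item_day_revenue invalidates:", descendants("item_day_revenue"))
```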

Feature bundles

We often apply the same transforms and filters to closely related columns from the same data sources (e.g., price before discount, price after discount, and price after applying promo). These similar features have the same key and the same lineage.

It means there is no need to work with features as isolated columns and write a separate data pipeline for each. We naturally prefer to implement our feature store in such a way that it consolidates computations for features derived from the same data sources. Thus, a single entity of computations would be a batch of features, not a single one:

<day, user_id, item_id, f1, f2, f3, f1 / f2, …>

Having merged their computational graphs, we also prefer to operate on such sets of similar features as a whole in the API (or UI), adding them to a dataset simultaneously.
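A sketch of such a bundle, computed in a single pass over the shared source (the data, column names, and function name are illustrative):

```python
import pandas as pd

# Raw source shared by the whole bundle of price features.
events = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-01"],
    "user_id": [1, 2],
    "item_id": ["a", "b"],
    "price_before_discount": [100.0, 50.0],
    "price_after_discount": [80.0, 45.0],
})

def price_bundle(df: pd.DataFrame) -> pd.DataFrame:
    # One pipeline computes the whole bundle: shared key, shared lineage,
    # and derived features (here a price ratio) in a single pass.
    out = df[["day", "user_id", "item_id"]].copy()
    out["f1"] = df["price_before_discount"]
    out["f2"] = df["price_after_discount"]
    out["f1_div_f2"] = out["f1"] / out["f2"]
    return out

bundle = price_bundle(events)
print(bundle)
```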

11.4.3 Feature catalog

Speaking of the UI, the final secret ingredient of a feature store is a feature catalog. A feature catalog is a service with a web UI where ML engineers, analysts, or even your nontechnical colleagues can search for features and examine their implementation details.

Other information that can be shown to users includes feature importance, value distribution, category, owner, update schedule (daily, hourly), key (user, item, user-item, item-day, category-day), feature lineage, the ML services that consume the feature, and other meta info.

11.5 Design document: Feature engineering

As we mentioned in the introduction to this chapter, features are the backbone of your ML system’s prediction ability, and for this reason alone, they deserve their spot in the design document. We will cover them in both our design documents.

11.5.1 Features for Supermegaretail

After building our baseline solution, we need to determine the next steps for improving it. One of the primary ways to do this is to craft features that will help the model extract useful patterns and relationships from the raw data.

11.5.2 Features for PhotoStock Inc.

A potential set of features for PhotoStock Inc. will be completely different from that for Supermegaretail.

Summary
