Book: Machine Learning System Design

11 Features and feature engineering

This chapter covers

  • The iterative process of feature engineering
  • Analyzing feature importance
  • Selecting appropriate features for your model
  • Pros and cons of feature stores

It is often said that a mediocre model with great features will outperform a great model with poor features. From our experience, this statement couldn’t be more true. Features are the critical inputs for your system; they drive your algorithms, provide essential patterns for the model, and feed the data it needs to learn and make predictions. Without good features, the model is blind, deaf, and dumb.

While feature engineering plays a less crucial role in systems designed around a deep learning core, no machine learning (ML) practitioner can afford to ignore it. In a sense, fitting fancy multimodal data into a deep learning model, or even crafting a prompt for a large language model, is a specific form of feature engineering; that is why classic feature-related techniques like feature importance analysis are still very relevant.

This chapter explores the art and science of creating effective features. We will cover tools that help determine the most valuable features for the system, the engineering challenges we can face, the factors and tradeoffs we should consider while selecting the right subset of features, and how we can ensure that the selected features are reliable and robust.

11.1 Feature engineering: What are you?

Feature engineering is an iterative process that involves creating and testing new features or transforming existing features to improve the model’s performance. This process requires domain expertise, creativity, and data engineering skills to build new data pipelines for the system. Given its time-consuming and iterative nature, feature engineering often devours a significant portion of resources allocated to modeling.

To secure a fruitful and streamlined modeling process, you should always make sure you assemble an effective feature engineering strategy while designing a system. This plan will become a compass to guide the team through identifying and engineering the most impactful features while minimizing the risk of wasted efforts. By prioritizing iterations in the proper order and charting the course, we can avoid potential pitfalls and ensure our actions add value to the end goal.

Feature engineering in ML is similar to crafting prompt structures in generative models, such as large language models and text-to-image generators. Both features and prompts serve as enhanced inputs that guide the model’s “attention focus” (literally or figuratively) toward the most relevant data aspects.

Speaking of powerful deep learning models: in certain domains, such as audio and image processing, feature engineering used to be a complicated problem. Then the deep learning revolution happened, and its practitioners were delighted because instead of engineering endless, barely reliable features, they could now delegate the work to a deep learning model trained end-to-end. There are even ML practitioners who have never engineered features outside of study projects! This trend may be interpreted as a signal that this chapter can be safely skipped. However, we believe that even deep learning-based pipelines can benefit from feature engineering and related techniques. A great example of that comes from Arseny's experience.

11.1.1 Criteria of good and bad features

Let’s break down some of the feature characteristics as well as the tradeoffs we should keep in mind:

Feature engineering is meant to be a continuous process, as business goals and data distributions change over time. We must constantly evaluate and update our feature set to ensure it remains relevant and effective in solving business problems. When developing features, it is important to track the changes made to each feature, including versioning and mutual dependencies; this keeps the system reproducible and maintainable.

11.1.2 Feature generation 101

With the mentioned criteria and limitations serving as our compass, we are ready to discover common ways of generating new features.

The most obvious way to fetch a new feature is to add a new data source to your data pipeline or use a column that previously was not incorporated into the dataset. This data source can be either internal (e.g., an existing table in the database) or external (e.g., buying data from a third-party provider). On the one hand, these new features are low-hanging fruit with a valuable contribution to the model’s performance. On the other hand, they require most of the data engineering efforts, take a lot of time to manage, and may cause infrastructural problems, as greater complexity always requires more effort for maintenance.

If new sources are not used, there are two alternatives—to transform the existing features or to generate new features based on a combination of two or more of the existing ones.

Transforming numeric features includes scaling, normalization, and applying mathematical functions (e.g., using logarithms to reduce distribution skewness). The type of model dictates whether a transformation is appropriate. For instance, it is common to see no improvement in gradient boosting metrics after applying monotonic transformations to its features, because the core element of the algorithm, a decision tree, is invariant under monotonic transformations of its inputs.
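As a minimal sketch (assuming NumPy and SciPy are available), a log transform can tame a heavily right-skewed feature, such as income-like data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1.0, size=10_000)  # heavily right-skewed

# log1p is safe even if the feature contains zeros
log_income = np.log1p(income)

print(round(skew(income), 2))      # large positive skew
print(round(skew(log_income), 2))  # close to 0 after the transform
```

For a linear model, this transform can help substantially; for a gradient boosting model, as noted above, it usually changes nothing.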

When dealing with time-series data, it’s common to utilize transformations such as lags (shifting the feature’s values backward in time to create new features), aggregates (calculating measures like mean, max, or min over a specific time window), or generating statistical features from past data, such as the standard deviation or variance over different time periods.
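The lag and rolling-window transformations above can be sketched in pandas; the column names and the 3-day window here are illustrative assumptions:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "units": [12, 15, 14, 30, 28, 27, 45, 50],
})

# Lag feature: yesterday's value as a new column
sales["units_lag_1"] = sales["units"].shift(1)

# Rolling aggregates over a 3-day window, shifted by one day
# so the current day's value does not leak into its own feature
sales["units_mean_3d"] = sales["units"].shift(1).rolling(window=3).mean()
sales["units_std_3d"] = sales["units"].shift(1).rolling(window=3).std()

print(sales.tail(3))
```

Note the `shift(1)` before `rolling`: without it, the aggregate would include the current observation, which is a common source of leakage in time-series features.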

Quantile bucketing (or quantization) is a specific case of transformation. It converts continuous features into categorical features by grouping them into discrete buckets based on their values. For example, Uber applies this approach in its DeepETA network (see figure 11.1).

figure
Figure 11.1 An overview of the DeepETA model pipeline: example of combining base feature engineering and a deep learning model

This network employs the transformer architecture to predict the estimated time of arrival, processing a diverse array of tabular data. The data, which includes continuous, categorical, and geospatial features, is all transformed into discrete tokens and subsequently into learnable embeddings suitable for the transformers. You can read more about DeepETA in the paper “DeeprETA: An ETA Post-Processing System at Scale” by Xinyu Hu et al.
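This is not Uber's implementation, but quantile bucketing itself takes only a few lines with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
eta_minutes = pd.Series(rng.exponential(scale=12.0, size=1_000))

# Split the continuous feature into 8 equal-frequency buckets;
# labels=False returns integer bucket ids usable as categorical codes
buckets = pd.qcut(eta_minutes, q=8, labels=False)

print(buckets.value_counts().sort_index())  # each bucket holds ~125 rows
```

Equal-frequency buckets (as opposed to equal-width ones from `pd.cut`) are robust to skewed distributions, which is why they are a popular choice before learning per-bucket embeddings.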

Categorical features often necessitate transformations, which can be accomplished through methods like one-hot encoding, mean target encoding, ordinal encoding (this method ranks categories based on some inherent order), or the hashing trick, which allows handling large-scale categorical data. It is important to note that while being powerful, mean target encoding can easily lead to data leakage if not properly implemented, as it uses information from the target variable to create new features.

For sequential data like text, we can use techniques such as bag-of-words, term frequency-inverse document frequency (TF-IDF), and BM25 to transform the data into a form that can be processed by ML algorithms. It is worth noting that these methods lose information about word order; this disadvantage can be partially addressed by using longer N-grams instead of single words (unigrams). We can also use pretrained language models such as BERT to represent input data in a low-dimensional embedding space, which we can feed to the final model.

Remember that tokens can represent almost any sequential data, not only text. For example, in industries like online retail and media streaming services, we can interpret a user session as a sequence of visited product pages or watched videos. Each visited page gets its own learnable representation (an embedding). Afterward, we can use these embeddings in our recommendation system as a prompt in the “next page prediction” task to get an idea of what product or video the user is looking for.

If we want to use product embeddings in the tabular dataset, one of the common options is to utilize the distances between products. Examples of features here would be

Although these sophisticated features do add to the complexity of the training and inference pipelines, the signals they provide may lead to a major advancement in the model’s performance.

What about merging signals from multiple features into one? When we have multiple features in our dataset, we can combine them to create a feature that is more informative or meaningful for our model. For example, instead of having separate features for the number of clicks and the number of purchases a user has made on an eCommerce site, we can combine them to create a new feature such as “purchase-to-click ratio,” which might be a better indicator of the user’s buying intent.

In the case of a taxi aggregator company, instead of having separate features for the “number of rides” and “total distance traveled,” we could combine them to create a new feature like “average distance per ride,” which might provide more valuable insights into drivers’ and passengers’ behavior.
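Both ratio features take one line of pandas each; the `clip(lower=1)` guard against division by zero is our own assumption rather than a universal convention:

```python
import pandas as pd

users = pd.DataFrame({
    "clicks": [120, 40, 300, 0],
    "purchases": [6, 4, 3, 0],
})

# Combine two raw counters into a single, more informative signal;
# clip the denominator so users with zero clicks don't divide by zero
users["purchase_to_click_ratio"] = users["purchases"] / users["clicks"].clip(lower=1)

print(users)
```

The same pattern yields "average distance per ride" from "total distance traveled" divided by "number of rides".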

We should also consider the relationship between the existing features. For example, absolute product sales for a certain period may provide little information. However, comparing them to sales of other products in the same category or sales in previous periods may reveal valuable patterns or trends. Combining signals from multiple features can create new features that capture more complex relationships in the data and improve the model’s performance.

The technique of combining multiple features is usually referred to as feature interactions or feature cross. This technique is especially important for linear models because such features may unlock the linear separability of data points.
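A classic illustration of that last point: XOR-like data is not linearly separable in its original feature space, but adding the cross x1*x2 makes it separable for a logistic regression (a toy sketch, not a production recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-like data: no straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

plain = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Adding the feature cross x1*x2 unlocks linear separability
X_cross = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
crossed = LogisticRegression(C=1e6, max_iter=1000).fit(X_cross, y).score(X_cross, y)

print(plain, crossed)
```

No linear classifier can exceed 75% accuracy on plain XOR, while the crossed version fits it perfectly, which is exactly why feature crosses matter most for linear models.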

11.1.3 Model predictions as a feature

As we discussed earlier, if a feature depends on another feature, any changes or updates to the latter may require corresponding changes or updates to the former. It creates maintenance/debugging challenges and increases the system’s complexity over time.

Model predictions can be thought of as a specific case of a feature where the output of the model is used as an input to another model or system. This approach is sometimes called model stacking. While using model predictions as features can be powerful and effective, it poses some engineering challenges and risks.

The simplest example of using model predictions as a feature is target encoding. In this approach, a categorical feature is encoded by the mean target value (with a certain degree of regularization) and used as a feature in the model. However, there is a risk of data leakage when the encoding is based on information from the training data that is not available during inference. This can result in overfitting and poor performance on new data if we don't use advanced validation techniques like nested cross-validation (see chapter 7).
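A leakage-aware sketch of smoothed target encoding with out-of-fold statistics; the smoothing constant here is an arbitrary assumption that should be tuned per problem:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c", "a", "c"],
    "target": [1, 0, 1, 1, 1, 0, 1, 0],
})

global_mean = df["target"].mean()
smoothing = 2.0  # regularization strength (assumed; tune per problem)
df["city_te"] = np.nan

# Out-of-fold encoding: each row is encoded using statistics
# computed on the other folds only, so its own target never leaks in
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    stats = df.iloc[train_idx].groupby("city")["target"].agg(["mean", "count"])
    smoothed = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
        stats["count"] + smoothing
    )
    df.loc[df.index[val_idx], "city_te"] = df["city"].iloc[val_idx].map(smoothed)

# Categories unseen in a fold's training part fall back to the global mean
df["city_te"] = df["city_te"].fillna(global_mean)
print(df)
```

At inference time, the encoding is computed once on the full training set; the out-of-fold trick is only needed to produce unbiased training features.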

Another example is using third-party models (e.g., weather forecasts as a feature in a demand prediction model). While weather data can be highly informative, there is a risk that historical forecasts (the forecasts as they were actually made at the time) may not be available for training. In such cases, forecasts with the necessary history are preferable to forecasts with higher precision. Besides, relying on external data sources introduces additional dependencies and risks beyond the ML team's control.

Finally, using third-party or open source models as feature extractors in deep learning systems can pose risks, too. While the generated embeddings can absorb useful patterns in the data, there is a danger of model drift or instability if the external model is updated without proper versioning and vice versa—with no updates, the model may lose its value due to a drift in your data. This can result in unexpected behavior and drastically drop the performance of your ML system.

To mitigate these risks and challenges, it is important to design the feature engineering pipeline carefully and have robust testing and monitoring procedures in place (the former is described in chapters 10 and 13; the latter can be found in chapter 14). This can include using cross-validation and other techniques to prevent leakage, validating external data sources and models, and having processes in place for monitoring and updating features over time.

11.2 Feature importance analysis

Once the initial set of baseline features has been selected for the model, understanding which features affect the model’s predictions the most can provide valuable insights into how the model makes decisions and where further improvements can be made.

ML models can often be seen as black boxes that provide no insight into how they arrive at their predictions. This lack of transparency can be problematic for engineers, stakeholders, or end users who must understand the rationale behind the decisions a given model provides.

In pursuing model transparency, we employ two key concepts: interpretability and explainability. Both are aimed at demystifying the workings of an ML model:

Feature importance analysis serves as a tool for achieving both interpretability and explainability, as it helps pinpoint the features that greatly contribute to the model’s predictions. The results of feature importance analysis are collected as part of the training pipeline artifacts and may play a role in the model verification procedure, which delivers the “to deploy, or not to deploy” verdict to a freshly trained version of the model (you can find more details in chapter 13). A good example here is a system that determines the cost of a trip in a taxi aggregator application, as shown in figure 11.2.

figure
Figure 11.2 An example of a taxi aggregator app’s UI that clarifies why its dynamic pricing algorithm chooses this particular price in this area and during this time of the day

Under the hood, the app works with all the crucial features and analyzes the current live data like traffic density, weather conditions, and so on, when determining the end price. What it also does, however, is provide the rationale behind the suggested price in a convenient and user-friendly form. With this delivery, the user understands why a typically cheap ride they take on a regular basis suddenly goes up in price.

In addition, feature importance analysis can increase trust in our ML system. This is particularly important in high-stakes domains such as medicine and finance. While the General Data Protection Regulation (GDPR) does not strictly enforce explainability, it does suggest a level of transparency in automated decision-making, which could be beneficial or even essential in many ML applications.

Identifying the most important features explains the variables driving the model’s predictions and the reasons behind them. This information can help optimize those features to boost the model’s performance and remove irrelevant or redundant features to improve efficiency. Additionally, it can guide us through debugging by, for example, detecting overfitting or evaluating the usefulness of newly added features.

11.2.1 Classification of methods

Let’s explore methods of feature importance analysis and how they can be applied to improve the transparency and performance of an ML system.

Navigating the terrain of feature importance analysis can be daunting, but having a map of available methods can show us the right direction. These methods can be broadly classified based on their properties, such as type of model, level of model interpretability, and type of utilized features.

Classical ML vs. deep learning

The methods applied for feature importance analysis can differ substantially between classical ML and deep learning models. For classical ML models, where features are often manually selected based on domain knowledge or statistical analysis, determining feature importance is straightforward—we can either directly inspect model weights and decision rules or exclude/modify a separate feature to investigate its contribution to the model’s prediction.

On the other hand, deep learning models, which automatically learn feature representations from data, present unique challenges for importance analysis. Given the complex, nonlinear transformations and the high level of abstraction, there is more involved than simply looking at the model’s parameters to understand feature importance. Instead, we rely on advanced techniques like saliency maps, activation maximization (read more in “Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images” by Aravindh Mahendran et al.), or layer-wise relevance propagation (read more in “Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers” by Alexander Binder et al.) and its successors to make sense of what is happening inside neural networks. Please bear in mind that this list of examples is not exhaustive, and none of them is truly universal, as the problem of explainable deep learning is not solved in general and remains an active research area.

Model specific vs. model agnostic

Model-specific methods use the structure and parameters of the model to estimate feature importance. For example, in tree-based models, we can count how many times a particular feature was used for splitting during training, or sum the total gain it contributed across all splits. Similarly, for linear models we can look at the magnitude and sign of the coefficient assigned to each feature.
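For instance, scikit-learn's gradient boosting exposes gain-based importances directly; here is a sketch on synthetic data where only the first three features carry signal:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False, the 3 informative
# features occupy the first 3 of 10 columns
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Model-specific importance: normalized total impurity gain over all splits
for i, imp in enumerate(model.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```

The importances sum to one, and the bulk of the mass should land on the informative columns; the noise columns receive only residual credit.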

In their turn, model-agnostic methods treat a model as a black box. They often involve perturbing the input data and observing the effect on the model’s output. Examples of model-agnostic methods include

Individual prediction vs. entire model interpretation

Another important distinction is whether the methods are designed for individual predictions or for interpreting the entire model (see figure 11.3). Methods that focus on individual predictions estimate the importance of features for a particular input, delving into why the model has made a particular decision. On the other hand, methods that interpret the entire model estimate the importance of features in a more general sense, elaborating on the overall behavior of the model.

Some examples of methods that focus on individual predictions include local interpretable model-agnostic explanations (LIME), which approximates the decision boundary around a particular input using a simpler, more interpretable model (e.g., linear regression); see the paper “Why Should I Trust You?” by Marco Tulio Ribeiro et al. Another is anchor explanations, which identify a rule that sufficiently “anchors” a decision, making it interpretable by humans; learn more in the paper “Anchors: High-Precision Model-Agnostic Explanations” by Marco Tulio Ribeiro et al.

figure
Figure 11.3 A taxonomy of model explanation methods

Often we use a combination of methods to mitigate the limitations of individual approaches and gain a more complete understanding of the model. Keep in mind, though, that no one-size-fits-all method can provide a definitive answer to all feature importance questions, and the choice of methods should be tailored to the specific problem and context.
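As one concrete model-agnostic technique that interprets the entire model, permutation importance shuffles one column at a time on held-out data and measures the drop in score (a sketch on synthetic data where only the first two features are informative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False, the 2 informative features are the first 2 columns
X, y = make_regression(n_samples=600, n_features=5, n_informative=2,
                       shuffle=False, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and measure the drop in validation score;
# the model is treated strictly as a black box
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```

Because it only needs predict-and-score access, the same code works unchanged for any model class, which is precisely what "model agnostic" means.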

11.2.2 Accuracy–interpretability tradeoff

Highly interpretable models may sacrifice accuracy in favor of transparency, and vice versa—models that achieve high accuracy often do so at the cost of interpretability (see figure 11.4). Modern large language models based on transformers, such as GPT, provide a vivid example. They have revolutionized the field of ML by achieving state-of-the-art performance in a wide range of natural language processing tasks. However, they are also often highly complex, with billions of parameters, making it difficult to understand how they arrive at their decisions.

figure
Figure 11.4 The more sophisticated the model we use (and therefore, usually the more accurate), the less explainable it becomes.

The accuracy–interpretability tradeoff remains a challenging problem as new unexplored architectures arrive. The choice of a method should be tailored to the specific problem and context, considering factors such as the importance of interpretability, the complexity of the model, and the desired level of accuracy.

11.2.3 Feature importance in deep learning

Feature importance analysis for models that work with tabular data is a well-understood problem with clear solutions: we have easily separable features and well-known tools to measure how each of them influences the model, the target variable, or the final metric.

However, in the context of deep learning, especially with data types such as images, audio, or text, feature importance can become less clear and more challenging. Deep learning models, by nature, automatically learn hierarchical representations from the data, often in a highly abstract and nonlinear manner. In these cases, a “feature” can refer to anything from a single pixel in an image to a single word or character in a text or a specific frequency in an audio signal to complex attributes, like the location of an object in an image, the sentiment of a sentence in a text, or a specific sound pattern in a voice record.

Despite that, it is still possible to gain insights into what the model considers important in raw input data and which patterns it pays attention to. Let’s explore a few techniques for feature importance analysis in deep learning:

figure
Figure 11.5 Examples of output saliency maps generated by different methods
figure
Figure 11.6 Visualization of an attention head in the Encoder-Decoder Transformer

As you can see, we start to observe the parallels with classical ML. Although deep learning models present unique challenges in feature importance analysis, there are still methods that can provide insights into how the model makes decisions and which patterns are the essential predictors for the target variable.

11.3 Feature selection

Perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.
— Antoine de Saint-Exupéry

In the previous sections, we’ve learned about the art of feature engineering and how to transform raw data into meaningful features. However, not all features are equally useful; some may be irrelevant, redundant, or too complex for our model to handle effectively.

This is where feature selection comes in. By carefully selecting the most informative features, we can improve our system’s performance and interpretability while reducing its complexity and training time. We will explore the techniques, best practices, and potential pitfalls of feature selection and learn how to choose the right features for our specific ML problem.

11.3.1 Feature generation vs. feature selection

The feature generation and feature selection processes in ML can be compared to gardening. Similar to gardeners who plant various seeds in the soil, we generate a range of features, explore new data sources, experiment with different feature transformations, and brainstorm new ideas that might improve the model’s performance.

However, just as not all plants in the garden will thrive, not all features will benefit the model, and at some point, we will have to prune away dead or unproductive plants (in our case, discard irrelevant or redundant features) to sustain healthy growth. This cycle of nurturing and pruning, of adding and reducing, is a constant in the life of ML systems as we continually refine and improve our feature sets.

The ancient Greek philosopher Heraclitus once said, “Opposition brings concord. Out of discord comes the fairest harmony.” This also holds true in ML, where we achieve optimal performance by keeping the balance between generating new features and carefully selecting the most informative ones.

11.3.2 Goals and possible drawbacks

You may ask, “Okay, but why care so much about feature selection in the first place?” There are certain benefits to it:

For convenience, we have gathered all the benefits of feature selection into a single scheme in figure 11.7.

figure
Figure 11.7 Reasons for feature selection

In real-time applications, the need for speed often takes precedence, even if it means compromising the model’s accuracy to fulfill SLAs. For instance, in speech recognition systems like those used in virtual assistants, users expect instant and accurate transcription of their spoken words into text. Even the most minor delays could disrupt user experience and make the system appear less efficient.

Perfect personalization becomes worthless if it slows prediction down by 300 ms and degrades the user experience. Therefore, lightweight personalization with moderate quality is more appropriate than a model that accumulates all possible inputs from a user but makes them wait.

There are also potential risks and drawbacks of feature selection besides the balancing between computational time and accuracy:

Besides these problems, if done regularly, feature selection adds a computationally intensive stage to the training pipeline that we should also consider, especially when greedy wrapper methods are used.

11.3.3 Feature selection method overview

There are various methods available for feature selection, each with its pros and cons (see figure 11.8). The most common approaches are filter, wrapper, and embedded methods. Let’s take a closer look at each of the three.

figure
Figure 11.8 Families of feature selection methods

Filter methods work by filtering features independently from the model, using simple ranking rules based on the statistical properties of a single feature (univariate methods) or the correlation with other features (multivariate methods). These methods are easily scalable (even for high-dimensional data) and perform quick feature selection before the primary task.

In univariate filter methods, features are ranked by their intrinsic properties, such as variance, granularity, consistency, or correlation with the target. Afterward, we keep the top-N features as our subset and either fit the model or apply more advanced, computationally intensive feature selection methods as a second selection layer.

In multivariate methods, we analyze features in comparison with each other (e.g., by estimating their rank correlation or mutual information). If a pair of features represent similar information, one of them can be omitted without affecting the model’s performance. For example, the feature interaction score (regardless of the way it is measured) can be incorporated into an automatic report. When the score is high, it triggers a warning for potential reduction in the model’s performance before the training begins.
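A minimal multivariate filter can drop one feature from each highly correlated pair; the 0.95 threshold and the toy "height in two units" pair are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, 500)})
# A near-duplicate feature: the same quantity in inches plus tiny noise
df["height_in"] = df["height_cm"] / 2.54 + rng.normal(0, 0.1, 500)
df["weight_kg"] = rng.normal(70, 12, 500)

corr = df.corr().abs()

# Keep only the upper triangle so each pair is examined once,
# then drop one feature from every pair above the threshold
threshold = 0.95
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)  # ['height_in']
```

The same loop can feed an automatic report: a non-empty `to_drop` list is exactly the kind of warning described above, raised before training begins.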

Wrapper methods search for the feature subset that most improves the model's results as measured by a chosen metric. We call them wrappers because the learning algorithm is literally “wrapped” by these methods. They also require designing the right validation schema nested into outer validation (to choose the right validation schema, see chapter 7).

Wrapper algorithms include sequential algorithms and evolutionary algorithms. Examples of sequential algorithms are the sequential forward selection, where features are included one by one starting from the empty set; SBE (sequential backward elimination), where features are excluded one by one; and their hybrids—floating versions when we allow inclusion of an excluded feature, and vice versa. In evolutionary algorithms, we stochastically sample subsets of features for consideration, effectively “jumping” through the feature space. A common example of an evolutionary algorithm is to run a variation of differential evolution in a binary feature mask space where “1” indicates an included feature and “0” denotes an excluded one.

The main disadvantage of these methods is that all of them are computationally intensive and often tend to converge to local optima. Despite that, they provide the most accurate evaluation of how the subset affects the target metric. Use them carefully, especially if your hardware specs are limited.
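Sequential forward selection is available off the shelf in scikit-learn; the choice of four features and a linear model here is purely illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Greedily add the feature that improves the CV score the most,
# starting from the empty set, until 4 features are selected
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

Switching `direction` to `"backward"` gives sequential backward elimination; note that even this toy run refits the model dozens of times, which previews the computational cost discussed above.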

With embedded methods, we use an additional “embedded” model (which may or may not be of the same class as our primary model) and make decisions based on its feature importance. A good example is the Lasso regression, due to the ability of the L1-regularization to turn the coefficients to zero if they are not relevant to the target variable, as shown in figure 11.9.

figure
Figure 11.9 Lasso regression eliminates features one by one by reducing their coefficients to zero as L1 regularization term grows.

Another widely used feature selection algorithm is recursive feature elimination (RFE), which consists of training and removing the worst K features on each step based on the embedded model’s feature importances.

Embedded methods lie somewhere between filter methods and wrapper methods as far as the required computation and selection quality are concerned.

When it comes down to choosing the methods to start with, we prefer rough cutoffs for the initial reduction of the number of features; a Lasso selector or RFE can follow, depending on which one outputs a more meaningful subset. Computationally expensive methods may perform better, but faster methods tend to be good enough for initial feature pruning, especially if you suspect that some features are total garbage.

There are plenty of simple yet useful methods that serve as reasonable feature selection baselines. For example, we can take a feature, shuffle its values, concatenate it to the initial dataset, and train a new model. If the importance of a given feature is less than the importance of this random feature (often called a shadow feature), it is likely irrelevant to the problem. We can label this algorithm as a trivial instance of the wrapper methods.
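The shadow-feature baseline described above can be sketched as follows (thresholding against a single shadow copy is a simplification; methods like Boruta use many shadow copies and statistical tests):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, columns 0-2 are informative, 3-5 are pure noise
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rng = np.random.default_rng(0)
# Shadow feature: a copy of column 0 with its values shuffled,
# destroying any real relationship with the target
shadow = rng.permutation(X[:, 0]).reshape(-1, 1)
X_aug = np.hstack([X, shadow])

model = RandomForestClassifier(random_state=0).fit(X_aug, y)
importances = model.feature_importances_

# Any feature weaker than the shadow is a candidate for removal
shadow_importance = importances[-1]
weak = [i for i in range(X.shape[1]) if importances[i] < shadow_importance]
print(weak)  # likely some of the pure-noise columns 3-5
```

Despite its simplicity, this baseline is hard to beat as a sanity check: a feature that loses to shuffled noise rarely deserves a place in the model.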

11.4 Feature store

Now we have arrived at a powerful design pattern encompassing many techniques mentioned in the chapter into a single entity—a feature store. It enables teams to calculate, store, aggregate, test, document, and monitor features for their ML pipelines in a centralized hub.

Imagine investing weeks of effort into engineering sophisticated features only to stumble upon a conversation with a colleague during a coffee break where you discover that another team has already implemented and tested exactly the same features. Alternatively, a colleague approaches you, seeking an implementation of a specific feature. While you confirm its existence, you realize that your code lacks proper documentation and reusability due to heavy dependencies on other code in your repository. Consequently, your colleague decides that an easier way is to reinvent the wheel and develop their own implementation. This inconsistency may lead to each engineer implementing and calculating each feature on their own, causing the company to waste an unacceptably large amount of resources (see figure 11.10).

figure
Figure 11.10 Not having a feature store in place can lead to inconsistency and excessive spending.

These scenarios illustrate only a tiny fraction of the challenges that can be addressed by using a feature store. By adopting it, we step away from a fragmented approach where each team independently implements and calculates features. Instead, we embrace a unified system that maximizes the reusability of features, as depicted in figure 11.11.

figure
Figure 11.11 A feature store that maximizes the reusability of features implemented in a unified manner

11.4.1 Feature store: Pros and cons

Designing, building, and managing a feature store may present certain challenges, but its benefits could far outweigh the drawbacks. Let’s explore some of the advantages of having a feature store (figure 11.12 illustrates those ML areas that benefit the most from having feature stores):

figure
Figure 11.12 A landscape of ML problems with various needs for a feature store

The disadvantages of having a feature store are straightforward:

Let’s focus in detail on the last disadvantage. Not all ML problems can be optimized through a feature store. A typical beneficial area for having a feature store is mainly tabular data (structured data) with multiple data sources of various granularity and SLA. Most pure deep learning problems (typically those having one or two sources of unstructured data, like text or images) are less suitable for feature stores. However, given that multimodal data usage is becoming more popular these days, the concept of a feature store has become more universal. Imagine a typical online marketplace; years ago, ML systems would be based on tabular data like the history of sales and clicks, whereas now it is a common pattern to include items’ images, descriptions, and user reviews. The simplest way to include such data is to extract embeddings via a pretrained neural network (e.g., CLIP for images or SentenceTransformers for short texts) and treat them as features. This approach closes the loop: such features can be stored in a feature store as well for the same reasons as “classic” features, thus saving processing time and ensuring consistency across the system. As a bonus, storing such features in a centralized storage unlocks additional usage patterns. For example, using a vector database (such as Qdrant or Faiss) for storage allows you to fetch similar items quickly and use them in downstream models.
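A minimal sketch of this embeddings-as-features pattern follows. The hash-based `embed()` function is a stand-in for a real pretrained encoder (CLIP for images, SentenceTransformers for text), and a plain dictionary stands in for the feature store; both are assumptions made only to keep the sketch self-contained:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real pretrained encoder (e.g., SentenceTransformers).
    # A deterministic hash-seeded vector keeps this sketch runnable.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=dim)

# Toy "feature store" table keyed by item id: embeddings are written
# once and reused by every downstream model.
feature_store: dict[str, np.ndarray] = {}

catalog = {"item_1": "red cotton t-shirt", "item_2": "blue denim jacket"}
for item_id, description in catalog.items():
    feature_store[item_id] = embed(description)

# Downstream usage: cosine similarity between stored item embeddings,
# the kind of lookup a vector database would serve at scale.
a, b = feature_store["item_1"], feature_store["item_2"]
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print("similarity:", similarity)
```

With a real encoder, the write path would typically run as a batch job over new items, while the similarity lookup would be delegated to a vector database such as Qdrant or Faiss.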

The best way to start is to analyze the existing extract, transform, and load pipelines of all teams. The following are questions you should be ready to ask:

If we conclude that numerous teams would benefit from having a feature store, it’s a sign to invest efforts into designing and building a centralized feature store.

Here we would like to stress once again that building your own custom feature store is a huge and expensive project. That’s one of those cases when you should consider reusing an open source solution or a product from a third-party vendor. Some popular options are Feast, Tecton, Databricks Feature Store, and AWS SageMaker Feature Store.

11.4.2 Desired properties of a feature store

In this section, we will touch on useful patterns and properties in designing feature stores and highlight important problems you must address. Not all of them will necessarily apply to your feature store in particular, but our goal is to make sure each is well covered in the book.

Read-write skew

Writing and reading are the two essential sides of any feature management system, so during the design stage, we need to know the expected load in terms of read and write operations (and the amount of data involved). What read latency is critical for us? How often should we recalculate existing features? Sometimes calculating a feature at runtime is faster than, or comparable to, fetching one from storage.

Writing is commonly done in batches. We prefer not to recompute a whole dataset when we can update it incrementally, although full recomputation is a widely used antipattern. Moreover, recomputing the features for the last few days helps us overwrite temporary corruption or unavailability on the data warehouse side that may have occurred just before our daily feature update. It is worth noting that “commonly” does not mean “uniformly”: features can be appended or updated on different schedules. For example, some are computed by a long daily job, and some are lightweight enough to be streamed in near real time.

The critical aspect of reading features is usually latency. We must ensure that the infrastructure we are building for our feature store meets our nonfunctional requirements. Sometimes we can combine precomputed features with real-time features (those that require the most recent events) during read operations, as shown in figure 11.13.
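The combined read path can be sketched as follows; the stores, user ids, and feature names are toy stand-ins for what would be an online key-value store plus a streaming source in production:

```python
from datetime import datetime, timezone

# Precomputed (batch) features, refreshed by a nightly job.
batch_features = {
    "user_42": {"purchases_30d": 7, "avg_basket_30d": 31.5},
}

# Recent raw events that the nightly job has not seen yet.
recent_events = {"user_42": [{"type": "click", "ts": datetime.now(timezone.utc)}]}

def realtime_features(user_id: str) -> dict:
    # Lightweight features computed on the fly from the latest events.
    events = recent_events.get(user_id, [])
    return {"clicks_since_last_batch": sum(e["type"] == "click" for e in events)}

def read_features(user_id: str) -> dict:
    # The consumer sees one flat feature vector, regardless of whether a
    # feature came from the batch store or was computed just now.
    return {**batch_features.get(user_id, {}), **realtime_features(user_id)}

print(read_features("user_42"))
```

The key design point is that the consumer’s interface stays the same whichever path a feature took to arrive.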

figure
Figure 11.13 Online features and batch features are written in a feature store in different ways but are consumed in the same manner.

Precalculation

The DRY principle in programming stands for “don’t repeat yourself.” This principle leads us to the fundamental heuristic behind any optimization: if it’s possible to avoid computing something, it should be avoided.

In particular, one of the most straightforward patterns is calculating features in advance rather than at the moment we ask the feature store to assemble a dataset. For example, a good time to update features is right after our database finishes processing orders for the previous day.

A closely related optimization technique is to split the calculation into multiple steps:

  1. We preaggregate raw data (e.g., clicks, prices, revenue) into item-day sums.
  2. We aggregate these sums into desired windows (e.g., 7/14/30/60 days).

This approach helps us reuse features calculated yesterday or N months ago (instead of running almost the same computation every day) and merge the computation of similar features that share part of their lineage or have overlapping aggregation windows, as shown in figure 11.14.
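The two-step aggregation above can be sketched with pandas; the data, column names, and window sizes are illustrative:

```python
import pandas as pd

# Raw events: one row per sale.
raw = pd.DataFrame({
    "item_id": ["a", "a", "b", "a", "b"],
    "day": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01",
                           "2024-01-02", "2024-01-02"]),
    "revenue": [10.0, 5.0, 7.0, 3.0, 4.0],
})

# Step 1: preaggregate raw events into item-day sums (computed once,
# stored, and reused by every window below).
item_day = raw.groupby(["item_id", "day"], as_index=False)["revenue"].sum()

# Step 2: roll the item-day sums into the desired windows, reusing the
# same level-1 table instead of rescanning the raw events each time.
windows = {}
for days in (7, 30):
    windows[days] = (item_day.set_index("day")
                             .groupby("item_id")["revenue"]
                             .rolling(f"{days}D").sum()
                             .rename(f"revenue_{days}d"))

print(windows[7].reset_index())
```

Tomorrow’s update only needs to append one new item-day row per item and re-roll the windows, rather than re-reading the full event history.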

figure
Figure 11.14 A hierarchy of aggregations used in a feature store

Feature versioning

A good rule of thumb is to treat each update of a feature either as a new feature (not the best solution: it will generate many near-duplicate features after each tiny optimization and clutter your code) or as a new version of the old feature. But why is this important at all?

Suppose some engineer implements a new version of a feature, and your system treats it as the same feature as before, with no differentiation. You’ll be lucky if the calculation is exactly the same but, say, done faster. But if the calculation principle changes (even a bit), or worse, if the engineer changes the data source for the same feature, it leads to inconsistency in the precalculated feature. The values of the feature before and after the update can differ significantly, and you don’t want to mix values from the old and new calculation methods. A well-designed feature store will automatically overwrite the updated feature or, even better, write it to a new table while backfilling all previous values over the available history.

Each dataset should capture in its metadata not only which features it contains for a given range of timestamps but also the versions of those features at the time of calculation. This allows us to easily roll back to an old version of a dataset and completely reproduce the results of an old model that was developed, for example, 2 months ago. The pattern is similar to freezing library versions for an application.
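A minimal sketch of such metadata, using hypothetical dataclasses (the real schema will depend on your store):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int   # bumped on every change to the calculation logic
    source: str    # where the feature is computed from

@dataclass
class DatasetMeta:
    start: str
    end: str
    features: list[FeatureVersion] = field(default_factory=list)

# The dataset "freezes" the exact feature versions used to build it,
# much like pinning library versions for an application.
meta = DatasetMeta(
    start="2024-01-01", end="2024-03-01",
    features=[
        FeatureVersion("purchases_30d", version=2, source="orders"),
        FeatureVersion("avg_basket_30d", version=1, source="orders"),
    ],
)

# Two months later, reproducing the old model means requesting
# exactly these (name, version) pairs from the feature store.
pinned = {(f.name, f.version) for f in meta.features}
print(pinned)
```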

Feature dependencies or feature hierarchy

Not all features are easy to compute from the raw data in our data warehouse. Computing them directly may require expensive queries and, again, does not reuse the results of previous computations. This leads us to the concept of feature dependencies, or feature hierarchy, where each feature depends on other features and/or data sources.

A pattern we discussed earlier, preaggregation, can be considered a parent feature for the final features. We highlight these parent features (level 1 features) along with their child features (level 2 features, etc.) in figure 11.15.

figure
Figure 11.15 A graph of feature dependencies

The way we obtain each feature from its data sources is called its lineage, which is effectively a directed acyclic graph (we discussed them in chapter 6). We track the lineage of each feature to know the order in which to run feature calculations and to determine whether child features need updating after a change in one of their parents’ implementations (which triggers a wave of feature version updates) or after any other parent version update, corruption, or deletion.

Lineage tracking also helps engineers and analysts rapidly explore the source of each feature, thus simplifying debugging and improving their understanding of the origin of outliers or other surprising behavior.
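With Python’s standard `graphlib`, the mechanics can be sketched as follows; the feature names form a toy DAG chosen for illustration:

```python
from graphlib import TopologicalSorter

# Feature lineage: each feature lists the features/sources it depends on.
lineage = {
    "orders_raw": [],
    "item_day_revenue": ["orders_raw"],         # level-1 preaggregation
    "revenue_7d": ["item_day_revenue"],         # level-2 windows
    "revenue_30d": ["item_day_revenue"],
    "revenue_ratio_7_30": ["revenue_7d", "revenue_30d"],
}

# The order to run feature calculations in: a topological sort of the DAG.
order = list(TopologicalSorter(lineage).static_order())
print("computation order:", order)

def descendants(node: str) -> set[str]:
    # Everything downstream of `node` -- these features need a version
    # bump and recomputation if `node`'s implementation changes.
    out: set[str] = set()
    frontier = [node]
    while frontier:
        cur = frontier.pop()
        for feat, deps in lineage.items():
            if cur in deps and feat not in out:
                out.add(feat)
                frontier.append(feat)
    return out

print("changing item_day_revenue invalidates:", descendants("item_day_revenue"))
```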

Feature bundles

We often apply the same transforms and filters to closely related columns from the same data sources (e.g., price before discount, price after discount, and price after applying promo). These similar features have the same key and the same lineage.

It means there is no need to work with features as isolated columns and write a separate data pipeline for each. We naturally prefer to implement our feature store in such a way that it consolidates computations for features derived from the same data sources. Thus, a single entity of computations would be a batch of features, not a single one:

<day, user_id, item_id, f1, f2, f3, f1 / f2, …>

Having merged their computational graphs, we also prefer to operate on such sets of similar features as a whole in the API (or UI), adding them to a dataset simultaneously.
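A sketch of such a bundle, computed in a single pass over the shared source (the data, column names, and function name are illustrative):

```python
import pandas as pd

# Raw source shared by the whole bundle of price features.
events = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-01"],
    "user_id": [1, 2],
    "item_id": ["a", "b"],
    "price_before_discount": [100.0, 50.0],
    "price_after_discount": [80.0, 45.0],
})

def price_bundle(df: pd.DataFrame) -> pd.DataFrame:
    # One pipeline computes the whole bundle: shared key, shared lineage,
    # and derived features (here a price ratio) in a single pass.
    out = df[["day", "user_id", "item_id"]].copy()
    out["f1"] = df["price_before_discount"]
    out["f2"] = df["price_after_discount"]
    out["f1_div_f2"] = out["f1"] / out["f2"]
    return out

bundle = price_bundle(events)
print(bundle)
```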

11.4.3 Feature catalog

Speaking of the UI, the final secret ingredient of a feature store is a feature catalog. A feature catalog is a service with a web UI where ML engineers, analysts, or even your nontechnical colleagues can search for features and examine their implementation details.

Other information that can be shown to users includes feature importance, value distribution, category, owner, update schedule (daily, hourly), key (user, item, user-item, item-day, category-day), feature lineage, the ML services that consume the feature, and other meta info.

11.5 Design document: Feature engineering

As we mentioned in the introduction to this chapter, features are the backbone of your ML system’s prediction ability, and for this reason alone, they deserve their spot in the design document. We will cover them in both our design documents.

11.5.1 Features for Supermegaretail

After building our baseline solution, we need to determine the next steps for improving it. One of the primary ways to do this is to craft features that will help the model extract useful patterns and relationships from the raw data.

11.5.2 Features for PhotoStock Inc.

A potential set of features for PhotoStock Inc. will be completely different from that for Supermegaretail.

Summary
