6 Gathering datasets

This chapter covers

Data sources
Turning raw data into datasets
Distinguishing data from metadata
Defining how much is enough
Solving the cold start problem
Looking for properties of a healthy data pipeline

In the preceding chapters, we’ve covered the inherent steps in the preparation for building a machine learning (ML) system, including the problem space and solution space, identifying risks, and finding the right loss functions and metrics. Now we will talk about an aspect your ML project simply won’t take off without—datasets. We will compare them with vital elements of our lives. Just like you’ll need fuel to start your car or a nutritious breakfast to get a charge before a busy day at work, an ML system needs a dataset to function properly.

There is an old popular quote about real estate: the three most important things about it are location, location, and location. Similarly, if we were to choose only three things to focus on while building an ML system, those would be data, data, and data. Another classic quote from the computer science world says “garbage in, garbage out,” and we can’t doubt its correctness.

Here we’ll break down the essence of working with datasets, from finding and processing data sources to properly cooking your dataset and building data pipelines. As a culmination of the whole chapter, we will look at datasets as a part of design documents, using the examples of Supermegaretail and PhotoStock Inc.

6.1 Data sources

You can use absolutely any source to find data for your dataset. The availability and quality of these sources will depend on your work environment, your company’s legacy, and many other factors. Which should be addressed in the first place will depend mostly on the goals of your ML system. Here we list the most popular data sources or their categories while accompanying them with real-world examples:

Global activities —This is a huge category of data sources that includes any activity in a single entity that is regularly recorded and stored on an ongoing basis. As an example, in the stock trading business, traders around the globe act on the stock market, and the result of their actions (trades and prices) later becomes available for other parties.
Physical processes —These are global changes happening on the planet as we speak. They can be environmental shifts monitored on various levels, from satellite images to microsensors on farming lands.
External databases —Certain third-party companies thrive on collecting domain-specific data using their proprietary methods and know-how. You will want to address them, if their data meets your requirements and needs.
Local business processes —Here we are going from global to local. The business itself can generate a huge amount of data as it operates and grows. If you work in eCommerce, purchase history can be your primary source of data.
Labeling by a dedicated team —Your company can hire a team of experts to generate labels for a specific problem.
Labeling by end system users —This is a similar approach, where the company may provide a user interface for end users. There, they will specify the inputs for your ML system.
Artificially generated datasets —This is where data is created by scientific simulators, rendered environments, or other synthetic sources. Items created with generative AI (e.g., image generators or large language models) can be attributed to this category as well.

Some data sources are unique, and having access to them may be a significant competitive advantage. Many big tech companies like Google and Meta are successful mainly because of their valuable user behavior data used for ad targeting. On the other hand, other datasets are easy to acquire; the information is either free to download or can be created for a nonrestrictive price (many datasets sold by data providers are relatively cheap). This doesn’t mean that cheap equals low quality, as it all depends on what kind of data you need. It might turn out that this particular free-of-charge source fits your ML system perfectly. There are also intermediate points on this spectrum, though, and not in terms of price. Access to data can be limited in some regions or be in a “gray zone” legality-wise. Mature companies tend to follow the law (and we recommend you do so as well!), while young startups with the YOLO mentality sometimes consider minor violations.

Some datasets become valuable when enriched or annotated/labeled. Annotation means combining a raw dataset with proper labels or, in other words, creating a closely tied dataset and connecting it with a new one. It is a popular pattern of mixing a unique proprietary dataset with a public data source and getting a much more valuable dataset as a result.

Ad tech companies, mentioned earlier, may benefit from joining datasets as well. Let’s take a look at a classic example. A company operates a free-to-play game, which means it has a lot of players, and only some of them pay money. The number and list of paying customers is kept secret and available only to the game’s publisher. At the same time, its partnering ad network has millions of detailed user profiles derived from behavior. When these datasets are combined (see figure 6.1), it opens a great marketing opportunity: the company can target its new ad to potential customers who are similar to its paying players. Data exchanges like this increase the efficiency of online marketing and thus are one of the powers driving the modern web.

Figure 6.1 Combining data sources of two completely different businesses can eventually benefit both.

When we’re talking about joining datasets, it’s not always the case of a straightforward connection between similar datasets (like a SQL join between two tables). An important concept here is multimodality, which is an intersection of various modalities in a dataset. In simple words, modality is a kind of information we receive; it is common to describe the world as multimodal for humans (we hear sounds, see colors, feel motions, etc.). In ML-related literature, multimodal datasets are those that combine various kinds of data to describe one problem. Think of images of an item being sold and its text description. Combining datasets of different origins and modalities is a powerful technique.

Speaking about combining data sources, Arseny once helped kickstart a startup as an advisor. The company worked in the agri-tech area, helping farmers and related companies to increase their operating efficiency, and its secret sauce was based on datasets. The way it worked was as follows. One data source was public, using satellite images of the planet available thanks to several space initiatives like Landsat by NASA and Copernicus by ESA (where you can fetch countless images of agricultural lands). But having those images alone could not bolster the efficiency of the startup, as it lacked information describing these agricultural lands. The main problem is most agricultural companies are far from innovative, and there is no single solid source of data on what crops were grown, the yield results, and more. Such data is poorly digitized but is really valuable for multiple business needs: it can be used to reduce the amount of fertilizer used, to estimate future prices for food commodities, and so on. Eventually, the team implemented smart ways of gathering such data and merging it with the huge photo database. An ML system built on these joint datasets helped the company to grow rapidly.

Defining data sources and the way they interconnect is the first cornerstone of solving the data problem for the system. But raw data is often almost useless until we make it available for the system and ML model, filter it, and preprocess it in other ways.

6.2 Cooking the dataset

Experienced engineers who work with data know that in the vast majority of cases, the data in its raw form is too raw to work effectively. Hence the name raw data, meaning a chaotically compiled, unorganized giant clump of information. Thus, initial raw datasets are rarely in good enough shape to use them as is. You need to cook the dataset to apply it to your ML system in the most efficient way possible.

We’ve gathered a list of techniques you can use to properly cook your dataset, with each technique presented in a separate subsection. This list is not strictly ordered, and the order of actions may vary depending on your domain, so there is no single universal answer. In some cases, you can’t filter data before labeling, while sometimes filtering happens multiple times throughout the cooking process. Let’s briefly touch on these techniques.

6.2.1 ETL

ETL, which stands for “extract, transform, load,” is a data preparation phase, such as fetching information from an external data source and tailoring its structure to your needs. As a simplified example, you can fetch JSON files from a third-party API and store them in local storage as CSV or Parquet files.

Note The high-level goal of ETL is to solve the data availability problem.

Data availability at this point implies two things:

Data can be fetched easily and effectively for a training process (e.g., if the target dataset is a product of multiple interactions from multiple sources, it is useful to understand how to make it fetchable with a single click, command, or call).
Data will be available for training and runtime phases. We don’t care if fetching data is effective for inference at this point (we will discuss it later, in chapter 11), but we need to guarantee the same data sources are available for inference.

Note Designing an effective ETL process is an art of its own, as it requires a good understanding of various data storages and data processing tools. We only scratch the surface of this topic in our book and recommend consulting other sources for a more in-depth understanding.

The important question here is: “Should I care about data storage and structure at this point?” The answer to this question lies in the following spectrum:

Sometimes datasets are small enough, and the chances of a sudden growth by multiple orders of magnitude are minuscule. It means you can choose almost any storage (e.g., the one that is already actively used in your organization or the one you’re most familiar with). You would be surprised how often inexperienced ML engineers tend to overengineer things here, like designing a distributed multicluster storage for a static tabular dataset with thousands of rows and tens of columns.
Sometimes it’s clear from the very beginning that your dataset will be huge and will grow rapidly, so you should pay close attention as you design the data model. We don’t consider ourselves world-class experts in data engineering of this kind and recommend checking out other books if this is the case for your system. Our favorite work on the subject is Designing Data-Intensive Applications by Martin Kleppmann.

6.2.2 Filtering

No data source is absolutely perfect. By perfect, we mean clean, consistent, and relevant to your problem. In the short story at the beginning of this chapter, we mentioned APIs that provide satellite images, but the company only needed those that weren’t too cloudy and were related to agricultural regions; otherwise, storing irrelevant data would significantly increase the costs. It meant that a good chunk of preliminary work of selecting appropriate satellite photos had to be done as the very first step.

Data filtering is a very domain-specific operation. In some cases, it can be done fully automatically based on a set of rules or statistics; in others, it requires human attention at scale. Experience shows that the end approach normally lies somewhere between those two extremes. A combination of the human eye and automated heuristics is what works best, with the following algorithm being a popular approach: look through a subset of data (either randomly or based on some initial insight or feedback), find patterns, reflect them in the code to extend coverage, and then look through a narrower subset.

While the absence of data filtering leads to extensive noise in the dataset and thus worse performance of the whole system, overly aggressive filtering may have a negative effect as well: in some cases, it may distort data distribution and lead to a worse performance on real data.

6.2.3 Feature engineering

Feature engineering means transforming the data view in a manner so it’s most valuable for an ML algorithm. We’ll cover the topic in more detail later in the book when it’s time to discuss intermediate steps, as it’s rarely detailed during the early stages of ML system design. At this stage, we tend to focus on the question of how to get initial data to make a baseline model, which requires some level of abstraction at the current stage.

Sometimes features are not engineered manually but created by a more complicated model; unlike “regular” features, they can be non-human-readable vectors. In these cases, it’s more precise to use the term representations, although at some level of abstraction, they’re the same thing: inputs to the ML model itself.

For example, in a big ML-heavy organization, there may be a core team building a model that generates the best possible representations of users, goods, or other items at scale. Their model doesn’t solve business problems directly, but applied teams can use it to generate representations for their specific needs. That is a popular pattern when we’re talking about such data as images, videos, texts, or audio.

6.2.4 Labeling

In many scenarios, a dataset itself is not too valuable, but adding extra annotations, often known in the ML world as labels, is a game changer. Deciding what kind of labels should be used is extremely important, as it dictates many other choices down the road.

Let’s imagine you’re building a medical assistance product, a system that helps radiologists analyze patients’ images. It is one of the most popular ML applications in medicine that is still very complicated to design properly. A use case may seem simple: a doctor looks at an image and judges if there is a malignancy, so you want doctors to label the dataset.

There are numerous ways to label it (see figure 6.2), including

Binary classification style—Is there a malignancy in the image?
Multiclass classification style—What kind of malignancy is in the image, if any?
Detection style—What region of the image contains a malignancy, if any?
Segmentation style—Exactly which pixels of the image represent malignancy, if any?

Figure 6.2 Various approaches to data labeling

These ways of labeling data require different labeling tools and different expertise from the labeling team. They also vary in terms of time required. The method affects what kind of model you can use for the problem, and, most importantly, it limits the product. Making the right decision in this case is impossible unless you follow the steps from chapter 2 and gather enough information about product goals and needs.

In some situations, labeling data with the proper level of detail is nearly impossible. For example, it could take too much time, there may not be enough experts to cover enough data, or the toolset is not ready. In these cases, it is worth considering using weakly supervised approaches that allow the use of inaccurate or partial labels. If you’re not familiar with this branch of ML tricks, we recommend “A Brief Introduction to Weakly Supervised Learning” () for an overview; more links for further reading are at Awesome-Weak- Supervision ().

Efficient data labeling requires a choice of a labeling platform, proper decomposition of tasks, task instructions that are easy to read and follow, easy-to-use task interfaces, quality control techniques, an overview of aggregation methods, and pricing. Additionally, best practices should be followed when designing instructions and interfaces, setting up different types of templates, training and examining performers, and constructing pipelines for evaluating the labeling process.

When datasets need to be labeled manually, there are two main ways to go: an in-house labeling team or third-party crowdsourcing services. In a nutshell, the former provides a more controlled and consistent labeling process, while the latter can scale up the annotation process quickly, possibly at a lower cost. The simplest heuristic of choosing one over another is based on the labeling complexity: once it requires specific knowledge or skills, it makes sense to develop an in-house team of curated experts; otherwise, using an external service is a good option.

There are dozens of crowdsourcing platforms available; the most popular is probably Amazon Mechanical Turk. As we opt for width over depth in this book, we will not focus on features of different platforms. Instead, we focus on generic properties of labeling with crowdsourcing:

Labelers tend to make mistakes even in the simplest tasks, such as binary classification of common objects. Most labelers on the platform work for multiple customers, so there is little chance they will memorize the nuances of the labeling instructions you’ve provided. As a result, labelers should be considered interchangeable, and the instruction should be as simple and nonambiguous as possible.
Some labelers are more attentive than others, and you may want to motivate them to dedicate more time to your tasks, not other gigs. Other labelers can be less attentive or adversarial; for example, they may try to use bots generating unverified labels instead of their own judgment. It is crucial to be able to distinguish the former from the latter. The most popular way of doing so is by including tests in the labeling process. Pick a share of a dataset with a label you’re confident in, and use it for labeling tasks so you can measure accuracy and other metrics of each particular labeler. You can also log some details of the labeling process for a clearer picture. If labels are generated very fast, it may be a strong signal of a bot or other illegal automation tool.
The labeling task should be designed in a way that reduces the variance caused by crowdsource workers. Consider the following scenario: a company is developing a chatbot and employs labelers to evaluate responses (questions and several possible answers). The most straightforward evaluation method here would be to ask labelers to rate each answer with a single number (e.g., 1 to 5). However, such labels might be inconsistent. A basic improvement in this case would be to introduce multiple evaluation criteria so that each answer is assessed separately based on factual correctness, tone of voice, and so on. These labels would be more consistent but still far from perfect. Another improvement would be asking for a pairwise comparison. Instead of assigning a number to each answer, labelers would perform a ternary comparison based on the criteria (version A is better / version B is better / A and B are equal). These labels are easier to collect, and the expected variance between labelers’ outputs is lower.

In the 18th century, the French mathematician and philosopher Marquis de Condorcet stated a theorem applicable to political science. De Condorcet revealed the following idea: if a group needs to make a binary decision based on a majority vote and each group member is correct with a probability p > 0.5, the probability of a correct decision by the group grows with the number of group members asymptotically. This could be a formal reason for collective decision-making in the early days of the French Revolution, and now the very same flawless logic is applicable to ML systems!

Let’s imagine we have three labelers annotating the same object for a binary classification. The problem is tricky, so each one is only right in 70% of cases. So we can expect a majority vote of three labelers to be correct in 0.7 * 0.7 * 0.7 + 0.3 * 0.7 * 0.7 + 0.7 * 0.3 * 0.7 + 0.7 * 0.7 * 0.3 = 78.4%. That’s a nice boost!

These numbers should be interpreted with a grain of salt. Condorcet’s jury theorem implies assumptions that are not exactly met here: labelers are not completely independent, and their origin of error is correlated (this may be the case when data samples are noisy and hard to read). But still, an ensemble of labelers is more accurate compared to a single labeler, and this technique can be used to improve labeling results.

This idea can be modified to decrease costs. One algorithm modification we have faced is the following:

Label the object with two labelers.
If they agree, the label is accepted.
If not, add three more votes and use the majority vote as a label.

With this modification, more votes are used for complicated samples that lack consistency, while fewer votes are required for simple cases.

The heuristic described here is a simple though powerful baseline. However, if data labeling is a key ingredient for your problem and you see the need to invest some effort, there is a lot of research dedicated to better design. Some of the materials we recommend are “A Survey on Task Assignment in Crowdsourcing” () and a tutorial by crowdsourcing vendor Toloka.ai ().

Labelers’ consistency and mutual agreement may and should be measured. While ML practitioners often tend to use regular ML metrics to estimate it, those who have a statistical background may recall a concept called interrater reliability and its separate set of metrics (Cohen’s kappa, Scott’s pi, and Krippendorff’s alpha, to name a few). These metrics help keep the labeling team’s work reliable, filter out unscrupulous labels (thus greatly improving the overall system performance), and, in some scenarios, give a solid upper bound for the performance of your model (recall the human-level performance we mentioned earlier in chapter 3).

Labeling teams require proper tooling, which may boost efficiency and improve the quality of created labels. If you go with third-party tooling, it’s very likely it will have a toolset for the most popular problems; for your own team, you need to set up or create your own. Aspects of choosing an existing solution or building a brand-new one are very domain specific, although the general approach is: the more popular the ML formulation of your problem, the more chances you have to find a modern quality toolset for the labeler team. As an example, there are multiple software solutions to create labels for image detection at scale, but once you need to annotate videos with 3D polygon meshes, you’re very likely to require something custom made.

Recent advances in large foundation models have become a game changer in the labeling process as well: they often can be used to gather initial labels in a few-shot setup with the accuracy higher than nonexpert human labelers. One of many examples of a tool used for large language model labeling is Refuel Docs (), notable for being focused on text data; however, using foundation models is also possible in other domains (e.g., addressing universal segmentation models like Segment Anything [] or multimodal solutions like LLaVA []).

There are various algorithmic tricks that may help simplify/reduce the efforts required for the data labeling aspect of building a system, but we will not go into an in-depth review of those in this book, as they are too subject-specific. What we would like to additionally highlight is that very often efforts invested in the data labeling process and tools can be a driver for the success of your ML system.

6.3 Data and metadata

While datasets are used for the ML system we’re building, there is a layer of information on top of it—let’s use the word metadata for it. Metadata is information that describes certain properties of your datasets. Here are some examples:

Time-related attributes —These include when the event happened and when it was processed and stored.
Source attributes —Source attributes tell whether or not data is gathered from multiple sources.
User attributes —If a user is somehow involved in dataset generation, which is not a rare situation, their metadata becomes important metadata of related data samples.
Versions —If the data is processed, it can be useful to understand which version of software has been involved in processing.

Metadata is crucial for data flows and guarantees their consistency. Let’s get back to the previous scenario with a medical application. Imagine that a company hired 10 medical professionals to label data for your system. After working for over a year, they have finally labeled a big dataset. Suddenly, you get an email: one of the labelers turns out to be a fraud, his diploma is fake, and his license has been suspended. This is an unacceptable situation because a tangible part of the data can be considered unreliable, which jeopardizes the entire dataset. So you enact the only adequate solution, which is to stop using his labels in the model ASAP. While this example is somewhat exaggerated, tens of less significant problems will occur in your system over time. Problems with data sources, regulatory interventions, and attempts to interpret model errors will make you refer to metadata on a regular basis.

As we discussed earlier, while recipes for datasets may vary, most likely, your dataset will be cooked properly. It’s okay to reveal new aspects of the problem, thus adjusting the dataset structure over time, and reflect them accordingly in how the dataset was processed. All these differences should be reflected in metadata.

Another source of changes is testing and bug fixing. Imagine that you work for a ridesharing company and are to solve a problem of estimating how long it will take to reach point A from point B. You start with historical data your company generated, and there is a preprocessing job that extracts the information from logs and stores it in your favorite database, so you end up with a table like latitude_from, longitude_ from, datetime_start, datetime_finish, distance, data_source.

At some point, the company makes a big acquisition of a competitor, and you decide to add its data to your ETL process. A new engineer on the team starts coding, new sources are attached, new data points are added to your dataset with the same schema, and suddenly your model performance drops. What happened?

Fast forward: your company stored distance in kilometers, while the acquired team stored distance in miles. An engineer who handled the integration knew it and implemented the support of miles for the new source but accidentally enabled it for the old one as well. So new data samples are correct for the new data source but not for the old one. If you had a metadata field like preprocessing_function_version, you could easily find affected samples. Without it, you need to gather the dataset from scratch after the code defect is discovered.

Defects like this are not the only scenario where you may need to reveal the date of data sample origin. Some ML systems may affect data sources, creating a phenomenon called a feedback loop. A very common example of a feedback loop is recommender systems: let’s say a marketplace sells many items, and there is a block on the website titled “You may also be interested in these goods.” As a naive baseline, the company could put in the items that are most popular overall. But when this baseline is replaced with an ML system, new items appear, and old leaders lose their popularity in favor of new recommended items. With the poor design of such a system, there is a chance for an item that may not be that good to gain its place among popular ones and dominate for a long time. Storing the information when the sample was generated and what versions of related systems were live at the moment is crucial (not enough, though!) to avoid feedback loops.

Another important scenario related to metadata is stratification, a process of data sampling that shapes the distribution of various subgroups. For example, say a company started its operation in country X, gathering a lot of data related to customers from X. Later the company reached a new market, Y, and has a goal of providing the same level of service, including accuracy of ML models, for customers in both countries. It will require representing customers from X and Y in training and test datasets with proper balance.

Stratification is crucial for validation design and resisting the algorithmic bias, and both topics will be covered in future chapters.

6.4 How much data is enough?

After everything we’ve mentioned on the importance of datasets, inexperienced ML practitioners may think that building a data pipeline that will stream numerous samples is enough to build a good system. Well, that’s where we can’t say confidently yes or no.

First, not all data samples are equally useful. Growing the dataset size is only useful if new objects help the model learn something new and related to the problem. Adding samples very similar to existing ones is pointless; the same goes for extremely noisy data. That’s why there is a popular research avenue dedicated to finding the most samples for dataset extension called active learning. The simplest intuition is to enrich the dataset with samples where the model was incorrect (this signal is often fetched using humans in the loop processed) or demonstrated the lowest confidence. The academic community has developed tens of methods related to active learning; see a recent survey (e.g., “A Comparative Survey of Deep Active Learning” by Xueying Zhan et al., ) for more information.

One more aspect related to dataset size is sampling. While in most cases the question we ask ourselves is “How do we get more data?” sometimes it is “What part of data should I use to increase efficiency?” It usually happens when a dataset needs no manual labeling and is generated by a defined process—for example, the clickstream in popular B2C web services (search engines, marketplaces).

Sampling is effective when a dataset is not only huge but also tends to be imbalanced and/or may contain a lot of duplicates. Different strategies can be applicable here, and the most common of them are based on stratification, where you split data into groups (based on key features or algorithmically defined clusters) and limit the amount of data used for each group. The amounts don’t have to be equal—for example, if the temporal component is relevant to your problem, it’s very likely you will want to prioritize fresh data over old groups.

Another point is about data noisiness. A while ago there was a strong consensus in the ML community that a few clean samples are better than multiple noisy samples. Things have changed recently, however; the progress with large models like GPT-3 and CLIP demonstrated that at a certain scale, manual filtering is almost impossible (processing a dataset of 500 billion text tokens or tens of millions of images would cost a fortune), but using a huge amount of weakly or self-supervised (automatically, using heuristics) data works, so for some tasks massive imperfect datasets are more suitable than smaller hand-picked ones.

NOTE While some amount of noise is safe to have in the training data, it’s way less acceptable for validation/test datasets. We’ll cover it more in chapter 7.

You could be expecting the model’s performance to be improved as a square root of dataset size asymptotically, no matter what metric is being used. This estimation is very rough and does not have to fit your problem exactly, but it may give you some intuition (see figure 6.3).

Not every model can be improved by adding new data because of the uncertainty of origin. Uncertainty primarily comes from two central origins:

Lack of knowledge or information (known as epistemic uncertainty)
Uncertainty found in the given data itself (also called aleatoric uncertainly)

Aleatoric uncertainty emerges primarily due to the existing complexity in the data, like overlapping classes or additive noise. A critical characteristic of data uncertainty is that no matter how much additional training data gets collated, it does not reduce.

Figure 6.3 A plot showcasing model metric improvements over dataset size for a real project. Each line demonstrates the dynamics of the target metric for a single task once the dataset size is growing.

On the other hand, epistemic uncertainty occurs when the model encounters an input located in a region that is either thinly covered by the training data or falls outside the scope of the training data (see figure 6.4).

Figure 6.4 A schematic view of the main differences between aleatoric and epistemic uncertainties (source: “Explainable Uncertainty Quantifications for Deep Learning-Based Molecular Property Prediction” by ChuI Yang and YiPei Li, )

Once you have gathered some data and formed a training pipeline (see chapter 10 for details), it unlocks an option of making an informed decision on how much more data you need and estimating the economic efficiency of new data. It is especially reasonable when working with expensive data (e.g., labeled by highly skilled professionals). A high-level algorithm is the following:

Split the dataset into buckets so that the bucket size is close to uniform and similarity between samples of different buckets is maximized. This way, each sample for a user goes to the same bucket determined by a user ID (this brings us back to the stratification aspect mentioned earlier).
Fix the computational budget if applicable. No matter what the dataset size is, you train the model for N batches.
Train the model on data subsets in a range from a tiny share of the dataset to the whole thing, ensuring the subset used is cumulative (e.g., if sample X was used when training on a 10% subset, it must be used on 20%, 30%, etc.).
Calculate key metrics for each trained model and make a plot using a horizontal axis for dataset size and a vertical axis for the metric itself.
With some imagination enabled, extrapolate how much more data is required to squeeze another 1% of metric change.

The precision of this metric is low, and it doesn’t take systematic aspects into account (e.g., concept drifts that we’ll discuss in chapter 14), although even this precision is often crucial to making an important decision on how much we should invest in the data labeling pipeline.

6.5 Chicken-or-egg problem

One of the hardest problems you may face with regard to datasets is a problem of a cold start (aka chicken-or-egg problem): when you need data to build a system but the data is not available until the system is launched. What a terrible loop to get stuck in!

It is often a problem for startups or companies trying to launch products in new markets or verticals. And, as often happens in the ML world, the go-to solution is approximation. Since we don’t have data perfectly matching our problem, we need to find something as close as possible. What data will be close depends on the problem, so let’s look at some examples.

For this example, let’s imagine a company focused on employee safety that builds products monitoring how workers follow safety rules in various environments. The new product should check if hard hats were worn on factory grounds. When the product is available, there will be a lot of data coming from customers who agree to share it, but before that, customers’ cameras are not available for the company.

So how do we approach the problem?

Approach 1 —Let’s use the computer graphics world. We take a stack of 3D human models and render them on the factory backgrounds, both wearing hats and not.
Approach 2 —Let’s use public sources. We look for relevant images in public sources (from Google images to photo banks) and scrape or buy them.
Approach 2.5 (hybrid) —Let’s do a crossover: we take some images of people from the factory and draw/render hats on some of them.
Approach 3 —Let’s try acting! Your team buys hard hats, goes to an abandoned factory, and runs a photo session with hats on.
Approach 4 —Let’s get lazy. We find a public dataset with people wearing very similar hats but outdoors, on a construction site—not too close but still something.
Approach 5 —Let’s build a very naive baseline with no actual ML under the hood and suggest it to the customer to break the loop. An example of such a baseline could be finding human heads using an existing face detector and then cropping those faces and adding a simple heuristic trying to localize a hat (e.g., a bright enclosed blob).
Approach 6 —Instead of using a naive baseline, we integrate a solution provided by a large vendor if it is available. This option works better if vendors’ labels can be reused from the legal perspective. However, proxying a vendor’s output can provide you with initial unlabeled data that has its own value.

These examples may not always be reasonable and applicable, but they represent several ways of solving this problem:

Generating synthetic data
Using available data from similar situations
Creating data manually
Taking data from a similar problem and trying to adjust it
Use a dummy baseline model or third party to bootstrap

We should mention that not every scenario is applicable to every problem. You obviously won’t build a medical system for lung cancer detection using images of brain scans, and a naive baseline as a medical advisor is no good at all. But training on scans from other hospitals using the same equipment may be a good idea to consider; while every scanner will be calibrated slightly differently, together they can provide some generalization (a model trained on data from hospitals A, B, and C is likely to be useful for hospital D). Scraping other websites is rarely greenlighted in a respectful public company, although it is a popular technique among small startups. Synthetic images obtained by a straightforward rendering pipeline are usually not the way to go because they’re not realistic enough (imperfect lighting, shadows, etc.). But with some secret sauce, you can make them realistic. In the paper “MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision” (Hou et al., 2020, ), researchers rendered objects on top of augmented reality recordings (with accurately estimated lights) to make images way more realistic and thus valuable for a model.

In every scenario, we need to keep in mind that data is not a representative sample of the real distribution we’re going to work with when the system goes live. This means validation results should be taken with a grain of salt, and replacing such proxy datasets with more realistic ones is a top priority. Let’s talk about it when discussing the properties of a healthy data pipeline.

6.6 Properties of a healthy data pipeline

Data gathering and preprocessing need to have three important properties:

Reproducibility
Consistency
Availability

Let’s review the properties one by one.

Reproducibility means you should be able to create a dataset from scratch if needed. There should be no golden data file in your storage that is crafted by a wise engineer with a bit of dark magic. But there must be a software solution (a simple script, more advanced pipeline, or big system) that allows you to create such files again using the same origin as used before. This solution needs to be documented and tested well enough so every project collaborator is able to run it when needed.

The reasoning behind reproducibility is similar to the infrastructure-as-code paradigm popular in the DevOps community. Yes, it may seem easier to create an initial dataset manually in the first place, but it’s not scalable in the long run and is very error-prone. Next time—when a new batch of data is going to be added—you may forget to run a preprocessing step, thus causing implicit data inconsistency, a hard-to-detect problem that will affect the whole system’s performance.

Consistency itself is the key. ML problems often have situations when labels are partially ill-defined, and defining a strict separating plane is impossible. In such cases, experts who do labeling tend to disagree with each other.

A very demonstrative example from our background is related to the credit card transactions classification problem. Customers are expected to give each transaction a label describing the purpose of the expense. The initial label taxonomy contains “food and drink,” “bars and wineries,” and “coffee houses.” A rhetorical question arises: What label is more suitable for payment for a coffee and a sandwich ordered in a wine bar? A partial solution for this example would be to have some kind of protocol on how ties are to be broken and how blurry boundaries are resolved (e.g., “when a merchant type and purchased item type are conflicting, the first to be opened takes precedence”)—it greatly simplifies model training and especially validation.

Consistency is not only about labels. All aspects of data should be consistent: what the data origin is, how data is preprocessed, what filtering is applied, etc. It is relatively easy when the system is being built within a small company and way more challenging for an international corporation, and some formal definitions can be helpful. Once you feel there is a chance of misreading a core term used in the data pipeline, it may be useful to add it to the design doc to make sure the whole team is on the same page.

One more aspect of consistency is how data is gathered at the system-building stage and what the system inputs are during system usage. It is a common problem, and in ML terms, it leads to distribution mismatch, a problem affecting model performance and fairness that is not easy to detect before the system goes live.

We recently mentioned the feedback loop phenomena—the model affects data origin and thus violates the consistency assumption. We also described situations when data is not available until the system launch, and this also may lead to a mismatch. Problems like this are still among the most challenging aspects of applied ML, and you can never ignore them while designing a system.

Data consistency is always an open question, so you should care about it at every stage of ML system development, from initial drafts to long-term maintenance. We will get back to this aspect in chapter 14 when discussing monitoring and drift detection (there will even will be a short but rather cool campfire story from one of our friends).

Last but not least is availability. This property is actually an umbrella for two ideas: availability for a system and availability for engineers. The first can also be named reliability; a system designer should be very critical of nonreliable data sources. Here is a good situation for a negative example—one where things can ultimately go wrong. Imagine a system that depends on a third-party API enriching your data stream, and everything was good until it wasn’t—a small company powering the API lost its key site reliability engineer and thus the ability to handle the infrastructure. If your system’s dependency on this data source is critical, their problems become your problems.

Of course, it doesn’t mean using third-party APIs is not an option. As we mentioned earlier, using external solutions is often a good practice—and not only provided by giants like Amazon, Google, and Microsoft. But data availability is important, and these risks should be taken into account seriously. There are some services provided by vendors that can be less reliable (we can’t imagine a top-priority problem caused by the outage of an experimental visualization tool), but data sources are not one of them.

The same philosophy goes for internal systems (e.g., those controlled by your peers from other teams). There are different control gears for external and internal systems: for the former, service-level agreements are used to estimate risks, while internal systems exist only in mature organizations. At the same time, in smaller companies, it is easier to be aligned with a team maintaining the system and effectively reducing related risks.

It is also worth mentioning that problems related to data availability are not strictly software related. Even if all related systems are built and maintained properly, there may be problems of more sophisticated origins—for example, caused by information security and privacy rights (such as new legal regulations that affect using personal data or your key customer CEO’s decision that their data should not leave their infrastructure anymore).

Availability for engineers can be overestimated: you can hear something like “Come on, our engineers can fetch the data even if it’s not a simple operation—they are professionals and will handle the tech difficulties, if any.” Most likely it is true—we assume our readers and their colleagues are outstanding professionals, but you can’t ignore time constraints. Imagine that an engineer went for lunch with their buddy from another team; they chatted about work-related stuff, and after a cup of coffee, a new hypothesis related to the system of current interest popped up.

If data is attainable easily, you can make an informative decision relatively quickly by pulling the dataset, aggregating some statistics, and maybe even running basic experiments. If the idea gets confirmed in the first naive approach, it can be prioritized higher, meaning more resources should be allocated. Who knows, maybe it is going to be a significant improvement in the future.

Otherwise, if pulling data and making data-driven decisions is time-consuming, the hypothesis is likely to be ignored (“Well, it might be interesting, but I have so many things to do and finding the data will take so long!”). To avoid these cases, we recommend dedicating a share of engineering efforts to building tools that will make datasets more available to people who should work with data. While it is somewhat applicable to any engineering productivity tools, return on investment is especially high when the tools improve data availability. As we said in the very beginning, it’s hard to overestimate the importance of quality data for ML systems, and thus smoothing interactions with the datasets is a good investment in the long run.

This is sufficient depth in the information on consistency and availability at this point in the book. We will return to these two properties of the pipeline in greater detail later in chapter 10. Meanwhile, it’s time to move on to the practical part of the chapter: the design document.

6.7 Design document: Dataset

As we describe steps for the data problem, it organically leads us to additional questions to be answered in the design document. The following is a checklist of questions we suggest asking yourself at this point.

ETL:
- What are the data sources?
- How should we represent and store the data for our system?
Filtering:
- What are the criteria for good and bad data samples?
- What corner cases can we expect? How do we handle them?
- Do we filter data automatically or set up a process for manual verification?
Feature engineering:
- How are the features computed?
- How are representations generated?
Labeling:
- What labels do we need?
- What’s the label’s source?

As often happens, answering these questions can spawn even more questions. Give yourself the freedom to think about those. Time dedicated to answering data-related questions always has an outstanding return on investment in ML systems design.

6.7.1 Dataset for Supermegaretail

Now let’s go back to preparing the design document. This time we are preparing a section devoted to datasets for our imaginary companies. As usual, we’ll start with Supermegaretail.

Design document: Supermegaretail

III. Dataset

The atomic object of the dataset is a bundle of (date, product, store), and the target variable we aim to predict is the number of units sold.

i. Data sources

There are multiple sources of data we can utilize for our objective.

Inner sources

Historical data on purchases (i.e., transaction history) is collected from the chain of Supermegaretail stores and saved to a centralized database. It will be our primary source of truth: the number of sales, the amount of money spent, the discounts applied, transaction ID, and so on.
Stock history is the second most important source of truth for our problem since it directly determines how many units of each product can be sold in each store. This source can help estimate how many products are available for sale at the beginning of each day and how many were expired and withdrawn from sale.
Metadata of each product, store, and transaction.
Calendar of planned promo activities. This is a significant factor affecting future sales that definitely needs to be taken into account.

Outer sources: Manually gathered data

Price monitoring. Prices and other product info collected from our competitors. They are manually gathered daily from a subset of stores of different competitors. It could be done either by our in-house team or by a third party (outsourced). A hybrid approach also can take place. Each product should also contain a global product identifier (barcode) so we can easily match collected data with our product. Knowing aggregated competitors’ prices and their dynamics helps us understand what is happening in the market.

Outer sources: Purchased data

Weather history and forecast we buy from the meteorological service. Weather is an important factor directly affecting consumer behavior.
Customer traffic estimation near our stores (from telecom providers).
Global market indicators.

Mobile app and website data (optional)

Supermegaretail has a delivery service (even if it generates less than 5% of revenue). We will collect additional data about specific sales in a specific location. Sometimes this information can be a valuable predictor.
Also, mobile and web services collect implicit feedback about user activity, including views, clicks, or adding to the cart, which also can predict sales in physical stores.

ii. Data labeling

Since we are dealing with a demand forecasting problem, we don’t need extra data labeling derived directly from the transaction history.

iii. Available metadata

We forecast demand based on the SKU per store level with three key elements: products, stores, and transactions.

Products

Product ID and barcode.
Category codes of different levels (1, 2, 3). We can use a hierarchy of categories for a rule-based measurement of similarity between products. Also, other categorical info like brand, manufacturer, or pricing group.
Shelf life, which determines how bad it is to overpredict sales for this product.
Date when the product was added to the assortment matrix of the chain.
Dimensions and weight of the product.

Stores

Store ID.
Location (coordinates), with the support of third-party sources—we can use it to add information about the weather, flow of people, and distance to critical points, as well as other related things like city, region, and associated logistics center.
The nearest competitors’ stores (with their IDs and distances).
The size of the store and its format. They determine which products and how many unique products will be in the assortment at this store.
The dates when the store is open and closed.

Transactions

Timestamp. This allows us to enrich the dataset with things like holidays.
Customer ID (if a loyalty card is applied). Despite the fact that the final unit of the dataset is (product, store), a bundle of (customer, product) can be used in a separate data pipeline for calculating product embeddings via aggregating transactions to a user-item matrix and its factorization. The embeddings will contain patterns of purchasing behavior.
Store ID and product ID.

iv. Available history

Demand forecast is nothing new for Supermegaretail. Critical ETL processes are already in place. Supermegaretail has been collecting data for more than 3 years.

This history is essential for our forecasting model to learn patterns, catching the seasonality of sales, estimating trends, etc. The same is applicable for products and stores metadata. Weather data (which we take from external sources) has been available for a period in the past as long as we need.

Stocks history and promo activities have been gathered as well.

Price monitoring data of competitors has been collected for the last 2 years.

v. Data quality issues

Transactions, stock, and promo data may contain missing or duplicated values, so additional filtering or preprocessing is required before aggregation.

The external data we bought has already been cleaned and passed some quality control before coming to us. However, necessary checks need to be implemented.

The competitors’ prices cover about 25% of SKU and have gaps.

vi. Final ETL pipeline

The top-level scheme is as follows:

The transactions data is aggregated daily.
Newly aggregated partition is added to the table of transaction aggregates.
(Optional) We rewrite not only the last day but the previous 2 to 3 days to fix possible corruptions in the data (duplicates, incomplete data, duplicated data, and so on).
We join other sources of internal/external data based on data, product ID, or store ID.
Finally, we calculate features based on the joined dataset.

Optionally, we can add a data pipeline for product embeddings, as described in section III, if needed.

6.7.2 Dataset for PhotoStock Inc.

Let’s now switch to the PhotoStock Inc. design document.

Design document: PhotoStock Inc.

III. Dataset

i. Dataset and sources

One potential data source that can be used to gather information for the PhotoStock Inc. search engine is the data associated with each photo in the stock library. This data may include information such as tags, labels, and descriptions associated with the photos, which can provide valuable context about the content of the photos. This dataset should also contain URLs of actual photos and URLs of thumbnails we used. We suggest naming this dataset “description dataset.”

Another potential data source is the search queries that users have submitted to the PhotoStock Inc. platform. These queries can provide insight into the types of photos that users are looking for and can be used to help guide the development of the ML models. When combined with user clicks, they provide a strong signal of relevancy and user interest. An extension of this dataset may contain further information about sessions related to these clicks: how much time did a user spend on the photo page, and did they purchase it afterward? We suggest naming this dataset “clicks dataset.”

Additionally, we can hire labelers to manually assign relevancy scores to pairs of queries and images. The labelers will be given a set of search queries and a selection of images from the PhotoStock Inc. library and will be asked to assign a relevancy score to each query–image pair based on how closely the image matches the content of the query. Initially, query–image pairs should be selected randomly from the pool of past user queries and available photos in the stock library; however, it could be improved later to involve some active learning approaches and getting signals for the pairs where the model is less confident or tends to have more errors. We suggest naming this dataset “labels dataset.”

We could have considered some public text/image datasets (e.g., COCO—Microsoft Common Objects in Context) for model bootstrapping. However, given that there is a solid history of user interactions, it should be fine to use our own data, while the public datasets will be only used indirectly—we can start with models pretrained on such datasets earlier.

Description and clicks datasets are generated organically by the main flow of the PhotoStock Inc. business. So it doesn’t require too much attention from us right now, except for building proper ETL process and storage. The labels dataset needs to have its own budget given that it requires hiring a labelers team or using a third-party service.

ii. Metadata, filtering, and subsampling

The description dataset is expected to be clean enough given that photo descriptions are moderated by the content quality team, as their quality is already a core part of our product. The clicks dataset is expected to be noisy because of different patterns in user behavior and the existence of scraping bots; we need to suggest heuristics to handle these problems. The extended version, with data on purchases, can be less noisy but also way smaller considering the conversion rate from click to purchase. Given these aspects, we expect the click+purchase and labels datasets to be more valuable for validation, while description and clicks without purchases are potentially more noisy and thus more suitable for training.

To address the quality problems with filtering, we should take care of metadata. The description dataset should be annotated with seller information and change dates; the clicks dataset should be annotated with user information, clicked item seller information, and additional search session information. At this point, there is no need to aggregate all the information about user and seller information. It may be part of further feature engineering steps, but at the very least we need to be sure to store relevant user_id, seller_id, session_id, and purchase_id to keep the ability to join this data for later use.

Given the volume of searches and purchases at PhotoStock Inc., it’s not likely we need additional subsampling for purely engineering needs; full datasets can be processed. However, we may need to run subsampling to adjust class balances: there are way more items clicked than purchased, shown at the search engine result page than clicked, etc.

iii. ETL and data preparation

Training data should be fetched on a regular (daily for now) basis in a batch manner. We suggest using the Flyte framework to orchestrate the jobs because it’s already a framework of choice for other batch jobs in PhotoStock Inc.

No fancy stuff is required here—just gathering the data from production databases and storing it separately in the form of Parquet files for a simpler read should be enough. We don’t expect any need for sophisticated preprocessing for now.

iv. Labeling

It’s not clear what platform we should prefer for human labeling; it’s also an open question whether we should use an in-house team (e.g., our customer support and moderators teams) or hire a third-party service. This question needs to be addressed together with these team managers. Also, we should get cost estimates from some third-party services to compare the costs of internal labeling with them.

We need to label pairs (query, image) and divide them among three classes:

Relevant
Irrelevant
Can’t answer

We should be ready to split “Relevant” into “Very relevant” and “Somewhat relevant” for higher granularity; the same goes for “Irrelevant.” As for “Can’t answer,” we need to require labelers to type a reason why they can’t label it—first iterations of labeling can provide us with newer insights on the dataset.

Summary

Don’t limit yourself to just one data source. Determine if you have enough internal sources or if you need to look outside the ecosystem of your business and expand your search range.
Access to unique datasets can give you a significant competitive advantage. In their turn, datasets that are easier to acquire can bring great value if “cooked” properly.
When working on datasets, don’t neglect metadata, which is crucial for data flows and guarantees its consistency.
Keep in mind that there is no strict sequence of techniques when preparing a dataset. The final order of operations will depend on the business goals of your system, the domain in which you operate, and other factors.
Avoid situations where a dataset is populated with samples similar to those already present. Instead, fill the dataset with samples where the model did not work correctly.
Make sure you have data that you can feed to the system before running it.
Remember that the data pipeline must meet three criteria: reproducibility, consistency, and availability.