Book: Machine Learning System Design
Previous: 5 Loss functions and metrics
Next: 7 Validation schemas

6 Gathering datasets

This chapter covers

  • Data sources
  • Turning raw data into datasets
  • Distinguishing data from metadata
  • Defining how much is enough
  • Solving the cold start problem
  • Looking for properties of a healthy data pipeline

In the preceding chapters, we’ve covered the essential steps in preparing to build a machine learning (ML) system, including exploring the problem space and solution space, identifying risks, and finding the right loss functions and metrics. Now we will talk about an aspect without which your ML project simply won’t take off—datasets. Just as you need fuel to start your car or a nutritious breakfast to get a charge before a busy day at work, an ML system needs a dataset to function properly.

There is an old popular quote about real estate: the three most important things about it are location, location, and location. Similarly, if we were to choose only three things to focus on while building an ML system, those would be data, data, and data. Another classic quote from the computer science world says “garbage in, garbage out,” and we can’t doubt its correctness.

Here we’ll break down the essence of working with datasets, from finding and processing data sources to properly cooking your dataset and building data pipelines. As a culmination of the whole chapter, we will look at datasets as a part of design documents, using the examples of Supermegaretail and PhotoStock Inc.

6.1 Data sources

You can use absolutely any source to find data for your dataset. The availability and quality of these sources will depend on your work environment, your company’s legacy, and many other factors. Which sources should be addressed first will depend mostly on the goals of your ML system. Here we list the most popular data sources and their categories, accompanied by real-world examples:

Some data sources are unique, and having access to them may be a significant competitive advantage. Many big tech companies like Google and Meta are successful mainly because of their valuable user behavior data used for ad targeting. On the other hand, other datasets are easy to acquire; the information is either free to download or available at a non-prohibitive price (many datasets sold by data providers are relatively cheap). This doesn’t mean that cheap equals low quality, as it all depends on what kind of data you need. It might turn out that a particular free-of-charge source fits your ML system perfectly. There are also intermediate points on this spectrum, though not in terms of price: access to data can be limited in some regions or sit in a “gray zone” legality-wise. Mature companies tend to follow the law (and we recommend you do so as well!), while young startups with the YOLO mentality sometimes consider minor violations.

Some datasets become valuable when enriched or annotated/labeled. Annotation means combining a raw dataset with proper labels or, in other words, creating a closely tied labeled dataset connected to the original one. A popular pattern is mixing a unique proprietary dataset with a public data source and getting a much more valuable dataset as a result.

Ad tech companies, mentioned earlier, may benefit from joining datasets as well. Let’s take a look at a classic example. A company operates a free-to-play game, which means it has a lot of players, and only some of them pay money. The number and list of paying customers are kept secret and available only to the game’s publisher. At the same time, its partnering ad network has millions of detailed user profiles derived from behavior. When these datasets are combined (see figure 6.1), it opens a great marketing opportunity: the company can target its new ads at potential customers who are similar to its paying players. Data exchanges like this increase the efficiency of online marketing and thus are one of the powers driving the modern web.
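The join described above can be sketched in a few lines. The user IDs, interest lists, and naive overlap scoring below are purely hypothetical stand-ins for the lookalike models real ad networks use:

```python
# Toy sketch: join a game publisher's private list of paying players
# with an ad network's behavioral profiles to build a seed audience.
paying_players = {"u1", "u3"}  # publisher's private data

ad_profiles = [  # ad network's data
    {"user_id": "u1", "interests": ["strategy", "esports"]},
    {"user_id": "u2", "interests": ["cooking"]},
    {"user_id": "u3", "interests": ["strategy", "hardware"]},
    {"user_id": "u4", "interests": ["strategy", "esports"]},
]

# Join: profiles of known payers become positive examples ("seed"),
# and the remaining users are candidates to be scored.
seed = [p for p in ad_profiles if p["user_id"] in paying_players]
candidates = [p for p in ad_profiles if p["user_id"] not in paying_players]

# Naive similarity: overlap of each candidate's interests with the seed.
seed_interests = set().union(*(p["interests"] for p in seed))
scored = sorted(
    candidates,
    key=lambda p: len(seed_interests & set(p["interests"])),
    reverse=True,
)
print(scored[0]["user_id"])  # the most "payer-like" non-paying user: u4
```

In production, the overlap score would be replaced by a trained lookalike model, but the data-joining step stays conceptually the same.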

figure
Figure 6.1 Combining data sources of two completely different businesses can eventually benefit both.

When we’re talking about joining datasets, it’s not always a straightforward connection between similar datasets (like a SQL join between two tables). An important concept here is multimodality, the intersection of various modalities in a dataset. Simply put, a modality is a kind of information we receive; it is common to describe the world as multimodal for humans (we hear sounds, see colors, feel motion, etc.). In ML-related literature, multimodal datasets are those that combine various kinds of data to describe one problem. Think of images of an item being sold and its text description. Combining datasets of different origins and modalities is a powerful technique.

Speaking about combining data sources, Arseny once helped kickstart a startup as an advisor. The company worked in the agri-tech area, helping farmers and related companies increase their operating efficiency, and its secret sauce was based on datasets. The way it worked was as follows. One data source was public: satellite images of the planet available thanks to several space initiatives like Landsat by NASA and Copernicus by ESA (where you can fetch countless images of agricultural land). But those images alone could not bolster the startup’s efficiency, as they lacked information describing these agricultural lands. The main problem is that most agricultural companies are far from innovative, and there is no single solid source of data on what crops were grown, the yield results, and more. Such data is poorly digitized but really valuable for multiple business needs: it can be used to reduce the amount of fertilizer applied, to estimate future prices of food commodities, and so on. Eventually, the team implemented smart ways of gathering such data and merging it with the huge photo database. An ML system built on these joint datasets helped the company grow rapidly.

Defining data sources and the way they interconnect is the first cornerstone of solving the data problem for the system. But raw data is often almost useless until we make it available for the system and ML model, filter it, and preprocess it in other ways.

6.2 Cooking the dataset

Experienced engineers who work with data know that in the vast majority of cases, data in its raw form is too raw to work with effectively. Hence the name raw data: a chaotically compiled, unorganized giant clump of information. Thus, initial raw datasets are rarely in good enough shape to use as is. You need to cook the dataset to apply it to your ML system in the most efficient way possible.

We’ve gathered a list of techniques you can use to properly cook your dataset, with each technique presented in a separate subsection. This list is not strictly ordered, and the order of actions may vary depending on your domain, so there is no single universal answer. In some cases, you can’t filter data before labeling, while sometimes filtering happens multiple times throughout the cooking process. Let’s briefly touch on these techniques.

6.2.1 ETL

ETL, which stands for “extract, transform, load,” is a data preparation phase, such as fetching information from an external data source and tailoring its structure to your needs. As a simplified example, you can fetch JSON files from a third-party API and store them in local storage as CSV or Parquet files.
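A minimal sketch of this simplified example, with the records inlined instead of fetched from a hypothetical third-party API (a real pipeline would write Parquet via a library such as pyarrow; plain CSV keeps the sketch dependency-free):

```python
import csv
import io
import json

# Extract: JSON records as they might arrive from an external API.
raw = '[{"id": 1, "price": "9.99"}, {"id": 2, "price": "4.50"}]'
records = json.loads(raw)

# Transform: enforce types and a fixed column order.
rows = [{"id": int(r["id"]), "price": float(r["price"])} for r in records]

# Load: write a tabular file (a Parquet writer would slot in here).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

The key point is the shape of the phase, not the formats: extract from an external source, transform into your schema, load into your own storage.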

Data availability at this point implies two things:

The important question here is: “Should I care about data storage and structure at this point?” The answer to this question lies in the following spectrum:

6.2.2 Filtering

No data source is absolutely perfect. By perfect, we mean clean, consistent, and relevant to your problem. In the short story at the beginning of this chapter, we mentioned APIs that provide satellite images, but the company only needed those that weren’t too cloudy and were related to agricultural regions; otherwise, storing irrelevant data would significantly increase the costs. It meant that a good chunk of preliminary work of selecting appropriate satellite photos had to be done as the very first step.

Data filtering is a very domain-specific operation. In some cases, it can be done fully automatically based on a set of rules or statistics; in others, it requires human attention at scale. Experience shows that the end approach normally lies somewhere between those two extremes. A combination of the human eye and automated heuristics is what works best, with the following algorithm being a popular approach: look through a subset of data (either randomly or based on some initial insight or feedback), find patterns, reflect them in the code to extend coverage, and then look through a narrower subset.
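The codify-the-patterns step of that loop often ends up as a small rule-based filter. Here is a sketch for the satellite-imagery case; the field names and thresholds are hypothetical:

```python
# Rule-based filter encoding patterns a human reviewer found:
# drop cloudy images and images outside agricultural regions.
def keep(sample, max_cloud=0.3):
    if sample["cloud_coverage"] > max_cloud:
        return False
    if not sample["is_agricultural_region"]:
        return False
    return True

samples = [
    {"id": "a", "cloud_coverage": 0.10, "is_agricultural_region": True},
    {"id": "b", "cloud_coverage": 0.80, "is_agricultural_region": True},
    {"id": "c", "cloud_coverage": 0.05, "is_agricultural_region": False},
]
filtered = [s for s in samples if keep(s)]
print([s["id"] for s in filtered])  # ['a']
```

In practice, each review pass adds or tunes rules like these, and the remaining hard cases go back to the human eye.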

While the absence of data filtering leads to extensive noise in the dataset and thus worse performance of the whole system, overly aggressive filtering may have a negative effect as well: in some cases, it may distort data distribution and lead to a worse performance on real data.

6.2.3 Feature engineering

Feature engineering means transforming the data view so it’s most valuable for an ML algorithm. We’ll cover the topic in more detail later in the book when it’s time to discuss intermediate steps, as it’s rarely detailed during the early stages of ML system design. For now, we focus on the question of how to get initial data for a baseline model, which requires only some level of abstraction.

Sometimes features are not engineered manually but created by a more complicated model; unlike “regular” features, they can be non-human-readable vectors. In these cases, it’s more precise to use the term representations, although at some level of abstraction, they’re the same thing: inputs to the ML model itself.

For example, in a big ML-heavy organization, there may be a core team building a model that generates the best possible representations of users, goods, or other items at scale. Their model doesn’t solve business problems directly, but applied teams can use it to generate representations for their specific needs. That is a popular pattern when we’re talking about such data as images, videos, texts, or audio.

6.2.4 Labeling

In many scenarios, a dataset itself is not too valuable, but adding extra annotations, often known in the ML world as labels, is a game changer. Deciding what kind of labels should be used is extremely important, as it dictates many other choices down the road.

Let’s imagine you’re building a medical assistance product, a system that helps radiologists analyze patients’ images. It is one of the most popular ML applications in medicine that is still very complicated to design properly. A use case may seem simple: a doctor looks at an image and judges if there is a malignancy, so you want doctors to label the dataset.

There are numerous ways to label it (see figure 6.2), including

figure
Figure 6.2 Various approaches to data labeling

These ways of labeling data require different labeling tools and different expertise from the labeling team. They also vary in terms of time required. The method affects what kind of model you can use for the problem, and, most importantly, it limits the product. Making the right decision in this case is impossible unless you follow the steps from chapter 2 and gather enough information about product goals and needs.

In some situations, labeling data with the proper level of detail is nearly impossible. For example, it could take too much time, there may not be enough experts to cover enough data, or the toolset is not ready. In these cases, it is worth considering using weakly supervised approaches that allow the use of inaccurate or partial labels. If you’re not familiar with this branch of ML tricks, we recommend “A Brief Introduction to Weakly Supervised Learning” () for an overview; more links for further reading are at Awesome-Weak-Supervision ().

Efficient data labeling requires a choice of a labeling platform, proper decomposition of tasks, task instructions that are easy to read and follow, easy-to-use task interfaces, quality control techniques, an overview of aggregation methods, and pricing. Additionally, best practices should be followed when designing instructions and interfaces, setting up different types of templates, training and examining performers, and constructing pipelines for evaluating the labeling process.

When datasets need to be labeled manually, there are two main ways to go: an in-house labeling team or third-party crowdsourcing services. In a nutshell, the former provides a more controlled and consistent labeling process, while the latter can scale up the annotation process quickly, possibly at a lower cost. The simplest heuristic of choosing one over another is based on the labeling complexity: once it requires specific knowledge or skills, it makes sense to develop an in-house team of curated experts; otherwise, using an external service is a good option.

There are dozens of crowdsourcing platforms available; the most popular is probably Amazon Mechanical Turk. As we opt for width over depth in this book, we will not focus on features of different platforms. Instead, we focus on generic properties of labeling with crowdsourcing:

In the 18th century, the French mathematician and philosopher Marquis de Condorcet stated a theorem applicable to political science. De Condorcet revealed the following idea: if a group needs to make a binary decision based on a majority vote and each group member is correct with probability p > 0.5, the probability of a correct group decision approaches 1 as the number of group members grows. This served as a formal argument for collective decision-making in the early days of the French Revolution, and now the very same flawless logic is applicable to ML systems!

Let’s imagine we have three labelers annotating the same object for binary classification. The problem is tricky, so each one is right in only 70% of cases. We can then expect a majority vote of three labelers to be correct in 0.7 * 0.7 * 0.7 + 0.3 * 0.7 * 0.7 + 0.7 * 0.3 * 0.7 + 0.7 * 0.7 * 0.3 = 78.4% of cases. That’s a nice boost!
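The same computation generalizes to any odd number of labelers via the binomial distribution; a minimal sketch:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent labelers,
    each correct with probability p, produces the right label."""
    assert n % 2 == 1, "use an odd number of voters to avoid ties"
    need = n // 2 + 1  # smallest majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1)
    )

print(round(majority_vote_accuracy(0.7, 3), 3))  # 0.784, as computed above
print(round(majority_vote_accuracy(0.7, 5), 3))  # higher still with 5 voters
```

Adding voters keeps improving accuracy, but with diminishing returns, which is why cost-aware modifications like the one below are popular.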

These numbers should be taken with a grain of salt. Condorcet’s jury theorem relies on assumptions that are not exactly met here: labelers are not completely independent, and their errors are correlated (this may be the case when data samples are noisy and hard to read). Still, an ensemble of labelers is more accurate than a single labeler, and this technique can be used to improve labeling results.

This idea can be modified to decrease costs. One algorithm modification we have faced is the following:

With this modification, more votes are used for complicated samples that lack consistency, while fewer votes are required for simple cases.
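Here is one possible implementation of such an adaptive scheme. It is a sketch of the behavior described above, not the exact algorithm we faced: start with a small overlap and request extra votes only while labelers disagree, up to a cap.

```python
def adaptive_label(get_vote, quorum=2, max_votes=5):
    """Request votes until `quorum` labelers agree or `max_votes` is hit.

    get_vote() returns one labeler's answer per call.
    Returns (winning label, number of votes spent).
    """
    votes, counts = [], {}
    while len(votes) < max_votes:
        v = get_vote()
        votes.append(v)
        counts[v] = counts.get(v, 0) + 1
        if counts[v] >= quorum:  # enough agreeing votes, stop early
            return v, len(votes)
    return max(counts, key=counts.get), len(votes)

# Simple case: two labelers agree, no extra cost.
easy = iter([1, 1])
print(adaptive_label(lambda: next(easy)))  # (1, 2)

# Hard case: disagreement triggers an extra vote.
hard = iter([1, 0, 0])
print(adaptive_label(lambda: next(hard)))  # (0, 3)
```

The average cost per sample drops because most samples are easy and resolve at the minimum overlap.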

The heuristic described here is a simple though powerful baseline. However, if data labeling is a key ingredient for your problem and you see the need to invest some effort, there is a lot of research dedicated to better design. Some of the materials we recommend are “A Survey on Task Assignment in Crowdsourcing” () and a tutorial by crowdsourcing vendor Toloka.ai ().

Labelers’ consistency and mutual agreement may and should be measured. While ML practitioners often tend to use regular ML metrics to estimate it, those who have a statistical background may recall a concept called interrater reliability and its separate set of metrics (Cohen’s kappa, Scott’s pi, and Krippendorff’s alpha, to name a few). These metrics help keep the labeling team’s work reliable, filter out unscrupulous labels (thus greatly improving the overall system performance), and, in some scenarios, give a solid upper bound for the performance of your model (recall the human-level performance we mentioned earlier in chapter 3).
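As an illustration, Cohen’s kappa for two labelers on a binary task fits in a few lines of pure Python (the label sequences below are made up):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    p_e = sum(  # agreement expected by chance, from marginal frequencies
        (a.count(lbl) / n) * (b.count(lbl) / n) for lbl in labels
    )
    return (p_o - p_e) / (1 - p_e)

rater_1 = [1, 1, 0, 1, 0]
rater_2 = [1, 0, 0, 1, 0]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.615
```

Kappa of 1 means perfect agreement, 0 means chance-level agreement; values persistently near 0 for one labeler against the rest are a strong signal to review that labeler’s work.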

Labeling teams require proper tooling, which may boost efficiency and improve the quality of created labels. If you go with third-party tooling, it’s very likely it will have a toolset for the most popular problems; for your own team, you need to set up or create your own. Aspects of choosing an existing solution or building a brand-new one are very domain specific, although the general approach is: the more popular the ML formulation of your problem, the more chances you have to find a modern quality toolset for the labeler team. As an example, there are multiple software solutions to create labels for image detection at scale, but once you need to annotate videos with 3D polygon meshes, you’re very likely to require something custom made.

Recent advances in large foundation models have become a game changer in the labeling process as well: they can often be used to gather initial labels in a few-shot setup with accuracy higher than that of nonexpert human labelers. One of many examples of a tool used for large language model labeling is Refuel Docs (), notable for being focused on text data; however, using foundation models is also possible in other domains (e.g., addressing universal segmentation models like Segment Anything [] or multimodal solutions like LLaVA []).

There are various algorithmic tricks that may help simplify/reduce the efforts required for the data labeling aspect of building a system, but we will not go into an in-depth review of those in this book, as they are too subject-specific. What we would like to additionally highlight is that very often efforts invested in the data labeling process and tools can be a driver for the success of your ML system.

6.3 Data and metadata

While datasets are used for the ML system we’re building, there is a layer of information on top of it—let’s use the word metadata for it. Metadata is information that describes certain properties of your datasets. Here are some examples:

Metadata is crucial for data flows and guarantees their consistency. Let’s get back to the previous scenario with a medical application. Imagine that a company hired 10 medical professionals to label data for your system. After working for over a year, they have finally labeled a big dataset. Suddenly, you get an email: one of the labelers turns out to be a fraud, his diploma is fake, and his license has been suspended. This is an unacceptable situation because a tangible part of the data must now be considered unreliable, which jeopardizes the entire dataset. So you enact the only adequate solution, which is to stop using his labels in the model ASAP. While this example is somewhat exaggerated, dozens of less significant problems will occur in your system over time. Problems with data sources, regulatory interventions, and attempts to interpret model errors will make you refer to metadata on a regular basis.

As we discussed earlier, recipes for datasets vary, and your dataset will most likely be cooked in several iterations. It’s normal to discover new aspects of the problem over time and adjust the dataset structure and processing accordingly. All these differences should be reflected in metadata.

Another source of changes is testing and bug fixing. Imagine that you work for a ridesharing company and need to solve the problem of estimating how long it will take to reach point A from point B. You start with historical data your company generated, and there is a preprocessing job that extracts the information from logs and stores it in your favorite database, so you end up with a table like latitude_from, longitude_from, datetime_start, datetime_finish, distance, data_source.

At some point, the company makes a big acquisition of a competitor, and you decide to add its data to your ETL process. A new engineer on the team starts coding, new sources are attached, new data points are added to your dataset with the same schema, and suddenly your model performance drops. What happened?

Fast forward: your company stored distance in kilometers, while the acquired team stored distance in miles. An engineer who handled the integration knew it and implemented the support of miles for the new source but accidentally enabled it for the old one as well. So new data samples are correct for the new data source but not for the old one. If you had a metadata field like preprocessing_function_version, you could easily find affected samples. Without it, you need to gather the dataset from scratch after the code defect is discovered.
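A sketch of how such a metadata field localizes the damage; the field names, version numbers, and distances below are hypothetical:

```python
KM_PER_MILE = 1.609344

rides = [
    {"distance": 12.0, "data_source": "legacy",
     "preprocessing_function_version": 1},
    # 8 km wrongly multiplied by the miles-to-km factor in the buggy release:
    {"distance": 12.874752, "data_source": "legacy",
     "preprocessing_function_version": 2},
    {"distance": 5.0, "data_source": "acquired",
     "preprocessing_function_version": 2},  # correct: genuinely miles-to-km
]

# Only legacy-source rows produced by the buggy version are affected,
# so they can be repaired in place instead of regathering the dataset.
for r in rides:
    if r["data_source"] == "legacy" and r["preprocessing_function_version"] == 2:
        r["distance"] /= KM_PER_MILE  # undo the spurious conversion

print([round(r["distance"], 6) for r in rides])  # [12.0, 8.0, 5.0]
```

Without the version field, the buggy and correct rows are indistinguishable, which is exactly why the dataset would have to be rebuilt from scratch.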

Defects like this are not the only scenario where you may need to reveal the date of data sample origin. Some ML systems may affect data sources, creating a phenomenon called a feedback loop. A very common example of a feedback loop is recommender systems: let’s say a marketplace sells many items, and there is a block on the website titled “You may also be interested in these goods.” As a naive baseline, the company could put in the items that are most popular overall. But when this baseline is replaced with an ML system, new items appear, and old leaders lose their popularity in favor of new recommended items. With the poor design of such a system, there is a chance for an item that may not be that good to gain its place among popular ones and dominate for a long time. Storing the information when the sample was generated and what versions of related systems were live at the moment is crucial (not enough, though!) to avoid feedback loops.

Another important scenario related to metadata is stratification, a process of data sampling that shapes the distribution of various subgroups. For example, say a company started its operation in country X, gathering a lot of data related to customers from X. Later the company reached a new market, Y, and has a goal of providing the same level of service, including accuracy of ML models, for customers in both countries. It will require representing customers from X and Y in training and test datasets with proper balance.

Stratification is crucial for validation design and for resisting algorithmic bias; both topics will be covered in future chapters.

6.4 How much data is enough?

After everything we’ve said about the importance of datasets, inexperienced ML practitioners may think that building a data pipeline that streams numerous samples is enough to build a good system. Well, that’s where we can’t confidently say yes or no.

First, not all data samples are equally useful. Growing the dataset size only helps if new objects let the model learn something new and relevant to the problem. Adding samples very similar to existing ones is pointless; the same goes for extremely noisy data. That’s why there is a popular research avenue, called active learning, dedicated to finding the most valuable samples for dataset extension. The simplest intuition is to enrich the dataset with samples where the model was incorrect (this signal is often fetched using human-in-the-loop processes) or demonstrated the lowest confidence. The academic community has developed dozens of methods related to active learning; see a recent survey (e.g., “A Comparative Survey of Deep Active Learning” by Xueying Zhan et al., ) for more information.
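The lowest-confidence intuition is usually called uncertainty sampling. A minimal sketch for a binary classifier, where the probabilities below stand in for a trained model:

```python
# Uncertainty sampling: pick the unlabeled samples whose predicted
# probability is closest to 0.5 (lowest confidence) for labeling next.
def select_for_labeling(unlabeled, predict_proba, budget=2):
    scored = sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))
    return scored[:budget]

# Hypothetical stand-in for a trained binary classifier's predictions.
probs = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45}
pick = select_for_labeling(list(probs), probs.get, budget=2)
print(pick)  # ['b', 'd'] — the two least confident samples
```

The labeling budget then goes to the samples the model is least sure about, which is where new labels teach it the most.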

One more aspect related to dataset size is sampling. While in most cases the question we ask ourselves is “How do we get more data?” sometimes it is “What part of data should I use to increase efficiency?” It usually happens when a dataset needs no manual labeling and is generated by a defined process—for example, the clickstream in popular B2C web services (search engines, marketplaces).

Sampling is effective when a dataset is not only huge but also tends to be imbalanced and/or may contain a lot of duplicates. Different strategies can be applicable here, and the most common of them are based on stratification, where you split data into groups (based on key features or algorithmically defined clusters) and limit the amount of data used for each group. The amounts don’t have to be equal—for example, if the temporal component is relevant to your problem, it’s very likely you will want to prioritize fresh data over old groups.
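A sketch of such cap-per-group sampling, with a higher cap for the fresher group (keys, caps, and records are hypothetical):

```python
from collections import defaultdict

def stratified_sample(samples, key, caps, default_cap=1):
    """Keep at most caps[group] samples per group (default_cap otherwise)."""
    taken = defaultdict(int)
    out = []
    for s in samples:
        k = key(s)
        if taken[k] < caps.get(k, default_cap):
            taken[k] += 1
            out.append(s)
    return out

events = [
    {"id": 1, "month": "2024-01"},
    {"id": 2, "month": "2024-01"},
    {"id": 3, "month": "2024-01"},
    {"id": 4, "month": "2024-06"},
    {"id": 5, "month": "2024-06"},
]
# Prioritize fresh data: a larger cap for the most recent month.
subset = stratified_sample(events, key=lambda s: s["month"],
                           caps={"2024-06": 2, "2024-01": 1})
print([s["id"] for s in subset])  # [1, 4, 5]
```

The same scheme works with any stratification key: a country, a cluster ID, or a deduplication hash.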

Another point is about data noisiness. A while ago there was a strong consensus in the ML community that a few clean samples are better than multiple noisy samples. Things have changed recently, however; the progress with large models like GPT-3 and CLIP demonstrated that at a certain scale, manual filtering is almost impossible (processing a dataset of 500 billion text tokens or tens of millions of images would cost a fortune), but using a huge amount of weakly or self-supervised (automatically, using heuristics) data works, so for some tasks massive imperfect datasets are more suitable than smaller hand-picked ones.

You could expect the model’s performance to improve roughly as the square root of the dataset size, no matter what metric is being used. This estimate is very rough and does not have to fit your problem exactly, but it may give you some intuition (see figure 6.3).
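To put the rule of thumb in numbers, assume (and this is only an assumption, not a law) that the error behaves like c / sqrt(n); then every doubling of the dataset removes roughly 29% of the remaining error:

```python
import math

# If error ~ c / sqrt(n), doubling n multiplies the error by 1/sqrt(2).
def error(n, c=1.0):
    return c / math.sqrt(n)

for n in (1_000, 2_000, 4_000):
    print(n, round(error(n) / error(1_000), 3))  # 1.0, 0.707, 0.5
```

This is why the curves in figure 6.3 flatten: each extra point of quality costs exponentially more data.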

Not every model can be improved by adding new data; the reason lies in the nature of uncertainty. Uncertainty primarily comes from two central origins:

  • Aleatoric (data) uncertainty
  • Epistemic (model) uncertainty

Aleatoric uncertainty emerges primarily due to complexity inherent in the data, like overlapping classes or additive noise. A critical characteristic of data uncertainty is that no matter how much additional training data is collected, it does not reduce.

figure
Figure 6.3 A plot showcasing model metric improvements over dataset size for a real project. Each line demonstrates the dynamics of the target metric for a single task as the dataset size grows.

On the other hand, epistemic uncertainty occurs when the model encounters an input located in a region that is either thinly covered by the training data or falls outside the scope of the training data (see figure 6.4).

figure
Figure 6.4 A schematic view of the main differences between aleatoric and epistemic uncertainties (source: “Explainable Uncertainty Quantifications for Deep Learning-Based Molecular Property Prediction” by Chu-I Yang and Yi-Pei Li, )

Once you have gathered some data and formed a training pipeline (see chapter 10 for details), it unlocks an option of making an informed decision on how much more data you need and estimating the economic efficiency of new data. It is especially reasonable when working with expensive data (e.g., labeled by highly skilled professionals). A high-level algorithm is the following:

The precision of this estimate is low, and it doesn’t take systematic aspects into account (e.g., the concept drifts we’ll discuss in chapter 14), but even this rough estimate is often enough to make an important decision on how much to invest in the data labeling pipeline.
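One common way to implement such an estimate is to retrain on nested subsets, record the error, fit a power law error = a * n^(-b) on the log-log scale, and extrapolate to larger dataset sizes. A sketch with made-up learning-curve measurements:

```python
import math

def fit_power_law(points):
    """points: [(n_samples, error), ...]; fit error = a * n^(-b)
    via least squares in log-log space; returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(e) for _, e in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Hypothetical errors measured after retraining on growing subsets.
curve = [(1_000, 0.20), (2_000, 0.158), (4_000, 0.125), (8_000, 0.099)]
a, b = fit_power_law(curve)
predicted = a * 16_000 ** (-b)  # expected error if we double the data again
print(round(b, 2), round(predicted, 3))
```

Comparing the predicted gain against the cost of labeling the extra samples turns the "more data?" question into a simple economic one.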

6.5 Chicken-or-egg problem

One of the hardest problems you may face with regard to datasets is a problem of a cold start (aka chicken-or-egg problem): when you need data to build a system but the data is not available until the system is launched. What a terrible loop to get stuck in!

It is often a problem for startups or companies trying to launch products in new markets or verticals. And, as often happens in the ML world, the go-to solution is approximation. Since we don’t have data perfectly matching our problem, we need to find something as close as possible. What data will be close depends on the problem, so let’s look at some examples.

For this example, let’s imagine a company focused on employee safety that builds products monitoring how workers follow safety rules in various environments. The new product should check if hard hats were worn on factory grounds. When the product is available, there will be a lot of data coming from customers who agree to share it, but before that, customers’ cameras are not available for the company.

So how do we approach the problem?

These examples may not always be reasonable and applicable, but they represent several ways of solving this problem:

We should mention that not every scenario is applicable to every problem. You obviously won’t build a medical system for lung cancer detection using images of brain scans, and a naive baseline as a medical advisor is no good at all. But training on scans from other hospitals using the same equipment may be a good idea to consider; while every scanner will be calibrated slightly differently, together they can provide some generalization (a model trained on data from hospitals A, B, and C is likely to be useful for hospital D). Scraping other websites is rarely greenlighted in a respectable public company, although it is a popular technique among small startups. Synthetic images obtained by a straightforward rendering pipeline are usually not the way to go because they’re not realistic enough (imperfect lighting, shadows, etc.). But with some secret sauce, you can make them realistic. In the paper “MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision” (Hou et al., 2020, ), researchers rendered objects on top of augmented reality recordings (with accurately estimated lights) to make images way more realistic and thus valuable for a model.

In every scenario, we need to keep in mind that data is not a representative sample of the real distribution we’re going to work with when the system goes live. This means validation results should be taken with a grain of salt, and replacing such proxy datasets with more realistic ones is a top priority. Let’s talk about it when discussing the properties of a healthy data pipeline.

6.6 Properties of a healthy data pipeline

Data gathering and preprocessing need to have three important properties:

  • Reproducibility
  • Consistency
  • Availability

Let’s review the properties one by one.

Reproducibility means you should be able to create a dataset from scratch if needed. There should be no golden data file in your storage crafted by a wise engineer with a bit of dark magic. Instead, there must be a software solution (a simple script, a more advanced pipeline, or a big system) that allows you to create such files again using the same origins as before. This solution needs to be documented and tested well enough that every project collaborator is able to run it when needed.

The reasoning behind reproducibility is similar to the infrastructure-as-code paradigm popular in the DevOps community. Yes, it may seem easier to create an initial dataset manually in the first place, but it’s not scalable in the long run and is very error-prone. Next time—when a new batch of data is going to be added—you may forget to run a preprocessing step, thus causing implicit data inconsistency, a hard-to-detect problem that will affect the whole system’s performance.

Consistency itself is the key. ML problems often have situations when labels are partially ill-defined, and defining a strict separating plane is impossible. In such cases, experts who do labeling tend to disagree with each other.

A very demonstrative example from our background is related to the credit card transactions classification problem. Customers are expected to give each transaction a label describing the purpose of the expense. The initial label taxonomy contains “food and drink,” “bars and wineries,” and “coffee houses.” A rhetorical question arises: What label is more suitable for payment for a coffee and a sandwich ordered in a wine bar? A partial solution for this example would be to have some kind of protocol on how ties are to be broken and how blurry boundaries are resolved (e.g., “when a merchant type and purchased item type are conflicting, the first to be opened takes precedence”)—it greatly simplifies model training and especially validation.

Consistency is not only about labels. All aspects of data should be consistent: what the data origin is, how data is preprocessed, what filtering is applied, etc. It is relatively easy when the system is being built within a small company and way more challenging for an international corporation, and some formal definitions can be helpful. Once you feel there is a chance of misreading a core term used in the data pipeline, it may be useful to add it to the design doc to make sure the whole team is on the same page.

One more aspect of consistency is the match between the data gathered at the system-building stage and the inputs the system receives once it is in use. A mismatch here is common, and in ML terms it leads to distribution mismatch, a problem affecting model performance and fairness that is hard to detect before the system goes live.
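One common way to detect such a mismatch once live traffic starts flowing is to compare feature distributions between the training dataset and production inputs. The sketch below computes the population stability index (PSI) in plain Python; the bin count and the customary alert thresholds (roughly 0.1 for “investigate,” 0.25 for “significant shift”) are industry conventions we add for illustration, not part of the book's text.

```python
import math


def psi(expected, actual, bins=10):
    """Population stability index between two numeric samples.

    0 means identical distributions; larger values mean more drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins so the log ratio stays finite.
        return [(c + 1e-6) / len(sample) for c in counts]

    p, q = histogram(expected), histogram(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Running this per feature on a schedule gives an early warning that production inputs have drifted away from what the model was trained on.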

We recently mentioned the feedback loop phenomenon: the model affects the data origin and thus violates the consistency assumption. We also described situations when data is not available until the system launches, which may also lead to a mismatch. Problems like these are still among the most challenging aspects of applied ML, and you can never ignore them while designing a system.

Data consistency is always an open question, so you should care about it at every stage of ML system development, from initial drafts to long-term maintenance. We will get back to this aspect in chapter 14 when discussing monitoring and drift detection (there will even be a short but rather cool campfire story from one of our friends).

Last but not least is availability. This property is actually an umbrella for two ideas: availability for the system and availability for engineers. The first can also be called reliability; a system designer should be very critical of unreliable data sources. Here is a negative example of how things can ultimately go wrong. Imagine a system that depends on a third-party API enriching your data stream. Everything was fine until it wasn't: the small company powering the API lost its key site reliability engineer and thus the ability to handle the infrastructure. If your system's dependency on this data source is critical, their problems become your problems.

Of course, this doesn't mean third-party APIs are off the table. As we mentioned earlier, using external solutions is often good practice, and not only those provided by giants like Amazon, Google, and Microsoft. But data availability is important, and these risks should be taken seriously. Some vendor services can afford to be less reliable (we can't imagine a top-priority incident caused by the outage of an experimental visualization tool), but data sources are not among them.
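One practical way to keep such risks contained is to wrap every external data call in a thin layer with a timeout budget, a retry, and a degraded fallback, so the system's behavior during an outage is a design decision rather than an accident. The sketch below is a hypothetical wrapper; the enrichment function and the fallback value are placeholders for whatever the real API and product would supply.

```python
import time


def enrich_with_fallback(fetch, payload, retries=2, backoff=0.1, fallback=None):
    """Call an external enrichment function, retrying transient failures
    and degrading to a fallback value instead of crashing the pipeline."""
    for attempt in range(retries + 1):
        try:
            return fetch(payload)
        except Exception:
            if attempt < retries:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    # The dependency is down: degrade gracefully instead of failing hard.
    return fallback
```

For high call volumes, a circuit breaker on top of this wrapper (skipping calls entirely while the dependency is known to be down) is a natural next step.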

The same philosophy applies to internal systems (e.g., those controlled by your peers from other teams). The control levers differ, though: for external systems, service-level agreements are used to estimate risks, while formal agreements for internal systems tend to exist only in mature organizations. At the same time, in smaller companies it is easier to stay aligned with the team maintaining the system and effectively reduce the related risks.

It is also worth mentioning that problems with data availability are not strictly software related. Even if all related systems are built and maintained properly, there may be problems of more sophisticated origin, for example, caused by information security and privacy rights (such as new legal regulations affecting the use of personal data, or your key customer's CEO deciding that their data should no longer leave their infrastructure).

The importance of availability for engineers is easy to underestimate: you can hear something like, “Come on, our engineers can fetch the data even if it's not a simple operation; they are professionals and will handle any technical difficulties.” Most likely that is true (we assume our readers and their colleagues are outstanding professionals), but you can't ignore time constraints. Imagine an engineer goes to lunch with a buddy from another team; they chat about work-related stuff, and over a cup of coffee, a new hypothesis related to the system of current interest pops up.

If the data is easily attainable, you can make an informed decision relatively quickly by pulling the dataset, aggregating some statistics, and maybe even running basic experiments. If the first naive approach confirms the idea, it can be prioritized higher and allocated more resources. Who knows, maybe it will grow into a significant improvement.

Otherwise, if pulling data and making data-driven decisions is time-consuming, the hypothesis is likely to be dropped (“Well, it might be interesting, but I have so many things to do, and finding the data will take so long!”). To avoid such cases, we recommend dedicating a share of engineering effort to building tools that make datasets more available to the people who work with data. While this applies to any engineering productivity tooling, the return on investment is especially high for tools that improve data availability. As we said at the very beginning, it's hard to overestimate the importance of quality data for ML systems, so smoothing interactions with datasets is a good long-term investment.
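What such tooling looks like depends heavily on the company's stack. As one hypothetical sketch, even a tiny in-process dataset registry lowers the barrier from “ask around and write a custom query” to a one-liner; every name and loader below is invented for illustration.

```python
from typing import Callable, Dict, List

_REGISTRY: Dict[str, Callable[[], List[dict]]] = {}


def register_dataset(name: str):
    """Decorator that publishes a dataset loader under a memorable name."""
    def wrap(loader: Callable[[], List[dict]]):
        _REGISTRY[name] = loader
        return loader
    return wrap


def load_dataset(name: str) -> List[dict]:
    """One-line access for anyone exploring a hypothesis after lunch."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown dataset {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_dataset("coffee_transactions_sample")
def _coffee_sample() -> List[dict]:
    # In real life this would pull from the warehouse; here it is a stub.
    return [{"merchant": "wine bar", "amount": 7.5}]
```

The discoverability matters as much as the access itself: the error message lists what exists, so a curious engineer can browse the catalog without leaving the notebook.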

This is enough depth on consistency and availability for now; we will return to these two pipeline properties in greater detail in chapter 10. Meanwhile, it's time to move on to the practical part of the chapter: the design document.

6.7 Design document: Dataset

Describing the steps of the data problem naturally leads to additional questions to be answered in the design document. The following is a checklist of questions we suggest asking yourself at this point.

As often happens, answering these questions can spawn even more questions. Give yourself the freedom to think about those. Time dedicated to answering data-related questions always has an outstanding return on investment in ML systems design.

6.7.1 Dataset for Supermegaretail

Now let’s go back to preparing the design document. This time we are preparing a section devoted to datasets for our imaginary companies. As usual, we’ll start with Supermegaretail.

6.7.2 Dataset for PhotoStock Inc.

Let’s now switch to the PhotoStock Inc. design document.

Summary
