AI Essentials: The role of data and where it comes from

This blog continues our series called “AI Essentials.” It discusses the role of data in AI development, where companies get their training data, and what this means for competition and innovation in the AI ecosystem.

Data is a fundamental resource that powers all AI systems. Different types of data serve different functions in AI development, and that data comes from varying sources. Both the type and the source depend heavily on the stage of development, the resources the developer has on hand, and the task(s) the model is being trained to perform. Legislators, regulators, and rightsholders can each affect the availability and types of data used for AI, and shape which companies can participate in AI innovation in the first place.

AI systems require not just large quantities of data, but data that’s properly structured, labeled, and divided for different phases of development. The specific requirements depend on the model’s purpose and learning method. In supervised learning, we give the AI system examples with clear “answers” or labels — imagine teaching a child by showing them pictures of animals and telling them “this is a cat” or “this is a dog.” For a medical AI, this means X-rays must be marked by experts exactly where fractures appear to teach the system what a broken bone looks like and where to find it.
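To make the idea of learning from labeled examples concrete, here is a minimal sketch in Python. It uses a deliberately simple method (averaging the examples for each label, then picking the closest average), and the animal measurements are made up for illustration. Real systems use far more data and far more capable models.

```python
from collections import defaultdict

# Hypothetical labeled examples: (weight_kg, ear_length_cm) -> label.
# The numbers are invented for illustration only.
labeled_examples = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((4.5, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((25.0, 12.0), "dog"),
    ((30.0, 12.5), "dog"),
]

def train(examples):
    """Average the features of each label to get one 'centroid' per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (weight, ear), label in examples:
        sums[label][0] += weight
        sums[label][1] += ear
        counts[label] += 1
    return {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    def sq_dist(c):
        return (c[0] - features[0]) ** 2 + (c[1] - features[1]) ** 2
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

centroids = train(labeled_examples)
print(predict(centroids, (5.0, 6.8)))    # an unseen, cat-sized animal
print(predict(centroids, (22.0, 11.5)))  # an unseen, dog-sized animal
```

The labels are what make this "supervised": without the "cat" and "dog" answers attached to each example, the `train` step would have nothing to average toward.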

In contrast, unsupervised learning uses unlabeled examples, allowing the AI to find patterns on its own. This is like asking someone to sort a drawer full of socks without telling them how; they might discover that grouping by color or size works best, thereby ‘learning’ the groups. In AI applications, this can help discover natural groupings without predefined categories.
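The sock-sorting analogy can be sketched in a few lines of Python. This example uses k-means clustering, one common unsupervised technique chosen here for illustration, on hypothetical sock lengths. Note that no labels are supplied; the code only sees the measurements.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny 1-D k-means (k >= 2): returns (centers, groups)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Assign every value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, groups

# Hypothetical sock lengths in centimeters, with no labels attached.
sock_lengths_cm = [18, 42, 20, 45, 19, 40, 21, 44]
centers, groups = kmeans_1d(sock_lengths_cm, k=2)
for center, group in zip(centers, groups):
    print(round(center, 1), sorted(group))
```

The algorithm ends up separating short socks from long ones without ever being told those categories exist. The one thing it is told is how many groups to look for (`k`).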

Meanwhile, semi-supervised learning combines both approaches, using a small amount of labeled data together with larger amounts of unlabeled data. This is particularly valuable when labeling is expensive or time-consuming, like in medical imaging where specialist doctors would need to annotate thousands of images.
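One common way to combine the two kinds of data is "self-training" (also called pseudo-labeling): train on the small labeled set, use that model to guess labels for the unlabeled data, then retrain on everything. The sketch below assumes a single made-up feature per image and a very simple model. It illustrates the idea and is not how any particular medical system works.

```python
def centroids(labeled):
    """Mean feature value per label."""
    totals = {}
    for value, label in labeled:
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + value, n + 1)
    return {label: s / n for label, (s, n) in totals.items()}

def nearest_label(cents, value):
    """Label whose mean is closest to the given value."""
    return min(cents, key=lambda label: abs(cents[label] - value))

# Hypothetical data: one numeric feature per image.
labeled = [(1.0, "normal"), (9.0, "fracture")]   # few, expensive expert labels
unlabeled = [1.5, 2.0, 2.5, 7.0, 7.5, 8.0]       # many, cheap, no labels

# Step 1: train on the small labeled set.
cents = centroids(labeled)
# Step 2: pseudo-label the unlabeled data with the current model.
pseudo = [(v, nearest_label(cents, v)) for v in unlabeled]
# Step 3: retrain on labeled plus pseudo-labeled data.
cents = centroids(labeled + pseudo)
print(cents)
```

In practice, pseudo-labels are usually kept only when the model is confident in them, since wrong guesses fed back into training can reinforce mistakes.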

Different stages of AI development require datasets that serve distinct purposes. Training data teaches initial patterns, while testing data — kept entirely separate — verifies whether the AI can handle new situations. For medical imaging, this means not just training on thousands of fracture X-rays, but testing on a separate set of cases to ensure the AI can spot breaks it hasn’t seen before.
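Keeping training and testing data separate can be sketched as follows. The function name, the 80/20 split, and the X-ray identifiers are assumptions for illustration; the key property is that no example appears in both sets.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical identifiers standing in for labeled X-ray images.
xray_ids = [f"xray_{i:03d}" for i in range(10)]
train_set, test_set = train_test_split(xray_ids, test_fraction=0.2)

# The two sets must not overlap, or the test would reward memorization.
assert not set(train_set) & set(test_set)
print(len(train_set), len(test_set))
```

If the same X-ray appeared in both sets, a model could score well simply by remembering it, which says nothing about how it handles fractures it has never seen.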

For data, quality often matters more than quantity. Good training data needs to be accurate, well-labeled, and representative of real-world scenarios. A customer service AI tool trained mostly on routine queries will struggle with complex problems, even if trained on millions of examples. Similarly, an autonomous vehicle AI model learns more from a few thousand carefully annotated hazard scenarios than millions of normal driving situations.

Such specialized data can originate from various sources. Some datasets are publicly available, such as government weather data or academic research. However, many valuable datasets are collected from real-world interactions, drawn from high-quality sources, or curated specifically for the intended AI application. Large technology companies generally hold a competitive advantage, given the significant resources it takes to curate those datasets and to negotiate licenses or fight lawsuits with rightsholders, as well as their access to large volumes of highly relevant data collected through user interactions.

AI companies draw data from a range of sources, including their own services and users. Information collected directly from users, or first-party data, can be especially useful for improving models because it captures real-world interactions and is continuously updated. For example, an AI product designed to detect elderly people falling benefits from such data to differentiate falling from sitting down in a chair. Large companies with many users have access to more interactions that help improve their models. In contrast, startups typically don’t have large user bases or multiple services generating continuous data streams. Data privacy rules and regulators also factor in here, with the Federal Trade Commission earlier this year warning AI companies against changing their terms to leverage user data for AI training.

Many companies, large and small, ingest data from the open web or from open data sources. Open data sources are freely accessible datasets that anyone can use, often maintained by public institutions, governments, or nonprofits. These sources provide essential resources for startups that lack extensive proprietary data. The nonprofit Common Crawl, for instance, maintains a public web archive of over 9.5 petabytes of data, accessible to anyone, from small startups to big players such as Stability AI, whose Stable Diffusion models were trained on filtered versions of this data compiled by another nonprofit organization, LAION.

Original, expressive content, from everyday folks and the most well-known organizations alike, is plentiful online and therefore appears throughout many datasets scraped from the public web. Large rightsholder organizations and well-known celebrities have alleged that ingesting this data for training purposes amounts to copyright infringement, and have filed lawsuits against the largest AI companies and startups alike. In response, some large companies have negotiated agreements to license data for AI from these entities. Those deals can run into many millions of dollars annually, beyond the budgets of startups, meaning startups might not be able to participate in AI innovation. Ingesting data (facts about the world) to learn is not an infringing practice, and policymakers will need to support this understanding if the AI ecosystem is to remain competitive. (Outputs, on the other hand, can sometimes be infringing.)

Other types of data used for AI include custom datasets compiled to tailor the data to meet specific industry needs. For example, an autonomous vehicle company might collect video footage of different driving scenarios, then label objects like pedestrians, cars, and traffic signs in each frame. This purpose-built dataset ensures that the data aligns precisely with the company’s goals, but it requires significant time, resources, and infrastructure to organize and label at scale.

Where real-world data is scarce, some companies may use synthetic data: artificially generated data that mirrors real-world patterns. Rather than collecting thousands of real-world images of rare driving scenarios, a company could simulate various traffic conditions, weather patterns, and road layouts and record that data. This approach can help expand the dataset quickly and provide diverse examples that might be challenging to capture in real life. However, recent research warns that models may become overly tuned to synthetic patterns that do not align with real-world data, resulting in “model collapse” when training exclusively on synthetic data.
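As a rough illustration, generating synthetic driving scenarios can be as simple as sampling combinations of conditions. The categories and field names below are hypothetical, and a real simulator would render sensor data rather than just describe a scene. Each record is tagged as synthetic so it can later be mixed with, and distinguished from, real-world data.

```python
import random

# Hypothetical condition lists for illustration only.
WEATHER = ["clear", "rain", "fog", "snow"]
ROAD_LAYOUTS = ["highway", "intersection", "roundabout", "residential"]
TRAFFIC = ["light", "moderate", "heavy"]

def synthetic_scenarios(n, seed=0):
    """Sample n scenario descriptions from the condition lists."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    return [
        {
            "weather": rng.choice(WEATHER),
            "road": rng.choice(ROAD_LAYOUTS),
            "traffic": rng.choice(TRAFFIC),
            "synthetic": True,  # provenance flag: this record was generated
        }
        for _ in range(n)
    ]

scenarios = synthetic_scenarios(5)
for scenario in scenarios:
    print(scenario)
```

Tracking which records are synthetic makes it possible to control how much of a training set they make up, which matters given the concerns about training exclusively on synthetic data.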

Policymakers working on issues from data privacy to intellectual property rights have a wide remit to affect the competitiveness of the AI ecosystem (or lack thereof), depending on the actions they take on data and AI. They should seek a balanced and competitive landscape that ensures small startups with few resources can continue to innovate, grow, and compete.