This blog continues our series called “AI Essentials,” which aims to bridge the knowledge gap surrounding AI-related topics. It discusses copyright infringement and fair use when it comes to the inputs used in AI training and why these legal questions are critical for the AI ecosystem and startup innovation.
Startups are increasingly innovating in AI, but unresolved questions about copyright hang like a sword of Damocles over the entire AI ecosystem. Numerous ongoing AI lawsuits turn on whether including copyrighted content, such as written works, images, or music, in datasets to train generative AI models constitutes infringement. How these cases are resolved will determine the pace of AI innovation and whether startups can afford to participate in the AI ecosystem at all.
Much of the current wave of innovation in AI is based upon closed or open-source foundation models that startups often fine-tune to perform a specific task. These models are trained on a large corpus of training data inputs so that they can accurately learn about the world and document the relationships between words, pixels, tones, and more. Many large models learn from what is publicly available on the Internet. And since most content created is copyrightable — including anything expressive, such as articles, songs, and even meaningless tweets about brushing your teeth — this training data may include copyrighted content.
Copyright law is designed to benefit the public by incentivizing the creation and dissemination of works, granting creators exclusive rights to their works. While copyright protects the expression of an idea, it does not protect the idea itself or facts. Additionally, the fair use doctrine allows copyrighted content to be used without the rightsholders’ consent for purposes like criticism, comment, news reporting, teaching, scholarship, or research, when the relevant factors weigh in favor of fair use. These limitations and exceptions are key to understanding how debates around copyright and AI should be resolved.
For startups, it is crucial that courts find the use of copyrighted content as training inputs either does not constitute infringement or is protected by fair use. Generative AI models learn from inputs much as humans learn from articles, books, or art before producing a new creation. AI models use inputs to understand and interpret concepts. This learning process does not result in the AI model directly copying the content it is trained on; instead, the model documents relationships and patterns as vectors. Once training is finished, a model can produce outputs, like new writing or images, informed by those relationships.
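The "relationships and patterns as vectors" point can be made concrete with a toy sketch. The words and numbers below are invented purely for illustration; real models learn far larger vectors from vast amounts of data, but the principle is the same: what is stored is numeric relationships between concepts, not copies of the training text.

```python
import math

# Hypothetical embedding values, chosen by hand for illustration only.
# In a trained model, these numbers are learned, not assigned.
embeddings = {
    "song":    [0.9, 0.1, 0.3],
    "melody":  [0.8, 0.2, 0.4],
    "invoice": [0.1, 0.9, 0.0],
}

def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Related concepts end up with similar vectors; unrelated ones do not.
print(cosine_similarity(embeddings["song"], embeddings["melody"]))   # high
print(cosine_similarity(embeddings["song"], embeddings["invoice"]))  # low
```

Note that nothing in this representation contains the words of any source document; the model retains only the geometry of how concepts relate to one another.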
You can think of an AI model as a composer who takes tempos, rhythms, and chord progressions from various known classical pieces to create a new composition. Similar to how the composer pulls and combines existing musical elements to generate new, original music, AI models learn from and interpret existing creative content to then produce a new output. (AI models produce unique creations, but users can still prompt outputs that are materially similar to copyrighted material, often despite safeguards designed to prevent it.)
If inputs are deemed to be infringing, courts will need to determine whether fair use justifies using copyrighted content in training data. The fair use doctrine allows for certain unauthorized uses of copyrighted content, and when evaluating a fair use defense, courts weigh four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the market for the original work.
In the context of AI training, these factors support a fair use defense. The purpose and character of AI’s use of copyrighted content is transformative because it does not replicate the original content but generates new outputs with added character or a further purpose. Regarding the nature of the work, AI processes creative inputs in a way that transforms them into new content. While AI models may ingest entire copyrighted works, they do not simply replicate that content in its entirety in their outputs. Finally, AI’s use of copyrighted content is unlikely to harm the market for the original work because AI developers don’t directly profit from the distribution of generated content; instead, they monetize the tools that enable users to generate outputs.
While these fair use factors are generally favorable for AI developers, the application is not always straightforward, and different cases can lead to varying outcomes. Fair use is a fact-specific, case-by-case determination made by the courts. For AI companies, particularly startups, this legal uncertainty poses a significant challenge. The risk of facing a copyright infringement suit can result in costly litigation and could stifle innovation.
Rightsholders — like publishers, authors, record labels, and others — have construed AI training as infringement, saying it constitutes an unauthorized use of their works. Many have sued AI companies of all sizes, ranging from startups to market leaders. If the rightsholders prevail, it will upend the AI innovation ecosystem.
In such an environment, AI developers would need to secure licenses before they could begin development. In response to the current legal uncertainty, larger AI companies are opting to enter licensing agreements for copyrighted content to avoid potential lawsuits or to settle ongoing litigation. This approach is unworkable for startups and threatens smaller companies’ participation in the AI ecosystem. Licenses would be prohibitively expensive for startups, and negotiations would be unbalanced given the size difference between small startups and large rightsholder organizations. This would make it difficult for startups to fine-tune existing models with anything besides their own data (which is often in short supply for new entities) and nearly impossible for a new startup to challenge existing players in frontier model development.
Moreover, requirements to license training data would remake the ecosystem most startups build upon. Licenses would increase the cost of frontier model development, and those costs would be passed on to the startups that rely on access to those models. Open-source development would also be harmed, as developers would be unlikely to release their models for free, along with documentation revealing how they were trained and what they ‘know.’
To promote innovation and maintain a competitive AI ecosystem, startups need legal clarity to move forward with AI development without the fear of facing costly copyright infringement lawsuits related to training data.
Disclaimer: This post provides general information related to the law. It does not, and is not intended to, provide legal advice and does not create an attorney-client relationship. If you need legal advice, please contact an attorney directly.
Engine is a non-profit technology policy, research, and advocacy organization that bridges the gap between policymakers and startups. Engine works with government and a community of thousands of high-technology, growth-oriented startups across the nation to support the development of technology entrepreneurship through economic research, policy analysis, and advocacy on local and national issues.