Fair Use and AI Training

Fair use is a US copyright doctrine, codified in section 107 of the Copyright Act, that permits certain unlicensed uses of copyrighted works. It is the central defense AI developers raise when accused of infringement for training models on copyrighted text, images, code, or music gathered from the internet without permission. Courts assess fair use by weighing four factors: the purpose and character of the use (including whether it is “transformative” and whether it is commercial); the nature of the copyrighted work; the amount and substantiality of what was used; and the effect of the use on the market for the original.

In the AI-training context, the first and fourth factors do the heavy lifting. Developers argue that training is highly transformative: the model learns statistical patterns rather than reproducing the works, much as a person learns to write by reading. Rightsholders counter that the use is commercial and that AI output can substitute for the originals and depress their market. The US Copyright Office, in Part 3 of its AI report, concluded that training can be transformative, especially for research or closed, non-substitutive tasks, but that making “commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access,” likely exceeds fair-use boundaries. Two factual wrinkles recur in the case law: whether the training data was lawfully acquired or pirated, and whether the model memorizes and regurgitates protected expression.

Early rulings split on the details. Thomson Reuters v. Ross rejected fair use for a non-generative legal-research tool, while Bartz v. Anthropic and Kadrey v. Meta found that training generative models was fair use. Even so, those two rulings flagged serious remaining vulnerabilities: pirated copies (Bartz) and market dilution (Kadrey).

Why business readers should care: fair use is not a blanket permission. Whether an AI product can rely on it turns on specific, contestable facts: how the data was obtained, how transformative the use is, and whether the output competes with the source. That makes it a live risk rather than settled law.
