What do the works of John Grisham, Jodi Picoult, and George R.R. Martin have in common? In addition to being bestsellers, they are at the heart of a consolidated class action asserting that OpenAI's use of those authors' work as training material for its large language models ("LLMs") violates copyright law. In a blow to ChatGPT maker OpenAI, the U.S. District Court for the Southern District of New York ruled this week that the plaintiffs adequately stated a prima facie claim for copyright infringement, denying OpenAI's motion to dismiss and opening the door to discovery. The dispute has the potential to radically alter the artificial intelligence landscape and could hamper the industry's astronomical growth. Everything about In re: OpenAI, Inc. Copyright Infringement Litigation, No. 25-md-3143 (S.D.N.Y.), is big: the authors, the industry, the claims, and the stakes – for the parties and the public.
What You Need to Know:
- A consolidated class action by authors and copyright holders alleges OpenAI infringed their copyrights by training OpenAI's LLMs on their works and creating infringing works in the LLM outputs.
- Judge Stein of the Southern District of New York denied OpenAI's motion to dismiss plaintiffs' consolidated class action complaint, finding plaintiffs plausibly alleged that OpenAI copied protected works and that the outputs ChatGPT created bore a substantial similarity to protected elements of those works.
- The court emphasized its ruling was not "intended to suggest a view on whether the allegedly infringing outputs are protected as fair uses of the original works," noting that is a fact-intensive inquiry courts typically do not address until summary judgment.
If you've ever wondered how ChatGPT does what it does, the court breaks down the process succinctly: OpenAI and other companies supply LLMs with massive amounts of text that enable the LLMs to uncover similarities in words' meanings and functions; the words are broken down into "tokens" that are then translated into numerical sequences called "vectors," and the vectors are used to "map" the location of tokens relative to one another. Understanding the relationship between the tokens allows the LLMs to digest prompts and produce appropriate responses in plain language. Because the quality of outputs depends on the quality of the text inputs, published books, such as those of the plaintiffs, are prized training material.
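The tokenize-and-map process the court describes can be sketched in miniature. This is a toy illustration only, not OpenAI's actual implementation: the vocabulary, vector values, and similarity function below are invented for demonstration, and real LLMs learn their vectors from massive volumes of training text rather than hard-coding them.

```python
import math

# Toy vocabulary: each word ("token") is assigned a numeric ID.
vocab = {"the": 0, "king": 1, "queen": 2, "throne": 3}

# Toy embedding table: each token ID maps to a vector.
# These values are invented purely for illustration; an LLM
# learns them so that related words land near one another.
embeddings = {
    0: [0.10, 0.00],   # "the"
    1: [0.90, 0.80],   # "king"
    2: [0.85, 0.82],   # "queen"
    3: [0.70, 0.30],   # "throne"
}

def tokenize(text):
    """Break text into token IDs using the toy vocabulary."""
    return [vocab[word] for word in text.lower().split()]

def cosine_similarity(a, b):
    """Measure how close two vectors sit on the 'map'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Text -> tokens -> vectors, as the court's summary describes.
tokens = tokenize("the king")
vectors = [embeddings[t] for t in tokens]

# Semantically related tokens ("king"/"queen") map closer
# together than unrelated ones ("king"/"the"):
print(cosine_similarity(embeddings[1], embeddings[2]) >
      cosine_similarity(embeddings[1], embeddings[0]))  # prints True
```

The point of the sketch is the court's observation that the quality of this learned "map" depends on the quality of the training text, which is why professionally written books are prized inputs.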
Plaintiffs – a group of authors and copyright holders – assert that incorporating their works into training material and responses to prompts constitutes infringement. Specifically, the consolidated class action complaint alleges OpenAI infringes their works first by downloading and reproducing their works as training text, and then again by using the reproduced works in the outputs of LLMs such as ChatGPT. To state such a claim, plaintiffs had to allege actual copying and substantial similarity. Seeking a quick exit, OpenAI moved to dismiss, claiming plaintiffs failed to plausibly allege substantial similarity between the protected works and ChatGPT's outputs.
The court disagreed and denied OpenAI's motion. Because the complaint squarely alleged access to plaintiffs' works and that ChatGPT's outputs were based on plaintiffs' works, the court found the "actual copying" element of a copyright infringement claim was satisfied.
The bulk of the court's opinion concerned the second element, substantial similarity. As the court explained, the standard test for substantial similarity is the "ordinary observer" test, which assesses whether the average "lay observer would overlook any dissimilarities between the works and would conclude that one was copied from the other." However, where – as here – a work contains both protectible and unprotectible (i.e., stock themes and characters or "scènes à faire") elements, courts apply the stricter "more discerning observer" test. Under that test, courts ask whether substantial similarity exists between "only those elements[] that provide copyrightability to the allegedly infringed" work.
The court found plaintiffs' allegations satisfied this element as well. Using (spoiler alert) outputs that summarized the plot of the book "A Game of Thrones" by George R.R. Martin and created an alternative sequel to "A Clash of Kings" from the same series, the court explained that a "more discerning observer could reasonably conclude" ChatGPT's outputs were substantially similar to the copyrighted works. The court based its conclusion on the fact that the plot summary "conveys the overall tone and feel of the original work by parroting the plot, characters, and themes of the original." The court also found the imagined sequel to be substantially similar to plaintiffs' works because it incorporated copyrightable elements such as setting, plot, and characters. As a result, the court found plaintiffs adequately stated a prima facie claim of copyright infringement based on ChatGPT's outputs.
Surviving the motion to dismiss clears a significant hurdle for the plaintiffs, who can now move toward discovery. In this case, the discovery phase will likely be contentious: the income generated by LLMs hinges on the particular and sophisticated manner in which they process prompts, which is partly attributable to the training material they are fed, such as the plaintiffs' works. As a result, keeping those processes shielded from competitors is a high priority, and the entry of a comprehensive protective order is likely forthcoming.
If the similar copyright infringement suit filed by The New York Times is any indication (see The New York Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y.)), discovery will also present other complex issues, such as protocols for inspecting training data, the selection of numerous ESI custodians, and the production of financial documents and investor materials. More broadly, the case moving forward means that a potentially existential threat to monetized LLMs is one step closer to materializing. Alternatively, the case could end in settlement, as other AI-related litigation has; in September, Anthropic agreed to pay $1.5 billion to resolve a similar suit (amounting to around $3,000 per allegedly infringed work). This potentially paradigm-shifting case will remain one to watch as the law attempts to keep up with evolving uses of artificial intelligence.