🤖 AI Tools · 7 min read

AI Copyright & Training Data — The Lawsuits That Matter for Developers (2026)


Between 2023 and 2024, over 50 copyright lawsuits were filed against AI companies. By early 2026, the results are in — and the era of “train first, ask later” is over.

For years, AI companies operated on a simple assumption: scraping the internet was fair game. Books, news articles, source code, music — if it was publicly accessible, it was training data. Courts, regulators, and creators have now pushed back hard. Several landmark settlements and rulings have redrawn the lines around what AI companies can and can’t do with other people’s work.

If you build software with AI tools — or build AI tools yourself — these cases directly affect you. Here’s what happened, what it means, and what you should be doing about it.

The cases that changed everything

Bartz v. Anthropic — The $1.5 billion wake-up call

What happened: In fall 2025, Anthropic agreed to a $1.5 billion settlement after a class action led by authors whose books were used to train Claude. The core allegation was straightforward: Anthropic had downloaded and processed millions of pirated books — full copies, not snippets — as training data. The judge drew a sharp line: training on lawfully acquired books could qualify as fair use, but downloading and stockpiling pirated copies could not. With statutory damages for millions of works on the table, Anthropic settled.

What it means: This was the first mega-settlement in AI copyright law, and it sent shockwaves through the industry. The sheer dollar amount made it clear that “we didn’t know” or “it was publicly available” aren’t viable defenses. If your model was trained on pirated or unlicensed content, you’re exposed — and the liability can be enormous.

NYT v. OpenAI — The case that could reshape everything

What happened: The New York Times sued OpenAI, arguing that ChatGPT can reproduce Times articles nearly verbatim. The case is still ongoing as of April 2026, and it’s the one legal experts are watching most closely. The Times has presented evidence of outputs that closely mirror original reporting, raising the question: when does “learning from” become “copying”?

What it means: This case could set the broadest precedent. If the court rules that reproducing substantial portions of copyrighted text — even as model output rather than a direct copy — constitutes infringement, it would fundamentally change how models are trained and how outputs are filtered. Every AI company with a chatbot product has skin in this game.

Doe v. GitHub, Microsoft & OpenAI — Code gets its day in court

What happened: Developers filed a class action arguing that GitHub Copilot was trained on public repositories, including code licensed under GPL and other copyleft licenses, without honoring those licenses’ terms. In November 2025, the parties settled. The terms included commitments to filtering mechanisms, attribution for suggested code, and a clear statement that Microsoft and OpenAI make no ownership claim over Copilot’s output.

What it means: This is the case that hits closest to home for developers. If you use Copilot or similar tools, the settlement’s attribution and filtering commitments are directly relevant. It also validated a key concern: training on open-source code doesn’t mean you can ignore the license it came with. For more on this, see our deep dive on open-source AI and legal compliance.

UMG v. Udio — The opt-in principle arrives

What happened: Universal Music Group sued AI music generator Udio, and the court established a principle that’s now rippling across the industry: artists must consent before their work is used for training. No more burying an opt-out link in a terms-of-service page. The default flipped — if you don’t have permission, you can’t use it.

What it means: This ruling introduced the opt-in standard for creative works. While it originated in the music space, the logic applies broadly. Expect this principle to show up in future cases involving text, images, and code.

Warner Music v. Suno — From lawsuit to licensing deal

What happened: Warner Music sued AI music platform Suno, and the case settled in November 2025, not with a simple payout but with a licensing partnership. Under the deal, artists retain control over the use of their names, likenesses, and voices in AI-generated content.

What it means: This is the template some in the industry are hoping for: litigation that leads to a workable licensing model rather than a blanket ban. It suggests a future where AI companies pay for training data the same way streaming services pay for music. Whether that model scales to code and text remains to be seen.

Like Company v. Google — The EU enters the ring

What happened: On March 10, 2026, the EU’s Grand Chamber heard Like Company v. Google — the first case to directly ask whether training a large language model violates EU copyright law. A 15-judge panel is now deliberating. The case centers on whether the text and data mining exceptions in the EU Copyright Directive cover commercial LLM training at scale.

What it means: This is the most important AI copyright case outside the US. The EU’s answer will determine whether AI companies need explicit licenses to train on European content — or whether existing exceptions provide enough cover. A ruling against Google could force AI companies to negotiate data access across the entire EU, with massive compliance implications.

Thaler v. Perlmutter — AI can’t be an author

What happened: Stephen Thaler tried to register a copyright for an image generated entirely by AI, with no human author. The Copyright Office refused. Thaler sued and lost, lost again on appeal, and the Supreme Court ultimately declined to hear the case.

What it means: The law is now settled on this point in the US: purely AI-generated works are not copyrightable. If there’s no human authorship in the creative process, there’s no copyright protection. This matters for anyone generating code, text, or images with AI — your output may not be protectable. We covered the implications for developers in who owns AI-generated code.

The fair use argument — why courts weren’t convinced

AI companies leaned heavily on fair use as their primary defense. The argument: training a model on copyrighted material is “transformative” because the model doesn’t store or reproduce the original works — it learns patterns from them.

Courts weren’t buying it, for three main reasons:

  1. Verbatim reproduction. Plaintiffs, most prominently in NYT v. OpenAI, presented outputs that reproduced substantial portions of copyrighted works nearly verbatim. That undercuts the “transformative” argument significantly.

  2. Commercial revenue. Fair use is harder to claim when the use generates billions in revenue. These aren’t academic research projects — they’re commercial products competing in the market.

  3. Market competition. The US Copyright Office weighed in with guidance in May 2025 stating that when AI training competes with existing licensing opportunities for the original works, the fair use analysis tilts against the AI company. If a chatbot can summarize a news article, that competes with the newspaper’s subscription model.

The takeaway: fair use is not a blank check for commercial AI training. Companies that assumed otherwise are now paying for it — literally.

The opt-in revolution

The clearest trend across all of these cases is the shift from opt-out to opt-in.

Two years ago, the default was: your work is training data unless you actively object. Creators had to find opt-out forms, submit requests, and hope they were honored. Now, the legal and regulatory momentum is moving in the opposite direction. UMG v. Udio established opt-in for music. The Bartz settlement implies it for books. The EU case could codify it for all content under European law.

For AI companies, this means building licensing infrastructure — not just scraping infrastructure. For developers, it means the tools you use will increasingly come with data provenance documentation. Ask for it.

What this means for developers using AI tools

If you use AI coding assistants, chatbots, or generation tools in your workflow, here’s what to keep in mind:

  • Your AI-generated output may not be copyrightable. After Thaler v. Perlmutter, purely AI-generated work has no copyright protection in the US. If human creativity is part of the process, you’re in better shape — but the lines are still being drawn.

  • License compliance still matters. The Copilot settlement confirmed that training on open-source code doesn’t erase license obligations. If your AI tool suggests code that originated from a GPL project, you need to know about it (a minimal check is sketched after this list). Check out our guide on open-source AI legal compliance.

  • Data provenance is your responsibility too. You may not have trained the model, but if you ship code or content generated by a model trained on infringing data, you could face downstream risk. Read more in our piece on AI code and data privacy.

  • Vendor transparency matters more than ever. The settlements are pushing AI companies toward disclosure about training data sources. Take advantage of that — and push for more. See our roundup of AI coding agents with strong privacy practices.
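
To make the license-compliance point concrete, here’s a minimal sketch of a check that flags copyleft markers in AI-suggested code before you commit it. The marker list and matching logic are illustrative assumptions, not a complete policy; a real pipeline would pair something like this with a dedicated license scanner.

```python
import re
import sys
from pathlib import Path

# Illustrative, non-exhaustive markers for copyleft licenses.
# Assumption: a suggestion that copies licensed code often carries
# its original license header or SPDX tag along with it.
COPYLEFT_PATTERNS = [
    r"GNU (Lesser |Affero )?General Public License",
    r"SPDX-License-Identifier:\s*(GPL|LGPL|AGPL)",
]

def copyleft_hits(path: Path) -> list[str]:
    """Return the copyleft markers found in a source file, if any."""
    text = path.read_text(errors="ignore")
    return [p for p in COPYLEFT_PATTERNS if re.search(p, text)]

if __name__ == "__main__":
    status = 0
    for name in sys.argv[1:]:
        hits = copyleft_hits(Path(name))
        if hits:
            status = 1
            print(f"{name}: possible copyleft-licensed code ({', '.join(hits)})")
    sys.exit(status)
```

Wired into a pre-commit hook or CI step, a check like this won’t catch unmarked copied code (only a real scanner and the vendor’s own filtering help there), but it does catch the easy case: a suggestion that arrives with its original license header attached.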

What to ask your AI vendor

Before you commit to (or renew) an AI tool, ask these questions — a simple way to track the answers follows the list:

  1. What data was used to train this model? Can you provide documentation or a data card?
  2. Do you have licenses for the training data? Especially for code, text, and creative works.
  3. What filtering or attribution mechanisms are in place? Post-Copilot settlement, this is table stakes.
  4. What’s your indemnification policy? If your output infringes on someone’s copyright, who’s liable?
  5. How do you handle data retention? Understanding AI data retention policies is critical for compliance.
  6. Are you compliant with EU regulations? If you operate in or serve European users, the Like Company v. Google ruling could change the landscape overnight.
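
If you want those answers on record rather than in a sales call, it can help to capture them in a structured form. Here’s a small illustrative sketch; the field names are assumptions mapped to the six questions above, not any standard schema.

```python
from dataclasses import dataclass, fields

@dataclass
class VendorDiligence:
    """One record per AI vendor; fields map to the six questions above."""
    vendor: str
    training_data_documented: bool     # Q1: data card or equivalent provided
    training_data_licensed: bool       # Q2: licenses held for training data
    filtering_and_attribution: bool    # Q3: output filtering / attribution in place
    indemnifies_customers: bool        # Q4: vendor assumes infringement liability
    retention_policy_in_writing: bool  # Q5: documented data retention policy
    eu_ready: bool                     # Q6: prepared for EU rulings

def open_questions(record: VendorDiligence) -> list[str]:
    """Return the diligence points the vendor hasn't satisfied."""
    return [
        f.name for f in fields(record)
        if isinstance(getattr(record, f.name), bool) and not getattr(record, f.name)
    ]

# Hypothetical vendor with made-up answers:
print(open_questions(VendorDiligence(
    vendor="ExampleAI",
    training_data_documented=True,
    training_data_licensed=False,
    filtering_and_attribution=True,
    indemnifies_customers=False,
    retention_policy_in_writing=True,
    eu_ready=False,
)))
# -> ['training_data_licensed', 'indemnifies_customers', 'eu_ready']
```

A “no” isn’t automatically disqualifying, but an unanswered question is a risk you’re choosing to carry.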

The legal ground is still shifting, but the direction is clear: more transparency, more licensing, more accountability. The developers and companies that get ahead of this now will be in the strongest position when the dust settles.