Training Data

The pirated library inside the model

A new suit against Meta argues the company didn't just read the books to build Llama. It kept them — and the model can be made to prove it.

Ada Mercer·May 20·7 min

Photograph: Iñaki del Olmo / Unsplash

A language model does not contain a library, exactly. It contains the residue of one — the statistical shadow that millions of books leave on a few hundred billion numbers after a system has read them all and been trained, over and over, to guess what word comes next. For three years the central legal question about that residue was whether producing it counted as reading or as copying. On May 5 a class action filed in the Southern District of New York proposed a more uncomfortable answer: that for some of the books, the model did not leave a shadow at all. It kept the thing itself.

The plaintiffs are not the usual roster of aggrieved novelists. They are five of the largest academic and trade publishers in the world — Elsevier, Cengage, Hachette Book Group, Macmillan and McGraw Hill — joined by the lawyer and novelist Scott Turow, and they are suing Meta and, by name, Mark Zuckerberg. The allegation is specific and, if the records support it, severe: that Meta obtained millions of copyrighted books and journal articles by torrenting them from pirate repositories, used that corpus to train its Llama models, and stripped out the copyright-management information — the metadata that identifies a work and who owns it — to obscure where the text had come from.

Why this is not the first suit, and why it is the dangerous one

It is worth being precise about what has changed, because the headline — publishers sue Meta — has been true in one form or another for two years. The relevant precedent is Kadrey v. Meta, the case brought by a group of authors over the same training practice, and Meta largely prevailed there. The reason it prevailed is the hinge on which this new case turns. The court did not bless the copying. It found that the authors had offered no meaningful evidence that Llama's existence diluted the market for their particular books, and it noted, almost as an instruction to the next litigant, that a plaintiff arriving with a better-developed record on market effects might win.

The publishers appear to have read the instruction closely. Academic and reference publishing is the precise corner of the market where 'market dilution' stops being a theory you have to argue in the abstract and becomes a quantity you can measure. A novel competes with a model loosely, on attention and mood. A journal article competes with it head-on, because the value of the article is the specific information inside it. If a model will reproduce that information on request — in the article's structure, sometimes in its sentences — the licensing market the publisher was counting on does not gently erode. It disappears.

A novel competes with the model on attention. A journal article competes with it on the one thing it is selling: the information inside.

The model as a witness against itself

This is the point at which the technical detail stops being background and becomes the case. A model trained to predict the next token does not, as a rule, store its training data; it stores a compressed, lossy representation that lets it approximate the data's patterns. But 'as a rule' is carrying a great deal of weight. When a passage appears often enough in the training set, or is distinctive enough, or the model is simply large enough, the cheapest way for the optimizer to get that passage right is to memorize it — to encode it closely enough that the model can emit it verbatim. Researchers have a clean term for this, memorization, and a growing body of evidence that it is not a rare malfunction but a predictable function of scale: the larger and more capable the model, the more of its training data it can be induced to reproduce word for word.

For a copyright plaintiff, memorization is the difference between an argument and an exhibit. The complaint alleges that Llama can be made to produce full-length scientific papers and journal articles that substitute for the originals. If that is demonstrable in court — if a plaintiff can sit in front of the model and draw out a paper it was never licensed to hold — then the abstract dispute about whether training 'is' copying gives way to something judges already know how to handle. Here is our work. Here is the defendant's product reproducing it. Here is the market we can no longer sell into. The model becomes the most persuasive witness in the room, and it is testifying for the other side.
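What "reproducing it" means can be made concrete. The sketch below is one illustrative way to score verbatim overlap, not the method described in the complaint: it takes a reference passage and a piece of model output and reports the longest run of consecutive words the two share. The function names are hypothetical, and splitting on whitespace is an assumption standing in for a real tokenizer. A long shared run points toward recitation; a run of one or two words is what ordinary paraphrase looks like.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed stand-in for a real tokenizer)."""
    return text.lower().split()


def longest_shared_run(reference: str, output: str) -> int:
    """Return the length, in words, of the longest contiguous run present in both texts."""
    ref, out = tokenize(reference), tokenize(output)
    best = 0
    # prev[j] is the length of the shared run ending at ref[i-2] and out[j-1]
    prev = [0] * (len(out) + 1)
    for i in range(1, len(ref) + 1):
        curr = [0] * (len(out) + 1)
        for j in range(1, len(out) + 1):
            if ref[i - 1] == out[j - 1]:
                # Extend the run that ended one word earlier in both texts
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    recited = "he said the quick brown fox jumps and left"
    paraphrased = "a fast brown animal leaps"
    print(longest_shared_run(reference, recited))
    print(longest_shared_run(reference, paraphrased))
```

A score like this says nothing on its own about infringement. It only turns "word for word" into a number that two sides can argue over, which is roughly what an exhibit has to do.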

Stripping the labels

The second allegation is quieter and, in some ways, more revealing. The plaintiffs claim Meta removed copyright-management information from the works — the rights notices, the identifiers, the metadata that says who made a thing and who owns it — before feeding them to the model. Under U.S. law that is its own violation, distinct from the copying, and it speaks to intent in a way the copying alone does not. You do not strip the labels off something you believe you are plainly entitled to use. The removal of provenance is, in effect, a concession that provenance mattered — that someone understood these were not free texts and took a step to make that harder to see.

The market that now exists

What lifts this above one company's problem is that a licensing market for training text demonstrably exists now. Other AI developers have signed deals with publishers, wire services and image libraries; the going rate for high-quality text is no longer zero. Once such a market exists, the fair-use calculation shifts, because one of the four factors a court weighs is precisely the effect of the use on the market for the original. And 'they could have licensed this and chose to torrent it instead' is not a sentence any defendant wants read slowly to a jury. The four-factor test was built for exactly this kind of weighing:

The purpose and character of the use — is it transformative, or a substitute?
The nature of the copyrighted work — factual reference material is treated differently from fiction.
The amount taken — and a model that can reproduce a whole paper has, functionally, taken all of it.
The effect on the market — the factor the publishers have built their case to win.

The exposure is sharpest for an open-weight model like Llama, whose trained parameters are released for anyone to download. A closed model can be patched, rate-limited and wrapped in filters that make memorized text harder to surface; the company keeps its hand on the dial. Weights that have already been downloaded a million times cannot be recalled. Whatever the model memorized is now distributed across the world's hard drives, beyond the reach of any injunction a court might write.

And there is no clean remedy inside the model either. You cannot un-train a system the way you pulp a print run. The information is smeared across the weights, and the techniques for removing a specific work after the fact are partial and expensive, and their effect tends to give way under determined prompting. Which means the question the case is really asking is not whether Meta will pay (it can) but what a court is supposed to do with a product that has already absorbed the thing it was not permitted to take, in a form no one can fully extract.

The honest version of this dispute is not about whether machines should be allowed to learn from books. Nearly everyone, including most of the publishers, expects that they will, under some arrangement, for some price. It is about whether 'learning' is the right word for what a model does when it can recite — and whether an industry that built its first products on text it did not pay for can be made, in retrospect, to pay. The May filing will not settle that. But it is the first major case constructed to be tried on the model's own output, and the model, it turns out, makes a poor witness for the defense. Ask it the right question and it will read you the book.
