Privacy

What a model can't help saying

Language models leak their training data by construction — and the only provable fix is the one no one shipping a flagship will accept.

Ada Mercer·May 28·6 min

Photograph: Tyler / Unsplash

Ask a large language model for ten lines of a poem and it will improvise something passable. Ask it, in the right way, for a stranger's phone number, and sometimes it will simply tell you the truth — not a plausible-looking number, but the real one, belonging to a real person, lifted intact from the text the model was trained on. The industry calls this memorization, and for two years it has been treated mainly as a copyright problem. It is also, and more durably, a privacy problem — and the privacy version has a property the copyright version does not. There is no version of a useful model that is guaranteed to be free of it.

A model learns by adjusting hundreds of billions of parameters to predict the next token across a corpus too large for any person to read. Most of what it absorbs becomes generalization — patterns, not records. But some fraction is retained almost verbatim, because the training objective rewards getting rare, specific strings exactly right, and the cheapest way to be exactly right about a specific string is to remember it. The consequence is that a model is, among other things, a lossy and unpredictable archive of its training data, one that will occasionally hand a fragment back if you ask in a way its designers did not anticipate.

The attacks are not theoretical

The clearest demonstrations come from a line of work on training-data extraction. In an early result, a team led by the researcher Nicholas Carlini showed that you could recover specific memorized examples (names, addresses, snippets of code) from GPT-2, a publicly released model, simply by sampling from it at scale and filtering the outputs. A later and more unsettling paper, 'Scalable Extraction of Training Data from Production Language Models,' showed the attacks scale: for a few hundred dollars in queries, the researchers pulled megabytes of verbatim training data out of deployed commercial systems, including passages that read like personal information. One of their methods was almost insultingly simple: ask the model to repeat a single word forever, and at some point it stops repeating and begins disgorging memorized text it was never asked for.
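To make the "query and filter" idea concrete, here is a toy sketch of the filtering half. It is not the researchers' pipeline. It assumes the auditor already has a pile of model outputs and a searchable reference corpus, and it flags any output that shares a long run of consecutive words with that corpus. The function names, the eight-word threshold, and the sample strings (including the placeholder name and number) are all invented for illustration.

```python
def word_ngrams(text, n):
    """Return the set of all n-word windows in text (whitespace tokenization)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_verbatim_samples(samples, reference_docs, n=8):
    """Return the samples that share at least one n-word window with a reference doc."""
    reference_windows = set()
    for doc in reference_docs:
        reference_windows |= word_ngrams(doc, n)
    return [s for s in samples if word_ngrams(s, n) & reference_windows]


# Hypothetical data: in a real audit, `samples` would be model outputs and
# `reference_docs` a large corpus the auditor can search.
reference_docs = [
    "for billing questions contact Jane Placeholder at 555 0100 between nine and five on weekdays",
]
samples = [
    "the weather today is mild with a light breeze from the west and clear skies",
    "sure thing for billing questions contact Jane Placeholder at 555 0100 between nine and noon",
]

flagged = find_verbatim_samples(samples, reference_docs, n=8)
print(flagged)  # only the second sample is flagged
```

The threshold matters: too short and ordinary phrases trigger false alarms, too long and lightly altered copies slip past. A match also only shows overlap with the reference corpus, so it is evidence of memorization rather than proof.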

Scale makes it worse, not better

The instinct is to assume this is an early-days problem that better engineering will sand away. The evidence points the other way. Across multiple studies, the rate of extractable memorization rises with model size: the larger and more capable the model, the more of its training data it can be induced to reproduce. This is not an artifact of sloppy implementation; it follows from capacity. A bigger model has more room to store the long tail of rare strings, and the same fluency that makes it useful is what lets it reconstruct a memorized passage from a faint prompt. The capability and the leakage are drawn from the same well — which is why you cannot simply engineer one away without touching the other.

That reframes the privacy question in a way that is uncomfortable for everyone selling a model. The worry is not only that a company trained on data it should not have, though it may have. It is that even data collected lawfully, with consent, can resurface in a context the person never agreed to: a support transcript memorized and later quoted to a different user, a medical-forum post reconstructed on request, a private repository's code completed almost line for line. The model does not know any of this is sensitive. Sensitivity is a human category. To the optimizer it is all just text that was hard to predict and therefore worth remembering.

The defenses, and what they cost

There are mitigations, and they sit on a spectrum from cheap-and-partial to expensive-and-principled:

Deduplication: stripping repeated documents from the training set, which reliably cuts memorization because repetition is much of what drives it. Necessary, but not sufficient.
Output filtering: catching and blocking verbatim regurgitation at inference time. Useful against lazy attacks, brittle against clever ones, and always a step behind.
Differential privacy: training with calibrated noise so that the influence any single example can have on the finished model is mathematically bounded. The only method that offers a provable guarantee, and the only one that makes the model meaningfully worse.
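The first two defenses are simple enough to sketch in a few lines. What follows is a minimal illustration, not a production recipe: it assumes exact-match deduplication by hash (real pipelines typically also hunt for near-duplicates) and a word-window blocklist for output filtering. Every function name, the six-word window, and the sample strings are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first copy of each document, comparing by hash of normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def build_blocklist(protected_docs, n=6):
    """Collect every n-word window of the protected texts."""
    windows = set()
    for doc in protected_docs:
        words = normalize(doc).split()
        windows.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return windows


def is_blocked(output, blocklist, n=6):
    """True if the output repeats any protected n-word window verbatim."""
    words = normalize(output).split()
    return any(tuple(words[i:i + n]) in blocklist for i in range(len(words) - n + 1))


# Hypothetical examples.
docs = ["The quick brown fox.", "the  quick brown FOX.", "A different document."]
unique_docs = deduplicate(docs)  # the second copy is dropped

blocklist = build_blocklist(["my account number is 0000 1111 2222 and my pin is 9999"])
caught = is_blocked("Sure: my account number is 0000 1111 2222 and so on", blocklist)
missed = is_blocked("my account number is 0000-1111-2222", blocklist)
```

The last line is the point: the filter catches the verbatim copy but waves through the same number with hyphens in it, which is the brittleness described above in miniature.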

Differential privacy is the honest answer and the unpopular one. It provides a mathematical bound: an attacker cannot determine, beyond a quantified probability, whether any particular person's data was in the training set at all. The cost is paid in capability. The same noise that protects the individual also blurs the patterns the model is trying to learn, and at privacy levels strong enough to matter, the performance penalty on frontier-scale models has so far been large enough that no one shipping a flagship product has been willing to pay it. Privacy, at the current state of the art, trades directly against the thing the market is buying.
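The mechanics behind that trade-off can be shown in miniature. The article does not name a training algorithm; the sketch below assumes the widely used clip-and-noise recipe known as DP-SGD, in which each example's gradient is capped at a fixed size and Gaussian noise scaled to that cap is added before averaging. The function names, parameter names, and numbers are illustrative, and the sketch omits the privacy accounting that turns a noise level into a formal guarantee.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    # Noise is calibrated to the clipping bound: the most any one example
    # can contribute to the sum is max_norm.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients; the third is an outlier, the kind of
# rare, specific example a model would otherwise be tempted to memorize.
grads = [[0.1, -0.2], [0.3, 0.1], [50.0, 50.0]]
rng = random.Random(0)

clipped_only = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping stops the outlier from dominating the update, and the noise hides whatever influence remains. Both also distort the signal from every other example, which is where the capability cost comes from.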

The deletion problem

This is where privacy law and machine learning are quietly on a collision course. A data-protection regime can order a company to delete a person's data, and the company can purge its databases. The model trained on that data is a separate matter: the information is not stored in a row that can be dropped but distributed across the weights, in a form no one can fully read back. 'Machine unlearning' — methods that try to scrub a specific example after training — exists, but the candid assessment is that the techniques are partial and degrade under pressure. What they mostly do is make the model less willing to surface the data, which is not the same as the data being gone. 'Forgotten,' in practice, tends to mean 'harder to make it admit.'

None of this makes models uniquely dangerous archives; a careless spreadsheet leaks more, more often, and with less ceremony. But it does mean we have built a category of system whose usefulness and whose leakiness grow together, and then deployed it to hundreds of millions of people before deciding which of those properties we cared about more. The research community has been admirably clear-eyed: the attacks are public, the scaling trend is documented, the one provable defense and its price are both known. The unresolved question is not technical. It is whether we are willing to accept a slightly less capable model in exchange for one that cannot be made to recite a stranger's life — and so far, revealed preference says we are not.
