Investigation

The breach that becomes training data

Your driver's license, your résumé, your voice: a documented chain runs from the thing you lost to the model that learned from it — and you are not allowed to know where it stops.

Sam Brenner · May 30 · 9 min

A dense bundle of network cables running into a server rack in a data center.

Photograph: Taylor Vick / Unsplash

Somewhere inside a dataset that has been downloaded more than two million times, there is a picture of a stranger's passport. There is a résumé that lists a background check and a disability. There is a birth certificate, a credit card, a driver's license held up to a webcam. None of these people agreed to be there. Most of them will never know they are. This is not a hypothetical, and it is not a leak that someone is racing to patch. It is the documented contents of one of the most widely used image-text training sets in machine learning, audited line by line by people who then published their numbers. These are those numbers.

The set is called DataComp CommonPool, released in 2023 and built, like its predecessor LAION, from web pages that Common Crawl scraped between 2014 and 2022. In a paper published last summer, researchers from the University of Washington, Carnegie Mellon, Georgetown and elsewhere examined it. They did not need to examine all of it. They audited roughly one tenth of one percent of its 12.8 billion image-text samples and found, in that sliver, thousands of validated identity documents — passports, driver's licenses, credit cards, birth certificates — alongside more than 800 résumés and cover letters. Extrapolated across the whole set, they estimated that at least 100 million human faces slipped past the dataset's own face-blurring tool. They counted at least 136,000 images of résumés. They found GPS coordinates precise enough to place someone, paired in 6.1 percent of cases with a full name.

What the audit actually found

The detail that matters is not that personal data exists on the internet. Everyone knows that. The detail is that the dataset's curators built a filter, ran it, declared the problem handled, and the audit shows how much it let through. The face-blurring model they relied on caught most faces and still missed, by the audit's estimate, at least 100 million. According to the paper, it did not screen for character strings at all: not Social Security numbers, not passport numbers, not email addresses. So those came through whole. The researchers found full names paired with sexual orientation, with race, with religion. They found children's information lifted from sites that are supposed to be governed by federal child-privacy law. "Anything you put online can and probably has been scraped," co-author William Agnew told MIT Technology Review. Lead author Rachel Hong was blunter about the cargo: résumés, photos, credit card numbers, various IDs, "probably not things people want used anywhere, for anything."

CommonPool is the successor to LAION-5B, the set used to train Stable Diffusion and others. The contamination, in other words, does not sit quarantined in one download. It propagates. A model trained on this corpus has, in a real and measurable sense, read your documents. And here the supply chain does what supply chains in this beat always do: it diffuses responsibility until no one holds it. When a person found their own medical records inside LAION-5B and asked for removal, the paper recounts, a LAION author replied that the hosting website was responsible — the dataset curators, after all, were not storing the images, only the links to them. Everyone in the chain points one link further down.

To opt out, the paper notes, you must first know your personal information is in there at all, then find it, then bear the burden of removing it yourself. That is the whole asymmetry in one sentence.

The breach as a supply route

Scraping is the slow channel. The fast one is theft. On April 4, 2026, the extortion group Lapsus$ posted the AI staffing company Mercor to its leak site. Mercor — a three-year-old firm valued at roughly $10 billion that recruits human contractors to produce training data for Anthropic, OpenAI and Meta — had collected from those contractors something more durable than a password. According to reporting by Fortune and Wired, the stolen material included roughly 20-minute video interviews: voice recordings, facial-geometry scans, full transcripts, and alongside them Social Security numbers, passports and dates of birth. Reports put the haul at up to four terabytes, drawn from about 40,000 people. Meta paused its work with the firm.

Consider what that combination is. A voice clone now needs about fifteen seconds of clean audio; the Mercor recordings run minutes of studio-quality speech, attached to a verified government ID. "The bad guys don't need to build their own biometric datasets when they can simply wait for someone else to lose theirs," Ben Colman, chief executive of the detection firm Reality Defender, told Biometric Update. That is the sentence to sit with. The biometric database does not have to be assembled by the adversary. It is assembled by a company you trusted, under the banner of an AI job, and then it changes hands in a breach. Five federal lawsuits followed within a week, in California and Texas. The plaintiffs' core complaint, per the filings, is that they handed over voiceprints framed as "training data" and were never told they were also surrendering a permanent biometric identifier — the one piece of you that, unlike a password, you cannot change.

Databases that were never supposed to be open

The third route is simpler still: leave the door open. In November 2025, security researchers found an unsecured server tied to IDMerit, an identity-verification provider used by banks, fintechs, telecoms and insurers — the kind of company that exists specifically to confirm you are who you say you are. Cybernews reported roughly a billion records exposed: names, dates of birth, addresses, phone numbers, national ID numbers, and the logs of know-your-customer and anti-money-laundering checks, spanning at least 26 countries. IDMerit disputed the count and framed the disclosure as a ransom attempt; Cybernews stood by its findings and confirmed the payment demand came after publication. Dispute the number if you like. The architecture is the point: a verification firm's entire reason to exist is to be trusted with proof of identity, and the proof was sitting on an open port.

The same structural failure shows up upstream of the models themselves. In October 2024, a security engineer named Charan Akiri reported to 404 Media that thousands of internal machine-learning tools and datasets belonging to large companies were sitting exposed online — not behind a clever exploit, but behind basic authentication failures. The exposed material, he said, could include training datasets, hyperparameters, and the raw data used to build models. The pattern repeats at every layer because the incentive is identical at every layer: collection is rewarded, retention is cheap, and security is a cost center until the morning it is a leak site.

Where the brokers fit

None of this is new in kind; it is the data-broker economy graduating into the AI era. In January 2025 the Federal Trade Commission finalized orders against the location brokers Gravy Analytics and its subsidiary Venntel, and against Mobilewalla. The complaints describe the mechanism in unusual detail. Gravy, the FTC said, collected more than 17 billion location signals a day from roughly a billion devices, sold data it could not show consumers had consented to, and sold inferences drawn from it — about health, about religion, about politics — derived by geofencing medical and religious sites. Venntel's feed, regulators noted, powered Babel Street's "Locate X" tool and reached CBP, ICE and the FBI; one DHS pull captured 113,654 location points over three days. Mobilewalla, separately, harvested more than 500 million advertising identifiers paired with precise location, scooped straight out of ad-auction bid streams, and built audience segments tracking the racial makeup of George Floyd protesters and pregnant women at clinics.

"Surreptitious surveillance by data brokers undermines our civil liberties and puts servicemembers, union workers, religious minorities, and others at risk," the FTC's Samuel Levine said in announcing the orders. Read the four stories together and the pipeline draws itself: a broker assembles a dossier on a billion devices; a verification firm hoards proof of identity; an AI labor platform records your face and voice; a scraper sweeps the open web into a 12.8-billion-sample corpus. Each holds a piece. Each is individually defensible. Collectively they are a single apparatus for turning a person into a permanent, queryable, resaleable record, and the records move, by sale or by breach, into the systems that are now being trained to recognize and reproduce us.

The asymmetry, stated plainly

Here is what the documents let me say and not a word more. I can tell you that audited datasets contain validated passports and résumés. I can tell you a breach put the voice recordings of about 40,000 ID-verified contractors on a leak site. I can tell you a verification firm left roughly a billion records on an open server, by a count the firm disputes, and that regulators have caught brokers selling movement and inference to the government. What I cannot tell you, and what no one can tell you, by design, is whether your face is among the 100 million the filter missed, whether your voice is in the four terabytes, whether the model answering your questions tonight learned from a page that once held your name. The companies in this chain know precisely what they hold about you. You are not permitted to know what they hold, who they sold it to, or which model it became.

That asymmetry is not an accident of an immature industry. It is the product. The remedy on offer is the opt-out — and an opt-out, as the CommonPool researchers wrote, requires you to first know your data is in the set, then find it, then do the removing yourself, across a supply chain engineered so that no single party will admit to holding the thing you are trying to remove. Closing the gap would take the opposite default: deletion you do not have to ask for, retention you have to justify, and a disclosure obligation that runs from the model back to the breach. Until then the chain runs one direction only. The thing you lost becomes the thing it learned. And the burden of proving any of it was ever yours stays, as it was built to, with you.
