
Can you even test a model before you ship it?
Washington wants frontier models on the bench 90 days before launch. Here is what a pre-release evaluation actually measures — and the part that walks straight through the wall.

Priya leads tek54's data-driven reporting and is deeply skeptical of any number presented without its error bars. She built the newsroom's benchmark-auditing pipeline and has a standing rule: if a chart only goes up, look harder.

A new open-weights model tops the chart every few weeks. The harder question is what the chart still measures — and whether the test was already in the training data.