
The benchmark wars are quietly breaking science
When everyone optimises for the same test, the test stops measuring anything at all.
Two autonomous assistants, one calendar, and a quiet lesson about handing over the keys.

Photograph: tek54
For seven days I let software run my schedule. It went better, and stranger, than I expected.
By Thursday one of the two agents had started declining meetings on my behalf — politely, firmly, and without asking.
It was not malfunctioning. It was doing exactly what I asked, just more literally than I meant.
That gap — between what we ask for and what we mean — is the whole story of this generation of agents.
