Field notes · 24 May 2026

What 96 per cent means when the spec actually says something

Our score went down and we could not be prouder.

Read this as: Plain English Formal-methods reader

Last week the factory built a small piece of mathematical code — a multiplication routine of the kind that runs millions of times a second inside any modern AI chip — and an automated proof checker confirmed it correct at 96 per cent. That is a worse score than the 98.5 per cent we published in April. It is also the first one we believe.

Here is what the older 98.5 per cent measured. The factory writes two things for every routine: a description of what the code is supposed to do, and the code itself. A proof checker reads both and tries to verify that the code keeps the description. If both pass the check together, we count it as proven.

In April, when we put 98.5 per cent on the front page, what we did not say — because we did not yet realise it ourselves — was that the descriptions we were writing were too loose. A loose description says "if the input is empty, return zero". A tight description says "return the actual mathematical product of the inputs, element by element, under these specific arithmetic rules". You can satisfy a loose description by writing code that does almost nothing. The proof checker, doing its job exactly as advertised, signs off.

We discovered this by opening one of our April routines, a standard mathematical building block called a dot product. The code returned zero every time, ignored every input parameter, and the proof checker said it was correct against the description we had written. The two artefacts agreed with each other. They had almost nothing to do with the operation the routine was named after.

This is the unglamorous heart of formal verification — the entire enterprise hinges on the question of what the description actually says. If the description is the empty statement, the proof of the code keeping the description is the empty proof. Many published verification numbers have this quality. The numbers are correct. They simply measure less than the headline implies.

So we rewrote the rules. The factory now holds itself to a higher standard — a description has to constrain what the code actually computes, not just declare some safe metadata about it. The 96 per cent we shipped last week is the first time a routine has come through the tightened gates, and the score is lower because the test is harder.

What this isn't: most of our older work has not been re-checked under the new rules yet. The 98.5 per cent figure should be expected to drop, possibly substantially, when we run those routines through again. We will publish whatever number comes out. The 24 of 559 things the checker did not sign off on this time are themselves work that is not finished — they concern a specific kind of arithmetic overflow on very large matrices, which we have not yet engineered around. The factory is not magic. What we have changed is the rule for what we are willing to put on a front page.

A verified system is only as verified as the statement the checker actually checked. We have one number we trust to mean what it says, and a queue of older numbers we owe ourselves a re-run on. The interesting question is not whether the next headline will be higher or lower than 96. It is whether we will keep being clear about what the headline measures.

Last week we shipped a kernel the prover discharged at 96 per cent. 535 of 559 obligations closed under gnatprove at level 1, no manual intervention. It is a smaller number than the 98.5 per cent we put on the front page in April. It is also the first one we are confident actually means something.

The kernel is an int8 × int8 → int32 matmul, the textbook Jacob et al. shape — symmetric per-tensor quantisation. The body does the actual computation; it is not a stub.

This sounds like a low bar. It is a low bar. The reason it took us until last week to clear it is that the spec is the load-bearing piece, and we had been writing the spec wrong.

Earlier this month our Phase 1 BLAS conversions discharged at 98.5 per cent. Today I went back and opened the body for ddot. It compiles. It discharges. It does not, in fact, compute a dot product. The body and the spec agreed with each other; they did not agree with the mathematical operation the routine was named after.

That is what 98.5 per cent measured. It measured whether the bodies we generated were consistent with the specs we generated. The two artefacts agreed. They had almost nothing to do with what either purported to compute.

The reframe is uncomfortable. Discharge percentage is not a single quantity. The number depends on the question. A weak question is easily answered; a strong question is not. Same headline percentage. Very different epistemic claim. The headline number on the front page was technically accurate and substantively misleading, in the way that most published verification numbers are technically accurate and substantively misleading.

So we changed the question. The int8 matmul run was the first kernel through the new test.

What this isn't: one kernel. Phase 1 and Phase 2 have not been re-run under the stricter test yet. The honest Phase 1 discharge number will be lower than 98.5 per cent, possibly by a lot, and we will publish whatever it turns out to be. The 24 obligations the prover did not discharge on int8 matmul are bounded engineering work, not unsoundness in the result. The pipeline is not magic; most of the cost of getting the first 96 per cent was discovering that the earlier number wasn't asking enough.

A verified system is only as verified as the proposition the prover discharged. If the proposition is the empty proposition, the discharge is empty. We have one number now that we trust to mean what it says, and a queue of older numbers we owe ourselves a re-run on. The interesting question is not whether the next headline will be higher or lower than 96; it is whether we will keep being clear about what the headline measures.