Methodology

Built with Claude — and what we add.

The Dark Factory uses Anthropic's Claude models as the language-model worker inside our software production pipeline. We chose Claude because, straightforwardly, they are the best at the hardest part of what we do: generating formally-verifiable code that gnatprove can discharge. We have tested the alternatives. None are close yet.

The numbers on this page are taken directly from our own measured runs. We publish them as the Organised Working Index (OWI) — the fraction of proof obligations a given pipeline can discharge against a fixed set of safety-critical conversion targets. OWI is re-measured monthly and on every new frontier-model release.

How we use it

Claude is the engine. The discipline around it is what we build.

Claude sits inside our pipeline; we do not expose it directly, we build around it. The production line takes a natural-language specification or a legacy Fortran source, puts it through a series of structured, machine-checkable stages with proof-discharge gates between them, and emits SPARK-Ada code whose safety properties are mathematically proven, not just plausible.

The gap

What we add over calling Claude directly.

A frontier model on its own — invoked from a shell with a single prompt — can translate Fortran into Ada that mostly compiles. That is a remarkable capability and we do not dismiss it. For demos and prototyping it is enough.

But for the work our customers need — safety-critical software with audit trail and provable correctness — the gap between one-shot model and our production line is real, recently measured, and worth being precise about. The table below shows our Phase 1 foundational numeric ladder — ten reference numeric kernels (BLAS L1, L2, L3 plus LAPACK LU, Cholesky and QR factorisations) from Netlib. These are the kernels that any safety-critical edge inference stack (automotive ASIL-D, aerospace DO-178C, defence-grade edge compute) ultimately sits on top of.

Measurement	Claude alone, one-shot	The Dark Factory pipeline
Compile rate, ten reference routines	5 of 6 compiled	10 of 10 complete (spec + body)
Proof obligations auto-discharged at `gnatprove --level=1` under strong postconditions	66% (289 / 436)	90.9% (3,377 / 3,715)
Lowest per-routine discharge in the set	—	82.4% (BLAS `dgemm`)
Highest per-routine discharge in the set	—	94.9% (BLAS `daxpy`)
Behaviour on complex multi-module systems	Limited by single-prompt scope	Decomposed into per-subsystem proofs
Audit trail / provenance for every emitted artefact	None	Per-stage artefacts traceable to the originating specification

The thesis the index carries is simple: the variable that moves the number is the discipline around the model, not the model alone. Same model, different OWI.

Public reference point

We gifted UK government a formally-verified HMPPS sentence-release calculator — live, all proof obligations discharged — produced by the factory pipeline. We invite anyone in the field to inspect it and judge for themselves.

Phase 2 in progress

Beyond the foundational kernels.

The ten routines above are the foundational numeric layer. We have also produced an initial Phase 2 set — the signal, control and rotation kernels that surround the perception network in any ASIL-D-shaped autonomous-systems deployment, regardless of the underlying edge-compute platform.

Routine	Source	Total	Unproved	Discharge
FFTPACK forward complex FFT (full bundle)	Netlib FFTPACK (public domain)	661	48	92.7%
Linear Kalman filter †	Welch-Bishop textbook	323	18	94.4%
PID controller (positional form) †	Åström-Murray textbook	242	17	93.0%
Quaternion algebra + rotation †	Kuipers textbook	517	76	85.3%
Phase 2 cfftf (strong-postcondition)		661	48	92.7%

† Kalman, PID and Quaternion are the April 2026 baseline numbers, measured under weaker postconditions than the routines above. They sit on the queue for rerun under the strong-postcondition regime. When that work lands we will publish whatever number comes out, lower or higher; in the meantime the asterisk flags that the three are not yet on the same footing as Phase 1 and the FFT.

Combined Phase 1 + cfftf under strong postconditions — eleven routines, spec plus body, no manual intervention — 4,424 obligations, 436 unproved, 90.1% aggregate.

The dgemm result (82.4%) is the lowest in the set, and the remaining gap across the eleven is bounded engineering — the kind of closure work the rest of the page discusses.

What these numbers measure

Methodology — and why our score went down.

A discharge percentage answers one specific question: of the proof obligations the verifier generated, what fraction were proved automatically? The number depends on the question. A weak question is easily answered; a strong question is not. Same headline percentage; very different epistemic claim.

The numbers above come from a stricter question than we asked in April. The body cannot now discharge by accident. This is the version of the metric we believe.

Our earlier publication on this page (April 2026) cited 98.5% and 96.2% discharge on the same routines under weaker postconditions. Those numbers were technically accurate and substantively misleading — they measured body-vs-spec consistency under low semantic load, not functional correctness. We have replaced them with the strong-postcondition figures above and written about the change in detail in our field-notes post What 96 per cent means when the spec actually says something.

What this means

In practice.

If the work is “demonstrate that a tool can produce SPARK from Fortran,” calling Claude directly is enough. We have nothing to say against that.
If the work is “ship verified software a safety reviewer can sign off — with a reproducible audit trail showing how it was produced,” that is the factory's job, and the gap is what we sell.

When a new frontier model shipped this week, we tested exactly that gap — the field note is here: A new model arrived. Our promise didn’t change.

How to verify these numbers

The discharge percentages are reproducible. Re-run them yourself.

Every routine on /results/ ships with three pieces: the original Fortran (linked), the emitted Ada spec, and the emitted Ada body. Each routine now also serves its filtered gnatprove.out — the prover's per-obligation analysis — alongside the spec and body.

The verifier kit packs all of that into a single Makefile + tarball. Two commands, after installing gnatprove:

tar xzf dark-factory-verifier.tar.gz
cd dark-factory-verifier && make all

The kit's make all target runs gnatprove --level=1 on every routine in the kit and writes a summary table to actual-results.txt. Compare against the expected-results.txt shipped in the kit: diff expected-results.txt actual-results.txt. Modulo prover wall-time, the two files should be identical.

What's in the kit and what isn't is plain. In: the Ada artefacts the factory emitted, the build files needed to re-run the prover, and the original Fortran for context. Out: the factory itself — the production pipeline and the methods inside it. The factory is what we sell. The output is what you can verify.

If your actual-results.txt doesn't match the expected file, please email tony.gair@thedarkfactory.co.uk with the output. The publication contract works only if the re-runs match.

The closure path

From 90.1% to 100% — honest about what's left.

A close reader will notice the OWI baseline is at --level=1 discharge. The published aerospace SPARK case studies typically sit in the 95–99% band at --level=2 with formal-methods engineers manually closing the residuals over weeks or months per module via the standard closure techniques.

We have deliberately not yet done that manual closure work. The OWI is a baseline — what AI-emission plus production discipline produces on day one, no human in the loop. The remaining 1–5% per routine is the part of the engineering investment that takes formal-methods specialists, and that work is industrial in shape rather than something one small company can credibly push on its own.

Closure typically needs three roles in the room: the emission company (our role), a proof-engineering team carrying deep gnatprove / --level=4 and manual-closure expertise, and a customer-side certification driver whose deployment safety case makes the closure work commercially load-bearing.

We do not yet have either of those collaborators. Our only formal relationship today is the downstream-customer one with Anthropic described below. We have views on the obvious candidates and we are actively seeking the conversations that would let us scope the work properly — but until those conversations are real we will not name them here.

Relationship

A note on Anthropic.

The Dark Factory is an independent company; our relationship with Anthropic is that of a customer running their models inside our stack. We have applied to NVIDIA Inception in parallel, and we intend to support running the factory against locally-hosted models where the customer requires air-gapped operation. Claude is, for now, the best engine available; the production discipline around it is what we build.

Last measured 2026-05-23. Methodology and current canary numbers are re-run monthly and on every new frontier-model release. Page drafted with AI assistance and reviewed by The Dark Factory Ltd.