Methodology

Built with Claude — and what we add.

The Dark Factory uses Anthropic's Claude models as the language-model worker inside our software production pipeline. We chose Claude because, straightforwardly, they are the best at the hardest part of what we do: generating formally-verifiable code that gnatprove can discharge. We have tested the alternatives. None are close yet.

The numbers on this page are taken directly from our own measured runs. We publish them as the Organised Working Index (OWI) — the fraction of proof obligations a given pipeline can discharge against a fixed set of safety-critical conversion targets. OWI is re-measured monthly and on every new frontier-model release.

How we use it

Claude is the engine. The discipline around it is what we build.

Claude sits inside our pipeline; we do not expose it directly, we build around it. The production line takes a natural-language specification or a legacy Fortran source, puts it through a series of structured, machine-checkable stages with proof-discharge gates between them, and emits SPARK-Ada code whose safety properties are mathematically proven, not just plausible.

The gap

What we add over calling Claude directly.

A frontier model on its own — invoked from a shell with a single prompt — can translate Fortran into Ada that mostly compiles. That is a remarkable capability and we do not dismiss it. For demos and prototyping it is enough.

But for the work our customers need — safety-critical software with audit trail and provable correctness — the gap between one-shot model and our production line is real, recently measured, and worth being precise about. The table below shows our Phase 1 foundational numeric ladder — ten reference numeric kernels (BLAS L1, L2, L3 plus LAPACK LU, Cholesky and QR factorisations) from Netlib upstreams. These are the kernels that any safety-critical edge inference stack (automotive ASIL-D, aerospace DO-178C, defence-grade edge compute) ultimately sits on top of.

Measurement Claude alone, one-shot The Dark Factory pipeline
Compile rate, ten reference routines 5 of 6 compiled 10 of 10 complete (spec + body)
Proof obligations auto-discharged at gnatprove --level=1 66% (289 / 436) 98.5% (3,578 / 3,631)
Lowest per-routine discharge in the set 96.1% (BLAS daxpy)
Highest per-routine discharge in the set 99.3% (LAPACK LU dgetrf)
Behaviour on complex multi-module systems Limited by single-prompt scope Decomposed into per-subsystem proofs
Audit trail / provenance for every emitted artefact None Per-stage artefacts traceable to the originating specification

The thesis the index carries is simple: the variable that moves the number is the discipline around the model, not the model alone. Same model, different OWI.

Public reference point
We gifted UK government a formally-verified HMPPS sentence-release calculator — live, all proof obligations discharged — produced by the factory pipeline. We invite anyone in the field to inspect it and judge for themselves.
Phase 2 in progress

Beyond the foundational kernels.

The ten routines above are the foundational numeric layer. We have also produced an initial Phase 2 set — the signal, control and rotation kernels that surround the perception network in any ASIL-D-shaped autonomous-systems deployment, regardless of the underlying edge-compute platform.

Routine Source Total Unproved Discharge
FFTPACK forward complex FFT (full bundle) Netlib FFTPACK (public domain) 624 24 96.2%
Linear Kalman filter Welch-Bishop textbook 323 18 94.4%
PID controller (positional form) Åström-Murray textbook 242 17 93.0%
Quaternion algebra + rotation Kuipers textbook 517 76 85.3%
Phase 2 aggregate 1,706 135 92.1%

Combined Phase 1 and Phase 2 — fourteen routines, spec plus body, no manual intervention — 5,337 obligations, 188 unproved, 96.5% aggregate.

The Quaternion result (85.3%) is the lowest in either set — chains of multiplications, an in-place normalise, and a control-flow split on the zero-length guard. That gap is honestly the shape of what closure work looks like at the routine level. We discuss the route to closing that gap below.

What this means

In practice.

The closure path

From 98.5% to 100% — honest about what's left.

A close reader will notice the OWI baseline is at --level=1 discharge. The published aerospace SPARK case studies typically sit in the 95–99% band at --level=2 with formal-methods engineers manually closing the residuals over weeks or months per module via lemmas, ghost code, refinements and Loop_Invariant tightening.

We have deliberately not yet done that manual closure work. The OWI is a baseline — what AI-emission plus production discipline produces on day one, no human in the loop. The remaining 1–5% per routine is the part of the engineering investment that takes formal-methods specialists, and that work is industrial in shape rather than something one small company can credibly push on its own.

Closure typically needs three roles in the room: the emission company (our role), a proof-engineering team carrying deep gnatprove / --level=4 and manual-closure expertise, and a customer-side certification driver whose deployment safety case makes the closure work commercially load-bearing.

We do not yet have either of those collaborators. Our only formal relationship today is the downstream-customer one with Anthropic described below. We have views on the obvious candidates and we are actively seeking the conversations that would let us scope the work properly — but until those conversations are real we will not name them here.

Relationship

A note on Anthropic.

The Dark Factory is an independent company; our relationship with Anthropic is that of a customer running their models inside our stack. We have applied to NVIDIA Inception in parallel, and we intend to support running the factory against locally-hosted models where the customer requires air-gapped operation. Claude is, for now, the best engine available; the production discipline around it is what we build.

Last measured 2026-05-23. Methodology and current canary numbers are re-run monthly and on every new frontier-model release. Page drafted with AI assistance and reviewed by The Dark Factory Ltd.