Built with Claude — and what we add.
The Dark Factory uses Anthropic's Claude models as the language-model
worker inside our software production pipeline. We chose Claude because,
straightforwardly, they are the best at the hardest part of what we do:
generating formally-verifiable code that gnatprove can
discharge. We have tested the alternatives. None are close yet.
The numbers on this page are taken directly from our own measured runs. We publish them as the Organised Working Index (OWI) — the fraction of proof obligations a given pipeline can discharge against a fixed set of safety-critical conversion targets. OWI is re-measured monthly and on every new frontier-model release.
Claude is the engine. The discipline around it is what we build.
Claude sits inside our pipeline; we do not expose it directly, we build around it. The production line takes a natural-language specification or a legacy Fortran source, puts it through a series of structured, machine-checkable stages with proof-discharge gates between them, and emits SPARK-Ada code whose safety properties are mathematically proven, not just plausible.
What we add over calling Claude directly.
A frontier model on its own — invoked from a shell with a single prompt — can translate Fortran into Ada that mostly compiles. That is a remarkable capability and we do not dismiss it. For demos and prototyping it is enough.
But for the work our customers need — safety-critical software with audit trail and provable correctness — the gap between one-shot model and our production line is real, recently measured, and worth being precise about. The table below shows our Phase 1 foundational numeric ladder — ten reference numeric kernels (BLAS L1, L2, L3 plus LAPACK LU, Cholesky and QR factorisations) from Netlib upstreams. These are the kernels that any safety-critical edge inference stack (automotive ASIL-D, aerospace DO-178C, defence-grade edge compute) ultimately sits on top of.
| Measurement | Claude alone, one-shot | The Dark Factory pipeline |
|---|---|---|
| Compile rate, ten reference routines | 5 of 6 compiled | 10 of 10 complete (spec + body) |
Proof obligations auto-discharged at gnatprove --level=1 |
66% (289 / 436) | 98.5% (3,578 / 3,631) |
| Lowest per-routine discharge in the set | — | 96.1% (BLAS daxpy) |
| Highest per-routine discharge in the set | — | 99.3% (LAPACK LU dgetrf) |
| Behaviour on complex multi-module systems | Limited by single-prompt scope | Decomposed into per-subsystem proofs |
| Audit trail / provenance for every emitted artefact | None | Per-stage artefacts traceable to the originating specification |
The thesis the index carries is simple: the variable that moves the number is the discipline around the model, not the model alone. Same model, different OWI.
Beyond the foundational kernels.
The ten routines above are the foundational numeric layer. We have also produced an initial Phase 2 set — the signal, control and rotation kernels that surround the perception network in any ASIL-D-shaped autonomous-systems deployment, regardless of the underlying edge-compute platform.
| Routine | Source | Total | Unproved | Discharge |
|---|---|---|---|---|
| FFTPACK forward complex FFT (full bundle) | Netlib FFTPACK (public domain) | 624 | 24 | 96.2% |
| Linear Kalman filter | Welch-Bishop textbook | 323 | 18 | 94.4% |
| PID controller (positional form) | Åström-Murray textbook | 242 | 17 | 93.0% |
| Quaternion algebra + rotation | Kuipers textbook | 517 | 76 | 85.3% |
| Phase 2 aggregate | 1,706 | 135 | 92.1% | |
Combined Phase 1 and Phase 2 — fourteen routines, spec plus body, no manual intervention — 5,337 obligations, 188 unproved, 96.5% aggregate.
The Quaternion result (85.3%) is the lowest in either set — chains of multiplications, an in-place normalise, and a control-flow split on the zero-length guard. That gap is honestly the shape of what closure work looks like at the routine level. We discuss the route to closing that gap below.
In practice.
- If the work is “demonstrate that a tool can produce SPARK from Fortran,” calling Claude directly is enough. We have nothing to say against that.
- If the work is “ship verified software a safety reviewer can sign off — with a reproducible audit trail showing how it was produced,” that is the factory's job, and the gap is what we sell.
From 98.5% to 100% — honest about what's left.
A close reader will notice the OWI baseline is at --level=1
discharge. The published aerospace SPARK case studies typically sit in
the 95–99% band at --level=2
with formal-methods engineers manually closing the residuals over
weeks or months per module via lemmas, ghost code, refinements
and Loop_Invariant tightening.
We have deliberately not yet done that manual closure work. The OWI is a baseline — what AI-emission plus production discipline produces on day one, no human in the loop. The remaining 1–5% per routine is the part of the engineering investment that takes formal-methods specialists, and that work is industrial in shape rather than something one small company can credibly push on its own.
Closure typically needs three roles in the room: the emission company
(our role), a proof-engineering team carrying deep
gnatprove / --level=4 and manual-closure
expertise, and a customer-side certification driver whose deployment
safety case makes the closure work commercially load-bearing.
We do not yet have either of those collaborators. Our only formal relationship today is the downstream-customer one with Anthropic described below. We have views on the obvious candidates and we are actively seeking the conversations that would let us scope the work properly — but until those conversations are real we will not name them here.
A note on Anthropic.
The Dark Factory is an independent company; our relationship with Anthropic is that of a customer running their models inside our stack. We have applied to NVIDIA Inception in parallel, and we intend to support running the factory against locally-hosted models where the customer requires air-gapped operation. Claude is, for now, the best engine available; the production discipline around it is what we build.
Last measured 2026-05-23. Methodology and current canary numbers are re-run monthly and on every new frontier-model release. Page drafted with AI assistance and reviewed by The Dark Factory Ltd.