Lacuna

falsification-first biological law discovery

github.com/jang1563/lacuna-falsification

For every scientist whose null results pushed the field forward
but never made it into a paper.

When biology becomes a search problem —
and AI is making it one —
the cost is paid in failed trials.

The map of where not to go
is the map worth building.

To trust that map: a gate that finds truth it was never told.

Opus proposes. Python gates.
Routine fires. What's next?

Claude Code Routine · API-triggered · live
→ MA · Session A · Proposer · Opus 4.7
→ local Python · PySR symbolic search
→ pre-registered Python 5-Test Gate · no LLM decides pass / fail
→ MA · Session B · Skeptic · Opus 4.7
→ MA · Session C · Interpreter · Opus 4.7

↓ 224 FAIL → rejection log committed
↑ 9 PASS → Skeptic review

3 context-isolated Opus sessions

Deterministic Python gate
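A minimal sketch of the FAIL / PASS routing described above, assuming each gate evaluation produces a candidate name, a pass flag, and the names of any failed tests. `GateResult` and `route` are hypothetical names for illustration, not the repository's API; the two example candidates and their outcomes are the ones quoted in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    """Hypothetical record for one deterministic gate evaluation."""
    candidate: str
    passed: bool
    failed_tests: list = field(default_factory=list)


def route(results):
    """Send FAILs to the rejection log and PASSes to the Skeptic queue.

    No model is consulted here: the branch depends only on the
    boolean the Python gate already computed.
    """
    rejection_log, skeptic_queue = [], []
    for r in results:
        if r.passed:
            skeptic_queue.append(r.candidate)
        else:
            rejection_log.append(
                {"candidate": r.candidate, "failed": list(r.failed_tests)}
            )
    return rejection_log, skeptic_queue


results = [
    GateResult("CA9 − AGXT", passed=False, failed_tests=["delta_baseline"]),
    GateResult("CDK1 − EPAS1", passed=True),
]
rejection_log, skeptic_queue = route(results)
```

Keeping the routing free of any LLM call is what lets the rejection log be committed as a deterministic artifact.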

New Routine per disease · same Skills · same classification-gate family · 4 diseases confirmed

FAIL CA9−AGXT Δ=0.015 · PASS CDK1−EPAS1 Δ=0.062 · 23/28 Stage CXCR4/EPAS1 · 15/22 Colon MSI · 2/25 LGG Grade II/III · AUROC 0.840 · static evidence in results/live_evidence/

Managed Agents Memory · rejection lessons accumulate across sessions · persists server-side

Session 1 · Kidney · one gene dominated
→ Session 2 · Kidney · compound collapsed
→ Session 3 · Lung · surfactant gene 99.8%
→ Session 4 ✓ · Prostate · predicted PSA dominance

Prostate cancer proposer cited the lung cancer single-gene saturation lesson — without seeing prostate data — and correctly predicted the dominant PSA-gene pattern. 8 lessons stored across 4 sessions.

Rejection

The gate refused.

The TCGA-KIRC gate rejected 194 of 203 candidate evaluations. The 9 survivors all came from the 45-gene metastasis_expanded sub-layer, after the loop diagnosed that the original 11-gene layer lacked the genes a passing law would need. CA9 alone saturates tumor-normal AUROC at 0.965, so CA9-dominated compounds add almost nothing over the single gene and are refused.
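Why a saturating single gene forces a refusal can be sketched as a baseline-delta check: a compound score only counts if its AUROC beats the best single-gene AUROC by a margin. This is a sketch under stated assumptions. The 0.05 margin is the figure quoted later in this deck; the strict inequality, the sign-invariant AUROC, and the function names are assumptions, and the toy expression values are made up.

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sign_invariant_auroc(scores, labels):
    """Assumption: direction is ignored, so 0.2 and 0.8 score the same."""
    a = auroc(scores, labels)
    return max(a, 1 - a)


def delta_baseline(compound, single_genes, labels):
    """Compound AUROC minus the best single-gene AUROC."""
    best_single = max(sign_invariant_auroc(v, labels) for v in single_genes.values())
    return sign_invariant_auroc(compound, labels) - best_single


def passes_delta(delta, threshold=0.05):
    """Assumed strict comparison against the margin quoted in the deck."""
    return delta > threshold


# Toy data (made up): CA9 alone separates tumor (1) from normal (0) perfectly.
labels = [1, 1, 1, 0, 0, 0]
genes = {"CA9": [5, 6, 7, 1, 2, 3], "AGXT": [1, 3, 2, 2, 1, 3]}
compound = [c - a for c, a in zip(genes["CA9"], genes["AGXT"])]
toy_delta = delta_baseline(compound, genes, labels)
```

When one gene already saturates the task, the compound has no headroom left, so the delta stays near zero and the check fails regardless of how plausible the formula looks.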

224

refused

9

repair pass

"Even the law Opus most wanted to keep — refused."

203 KIRC evaluations · 194 rejected · 9 survivors, all from the 45-gene sub-layer

Legend: rejected · survivor (gate PASS)

Survivor

Expand the panel.
Same gate.

45-gene HIF / Warburg / proliferation panel. Same pre-registered thresholds. Nine laws pass on metastasis. The simplest is a law that was already published — in 2010:

TOP2A − EPAS1

cell-division gene − oxygen-response gene

AUROC 0.726 · Δbaseline +0.069 · CI lower 0.665 · perm p < 0.001

When cell division outruns oxygen response — it predicts spread.
The published kidney cancer molecular subtype axis.
We did not plant it.
Known truth · unconstrained search · method validated

A gate that recovers what was already published can be trusted for what has not been published yet.
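Two of the reported metrics, the permutation p-value and the lower confidence bound, can be sketched with the standard library alone. This is an illustrative sketch, not the project's gate code: the one-sided test, the percentile bootstrap, the resample counts, and the function names are all assumptions, and the data are a perfectly separated toy example.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC, ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p(scores, labels, n_perm=1000, seed=0):
    """One-sided p: how often shuffled labels match or beat the observed AUROC."""
    rng = random.Random(seed)
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def bootstrap_ci_lower(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Lower bound of a percentile bootstrap interval for the AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int((alpha / 2) * n_boot)]


# Toy, perfectly separated scores (made up), 20 negatives then 20 positives.
data_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
scores = [data_rng.random() for _ in range(20)] + [2 + data_rng.random() for _ in range(20)]
p_value = permutation_p(scores, labels)
ci_low = bootstrap_ci_lower(scores, labels)
```

Both quantities are computed from the data and fixed seeds only, which is what makes a pre-registered threshold on them reproducible.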

Task landscape · 11-gene vs 45-gene panel

Task	Dominant gene (single-gene AUROC)	11-gene panel (PASS / tested)	45-gene panel (PASS / tested)	Best Δbaseline
Tumor vs Normal	CA9 = 0.965	0 / 26	—	+0.029
Stage I-II vs III-IV	CUBN = 0.610	0 / 27	—	+0.029
5-yr Survival	CUBN = 0.696	0 / 29	0 / 29	+0.019
Metastasis M0 vs M1	MKI67 = 0.645	0 / 30	9 / 30 ✓	+0.069

Same 5-test gate · same BH-FDR · same Python code · four biologically distinct tasks.
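For readers unfamiliar with BH-FDR, the Benjamini-Hochberg step-up procedure can be sketched as below. The level q = 0.05 and the function name are assumptions for illustration; the deck does not state which level the gate uses, and the p-values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if rejected at FDR level q.

    Step-up rule: find the largest rank k with p_(k) <= q * k / m,
    then reject every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Made-up p-values for eight candidate laws.
decisions = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
```

The step-up behaviour matters: in `[0.02, 0.03, 0.04]` the smallest p-value misses its own threshold, yet all three are rejected because a later rank clears its threshold.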

Near-equivalent formulas (within 2% accuracy) · 3 of 990 two-gene pairs · all: proliferation − oxygen-response gene

#1 / 990

TOP2A − EPAS1

0.7275

cell-division enzyme − oxygen-response factor

#2 / 990

CDK1 − EPAS1

0.7192

cell-cycle kinase − oxygen-response factor

#3 / 990

MKI67 − EPAS1

0.7100

proliferation marker − oxygen-response factor
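The figure 990 is the number of unordered gene pairs in a 45-gene panel, C(45, 2) = 990. A minimal sketch of an exhaustive two-gene difference scan follows, assuming a sign-invariant AUROC as the ranking score; the function names and the four-gene toy data (including placeholder genes G3 and G4) are hypothetical.

```python
from itertools import combinations
from math import comb


def auroc(scores, labels):
    """Rank-based AUROC, ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_pairs(expr, labels):
    """Score every unordered pair as a difference and sort best-first."""
    ranked = []
    for a, b in combinations(sorted(expr), 2):
        diff = [x - y for x, y in zip(expr[a], expr[b])]
        s = auroc(diff, labels)
        # Report the orientation whose AUROC is at least 0.5,
        # so a − b and b − a are counted once.
        name = f"{a} − {b}" if s >= 0.5 else f"{b} − {a}"
        ranked.append((max(s, 1 - s), name))
    ranked.sort(reverse=True)
    return ranked


n_pairs_45 = comb(45, 2)  # size of the full two-gene search space

# Toy expression values (made up); 1 = metastatic, 0 = not.
labels = [1, 1, 0, 0]
expr = {
    "TOP2A": [5, 6, 1, 2],
    "EPAS1": [1, 2, 5, 6],
    "G3": [1, 1, 1, 1],
    "G4": [2, 2, 2, 2],
}
ranked = rank_pairs(expr, labels)
names = [name for _, name in ranked]
```

Scanning all pairs rather than a hand-picked subset is what makes a recovered pair a finding rather than a planted result.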

External Replay

Cross-cohort survival.
Pre-registered gate.
log-rank · Cox · C-index all pass.

IMmotion150 — independent Phase-2 immunotherapy trial. n = 263 metastatic kidney cancer (ccRCC). Different cohort, different preprocessing, different endpoint (PFS — progression-free survival). Same two-gene score. Same direction.

✗ Our own three-gene extension · same external survival gate · KILLED
p = 0.117 · pre-registered · commit 47a6bd5
The loop is not designed to confirm. It rejects — including itself.

IMmotion150 KM curve — TOP2A−EPAS1 median split, HR 1.36, p=0.0003

Metric	Value
Cox HR per z-score	1.36
Log-rank p	0.0003
Harrell C-index	0.601
Median PFS gap	7.53 mo
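Of the three survival tests, Harrell's C-index is the simplest to state: among comparable patient pairs, the fraction in which the higher-risk patient progresses first. The sketch below is a simplified version that ignores tied event times; the function name and the four-patient toy data are made up, and it is not the project's replay code.

```python
def harrell_c_index(times, events, risk):
    """Concordance between a risk score and right-censored survival times.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter time than patient j. Tied risk counts half.
    """
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparison
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable


# Toy data (made up): PFS in months, 1 = progressed, 0 = censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = harrell_c_index(times, events, [0.9, 0.7, 0.5, 0.1])
mixed = harrell_c_index(times, events, [0.9, 0.1, 0.5, 0.7])
```

A value of 0.5 means the score orders patients no better than chance, which puts the reported 0.601 in context.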

Trajectory · DIPG active

250+

failed DIPG clinical trials

H3 K27M diffuse midline glioma — a pediatric brainstem cancer. Universally fatal — until August 2025. The same failure-first architecture that rejected the initial kidney-cancer layer is now pointed at brainstem.

15 candidates · pre-registered queue

Timeline

Decades

250+ failed clinical trials

no improvement over radiation alone

August 2025

Dordaviprone — first approval

itself a 10-year graveyard rescue · FDA accelerated

Now

15 candidates in queue

HERBY (n=47) · PNOC-022 (n=88) · pre-registered gate · same architecture

The published law is the tip.

The graveyard is the artifact.

Proven on known truth · trusted for unknown truth

For every scientist whose result said no.

github.com/jang1563/lacuna-falsification

IPF · Context Isolation

Separate context.
Separate truth.

$58.28 · 32 minutes · SHA 88eaca3

The Skeptic ran in a separate context window — it never received the Advocate's reasoning tokens. Context isolation as live audit layer.

Caught: two prior-trial claims the Advocate stated as established fact — both empirically false per the original trial publications.
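Context isolation can be sketched as a hand-off that forwards the Advocate's claims but strips its reasoning before anything reaches the Skeptic. `AdvocateOutput`, `skeptic_payload`, and the field names are hypothetical, and the candidate label and reasoning text are placeholders; no agent API is called here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AdvocateOutput:
    """Hypothetical record of what the Advocate session produced."""
    candidate: str
    claims: tuple
    reasoning: str  # stays in the Advocate's context, never forwarded


def skeptic_payload(out):
    """Build the only input the Skeptic session sees."""
    payload = asdict(out)
    payload.pop("reasoning")  # the Skeptic must judge claims on their own
    return payload


advocate = AdvocateOutput(
    candidate="IL-13 rescue hypothesis",  # placeholder label
    claims=(
        "No IPF IL-13 trial prespecified a Th2 stratifier",
        "LOXL2-high IPF patients were never tested in the simtuzumab trial",
    ),
    reasoning="(private chain of reasoning, placeholder)",
)
payload = skeptic_payload(advocate)
```

Because the Skeptic never sees how a claim was argued, it has to check each one against the original trial record rather than inherit the Advocate's framing.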

Two fabricated claims · Skeptic verdict

Advocate stated as fact (tralokinumab / RAINIER)

"No IPF IL-13 trial prespecified a Th2 stratifier — RAINIER enrolled unselected IPF, leaving the Th2-high endotype untested."

FABRICATED Skeptic verdict

RAINIER itself prespecified a periostin sub-group (canonical IL-13-induced Th2 marker). The "never prespecified" premise is empirically false per the original trial protocol.

Advocate stated as fact (simtuzumab / Raghu 2017)

"LOXL2-high IPF patients were never tested in the simtuzumab trial — stratified rescue remains viable."

FABRICATED Skeptic verdict

Raghu 2017 already prespecified LOXL2-stratified co-primary endpoints with the highest-tertile arm. The "never-tested" narrative is falsified by the originating trial itself.

Capability Overhang · E2 Ablation

Swap the model.
The loop breaks.

180 API calls. Three models, same 6 candidates, same gate metrics, same prompts/skeptic_review.md. One stance collapses completely — with extended thinking.

Sonnet (with thinking) · 0 / 60 PASS

→

Opus (no thinking) · 10 / 60 PASS

Opus ran without extended thinking · wins anyway · calibration, not compute

3 models × 6 candidates × 10 repeats · Skeptic verdict distribution

Model	PASS / 60	Dissent on PASS
Opus 4.7 no thinking	10	66.7%
Haiku 4.5 with thinking	14	53.3%
Sonnet 4.6 with thinking	0	100%

Sonnet dissents on 100% of gate-PASS candidates · cannot distinguish pass from fail · thinking budget is irrelevant
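The dissent column can be reproduced from the PASS counts under one inferred assumption that the deck does not state: 3 of the 6 candidates were gate-PASS, giving 30 Skeptic reviews of gate-PASS candidates per model, and every Skeptic PASS fell on one of them. Under that reading the arithmetic is:

```python
MODELS, CANDIDATES, REPEATS = 3, 6, 10
total_calls = MODELS * CANDIDATES * REPEATS  # matches the 180 API calls

# Assumption inferred from the percentages, not stated in the deck:
# 3 gate-PASS candidates x 10 repeats = 30 reviews per model.
GATE_PASS_REVIEWS = 3 * REPEATS

skeptic_pass = {"Opus 4.7": 10, "Haiku 4.5": 14, "Sonnet 4.6": 0}


def dissent_rate(passes, reviews=GATE_PASS_REVIEWS):
    """Percent of gate-PASS reviews where the Skeptic did not return PASS."""
    return round(100 * (1 - passes / reviews), 1)


dissent = {model: dissent_rate(n) for model, n in skeptic_pass.items()}
```

If the true number of gate-PASS candidates differs, the percentages would need a different denominator; the match with all three table rows is what makes 30 the plausible one.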

PhI-1 · meta-calibration · Opus predicted 4/4 own rejections before testing

Skeleton #1: "VEGFA redundant with EPAS1 — delta_baseline will not clear 0.05"

FAIL: delta_baseline ✓

Skeleton #4: "CCNB1 alone insufficient; MKI67 already dominates proliferation hierarchy"

FAIL: delta_baseline ✓

Skeletons #2, #3: same pattern — kill test named the failure leg before the gate ran

4/4 ✓

The model's failure model is accurate. It knows when it's wrong before seeing the results.
