Lacuna — Falsification-First Biological Law Discovery

Built with Claude Opus 4.7 · hackathon submission 2026

The gate rejected 194 of 203 KIRC evaluations.
The 9 survivors that emerged are led by TOP2A − EPAS1.

A pre-registered, deterministic falsification gate running under Opus 4.7. It cannot be negotiated. It rejects its own proposed laws. Then it interprets what remains.

194 / 203 · KIRC candidate evaluations rejected

0.726 · Survivor AUROC on TCGA-KIRC M0/M1

1.36 · Hazard ratio on IMmotion150 PFS

1 / 990 · Rashomon rank among all 2-gene pairs

Rediscovery, not novelty. The survivor TOP2A − EPAS1 reproduces the published ccA/ccB ccRCC subtype axis (Brannon 2010, PMID 20871783; ClearCode34, DOI 10.1016/j.eururo.2014.02.035). Unconstrained symbolic regression on 45 genes arrived at the published biology without being seeded with it. The gate accepted it on pre-registered criteria written before any fit ran.

Flagship survivor · ccRCC metastasis

The accepted law

Of 30 PySR candidates on the 45-gene expanded panel × M0/M1 endpoint, 9 pass the pre-registered gate. The simplest survivor:

Survivor — metastasis M0 vs M1 · TCGA-KIRC

score = TOP2A − EPAS1

TOP2A: topoisomerase IIα — direct proliferation marker, HPA-annotated prognostic-unfavorable in renal cancer.
EPAS1: HIF-2α — canonical well-differentiated hypoxic ccRCC driver, HPA-annotated prognostic-favorable.
When proliferation runs ahead of HIF-2α differentiation, the tumor is more likely metastatic.
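
A minimal sketch of how a compound score of this form can be compared against its single-gene components. The expression values below are hypothetical toy numbers, not TCGA-KIRC data, and the helper names (`auroc`, `sign_invariant_auroc`, `law_score`) are illustrative assumptions rather than the project's API.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sign_invariant_auroc(values: Sequence[float], labels: Sequence[int]) -> float:
    """Score a single gene regardless of direction: max(AUROC, 1 - AUROC)."""
    a = auroc(values, labels)
    return max(a, 1.0 - a)


def law_score(top2a: float, epas1: float) -> float:
    """The survivor's form: proliferation minus HIF-2alpha expression."""
    return top2a - epas1


# Hypothetical log-scale expression values; label 1 = M1, 0 = M0.
labels = [0, 0, 0, 0, 1, 1]
top2a = [6.0, 5.75, 6.5, 7.25, 7.5, 7.0]
epas1 = [9.0, 8.5, 9.0, 8.75, 8.75, 7.5]

compound = [law_score(t, e) for t, e in zip(top2a, epas1)]

print("compound  :", auroc(compound, labels))
print("TOP2A only:", sign_invariant_auroc(top2a, labels))
print("EPAS1 only:", sign_invariant_auroc(epas1, labels))
```

On this toy cohort the difference separates the classes better than either gene alone, which is the pattern the Δ-over-baseline comparison is designed to detect.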

0.726 · AUROC · CI lower 0.665 · perm p <0.001

+0.069 · Δ over best sign-invariant single gene (MKI67 0.657)

0.321 · AUPRC · 2.05× over 0.156 prevalence baseline

Honest caveat. A logistic regression on (TOP2A, EPAS1, TOP2A×EPAS1) reaches AUROC 0.722 on the same cohort — the compound law's distinctive contribution is interpretable compactness + pre-registered falsification, not an AUROC ceiling unreachable by an engineered baseline (Δ = +0.004 only). Research use only; not a diagnostic biomarker.

6-verdict replication chain

Where the law holds — and where it should not

The same equation is tested across mixed endpoints and platforms. 3 PASS · 2 pre-registered FAIL · 1 honest FAIL. The two negative controls (saturation-expected and cross-cancer) are the specificity story.

✓ TCGA-KIRC metastasis ✓ IMmotion150 PFS ✓ GSE53757 stage ∅ GSE53757 T-vs-N (informative FAIL) ∅ TCGA-BRCA T-vs-N (pre-reg FAIL) ∅ CPTAC-3 metastasis (honest FAIL)

Cohort / endpoint	n	Key metric	Verdict	Notes
TCGA-KIRC · metastasis M0/M1	505	AUROC 0.726 · perm p <0.001	PASS	Pre-registered TCGA 5-leg gate (confound leg null for this task)
IMmotion150 · PFS (survival)	263	HR 1.36 · p=0.0003 · C=0.601	PASS	Independently pre-registered survival kill tests (log-rank, Cox HR, Harrell C). Not same-endpoint replay.
GSE53757 · stage I-II vs III-IV	~70	AUROC 0.714 [0.584, 0.832]	PASS	Platform-shift support (Affymetrix). Not M0/M1 replay.
GSE53757 · tumor vs normal	~144	AUROC 0.995 (saturation)	INFORMATIVE FAIL	Single-gene saturation (CA9 AUROC 0.995); Δbase mathematically unreachable. Confirms specificity.
TCGA-BRCA · tumor vs normal	—	Δbase +0.009	PRE-REG FAIL	Cross-cancer negative control. Law is ccRCC-specific, not pan-cancer.
CPTAC-3 · metastasis M0/M1	155	ci_lower=0.542 · Δbase=−0.007 · direction p=0.006	HONEST FAIL	Direction preserved (p=0.006) but gate refuses: ci_lower < 0.60 and Δbase negative. Cross-platform replication not confirmed.
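
The CPTAC-3 refusal can be read as two threshold checks. The sketch below covers only the two legs cited in that row, not the full five-test gate. The 0.60 floor on `ci_lower` is taken from the row itself; the 0.05 floor on Δbase is an assumption inferred from the statement that survivors cluster at Δbase ≥0.05. Function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

CI_LOWER_MIN = 0.60    # stated in the CPTAC-3 row
DELTA_BASE_MIN = 0.05  # ASSUMPTION: inferred, not confirmed as the registered threshold


@dataclass(frozen=True)
class LegResult:
    passed: bool
    failed_legs: Tuple[str, ...]


def two_leg_check(ci_lower: float, delta_base: float) -> LegResult:
    """Apply the two threshold legs; any failing leg refuses the candidate."""
    failed = []
    if ci_lower < CI_LOWER_MIN:
        failed.append("ci_lower")
    if delta_base < DELTA_BASE_MIN:
        failed.append("delta_baseline")
    return LegResult(passed=not failed, failed_legs=tuple(failed))


# Values copied from the page: TCGA-KIRC flagship vs. CPTAC-3 replay.
kirc = two_leg_check(ci_lower=0.665, delta_base=0.069)
cptac3 = two_leg_check(ci_lower=0.542, delta_base=-0.007)

print("TCGA-KIRC:", kirc)
print("CPTAC-3  :", cptac3)
```

Because the check is a pure function of the metrics, a preserved effect direction (p=0.006) has no way to override a failed leg.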

Own-output kill (PhL-1). The H1-loop SLC22A8 three-gene extension TOP2A − (EPAS1 + SLC22A8) was also tested on IMmotion150 under the same survival kill tests. Verdict: FAIL (log-rank p=0.117; Cox HR 1.16 CI 0.99–1.37; C-index 0.566 vs 2-gene 0.601). Our own best guess, refused by our own data. This is the gate biting on its own outputs — the strongest pre-emptive response to the Sakana-v2-style circularity critique.
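
For readers unfamiliar with the C-index quoted above, here is a minimal pure-Python sketch of Harrell's concordance statistic. It is a textbook formulation under simple assumptions (pairs with tied times are skipped, risk ties count 0.5), not the project's implementation, and the four-patient data set is hypothetical.

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[int], risk: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the higher-risk patient fails first.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's last follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: months to progression, event flag (0 = censored), risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.2, 0.5, 0.1]

c_index = harrell_c(times, events, risk)
print("C-index:", c_index)  # 4 concordant out of 5 comparable pairs
```

A value of 0.5 corresponds to random ordering, which is why the drop from 0.601 to 0.566 counts against the three-gene extension.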

Context isolation audit — IPF (Run #1, lock SHA 88eaca3). The same Managed Agents isolation architecture applied to idiopathic pulmonary fibrosis: a Skeptic session running in a separate context window — never seeing the Advocate's reasoning tokens — caught 2 fabricated prior-trial claims in the Advocate's output. The Advocate stated RAINIER and Raghu 2017 had never tested specific patient stratifiers. Both statements were false. A single-context pipeline cannot catch this: the Skeptic would rationalise the Advocate's framing. Cost: $58.28 · 32 minutes. Context isolation is a live audit layer, not a logging convenience.

G1/G2/I2/I3/I4 rigor package

Beyond AUROC

Six pre-registered extensions probe compactness, calibration, and clinical meaning. 12 of 13 predictions PASS.

Extension	Key result	Verdict
G2 · AUPRC (imbalance-aware)	0.321 vs 0.156 baseline → 2.05× lift	PASS
G2 · Calibration slope	0.979, intercept −0.032 — well-calibrated per TRIPOD+AI 2024	PASS
G1 · Knockoff v2 (individual-gene FDR)	0 / 45 genes selected; EPAS1 rank 1, TOP2A rank 2 by mean W	Honest discordance — compound gate and univariate-FDR test different objects. Signal is genuinely compound.
I2 · Rashomon set	Rank 1 / 990 two-gene pairs. Tight set (±0.02): only 3 pairs — all (proliferation − HIF-2α).	PASS
I3 · Clinical translation	Cohen's d = 0.856; OR per 1-SD = 2.07 [1.65–2.59]; risk stratification utility confirmed	Screening grade (sens ≥0.50 @ spec ≥0.85) FAILS by 0.044 — honest
I4 · Information theory	Joint MI 1.82× individual max; linear form captures 0.92–0.98 of bivariate MI	PASS
G4 · Anchor regression	Cochran Q p=0.238 (TOP2A) · p=0.410 (EPAS1); coefficients stable γ=0→100	PASS
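
The Rashomon-set idea in row I2 can be sketched as an exhaustive ranking of every two-gene difference. With 45 genes there are C(45, 2) = 990 unordered pairs, matching the denominator above. The toy expression values, the four-gene panel, and the helper names are hypothetical; only the ±0.02 tolerance comes from the table.

```python
from itertools import combinations
from math import comb


def auroc(scores, labels):
    """Pairwise AUROC with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_pairs(expr, labels):
    """Rank every unordered gene pair (a, b) by sign-invariant AUROC of a - b."""
    ranked = []
    for a, b in combinations(sorted(expr), 2):
        diff = [x - y for x, y in zip(expr[a], expr[b])]
        auc = auroc(diff, labels)
        ranked.append(((a, b), max(auc, 1.0 - auc)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def rashomon_set(ranked, tol=0.02):
    """Pairs whose AUROC lies within `tol` of the best pair."""
    best = ranked[0][1]
    return [item for item in ranked if best - item[1] <= tol]


print("pairs on a 45-gene panel:", comb(45, 2))

# Hypothetical toy panel; label 1 = M1, 0 = M0.
labels = [0, 0, 0, 0, 1, 1]
expr = {
    "TOP2A": [6.0, 5.75, 6.5, 7.25, 7.5, 7.0],
    "EPAS1": [9.0, 8.5, 9.0, 8.75, 8.75, 7.5],
    "CA9": [5.0, 5.25, 5.0, 5.5, 5.25, 5.5],
    "CUBN": [4.0, 4.5, 4.25, 4.25, 4.0, 4.5],
}

ranked = rank_pairs(expr, labels)
for pair, auc in ranked:
    print(pair, auc)
print("Rashomon set:", rashomon_set(ranked))
```

A small set whose members share one mechanism is a stronger interpretability signal than a top rank alone, because it shows that near-equivalent models tell the same story.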

The rejection surface · 194 of 203 KIRC rejected

What the gate killed

The original ccRCC task layer yielded zero survivors, including CA9-dominated HIF/tubule-identity contrasts with high AUROC but too little gain over single-gene baselines. The gate is not sparing its own side.

194 of 203 KIRC evaluations rejected. The 9 survivors all came from the 45-gene metastasis_expanded sub-layer; the 11-gene initial layer rejected 100% before panel-absence repair.

Task (panel)	Dominant single gene	Candidates	Survivors	Gate fail reason
Tumor vs Normal · KIRC 11-gene	CA9 AUROC 0.965	33	0	delta_baseline: CA9 saturates; clearing the Δbase margin would require AUROC > 1.015, which is unreachable
Stage I-II vs III-IV · KIRC 11-gene	CUBN 0.610	34	0	delta_baseline — tubule marker dominates
5-yr Survival · KIRC 11-gene	CUBN 0.696	36	0	delta_baseline / perm_p — same single-gene ceiling
Metastasis · KIRC 11-gene	MKI67 0.645	37	0	delta_baseline — panel lacks proliferation/HIF-2α expansion genes
Metastasis · KIRC 45-gene	MKI67 0.657	30	9	—
Tumor vs Normal · LUAD	SFTPC 0.998	4	0	delta_baseline — SFTPC saturates; same structure as CA9 in KIRC
BRCA + other cohorts	—	29	0	cross-cancer negative controls, gate robustness, external replay lanes
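
The table's arithmetic can be checked directly. The sketch below sums all seven rows, including the LUAD and BRCA rows, and flags the tasks where a single gene leaves no headroom. The 0.05 minimum Δbase is an assumption, chosen because it reproduces the 1.015 figure in the CA9 row; the row tuples are copied from the table.

```python
# (task, dominant single-gene AUROC or None, candidates, survivors)
ROWS = [
    ("Tumor vs Normal · KIRC 11-gene", 0.965, 33, 0),
    ("Stage I-II vs III-IV · KIRC 11-gene", 0.610, 34, 0),
    ("5-yr Survival · KIRC 11-gene", 0.696, 36, 0),
    ("Metastasis · KIRC 11-gene", 0.645, 37, 0),
    ("Metastasis · KIRC 45-gene", 0.657, 30, 9),
    ("Tumor vs Normal · LUAD", 0.998, 4, 0),
    ("BRCA + other cohorts", None, 29, 0),
]

MIN_DELTA = 0.05  # ASSUMPTION: 0.965 + 0.05 reproduces the 1.015 quoted for CA9

total = sum(row[2] for row in ROWS)
survivors = sum(row[3] for row in ROWS)
rejected = total - survivors

# A candidate must beat the best single gene by MIN_DELTA; AUROC cannot exceed 1.
saturated = [
    task for task, base, _, _ in ROWS
    if base is not None and base + MIN_DELTA > 1.0
]

print(f"{rejected} of {total} rejected, {survivors} survivors")
print("saturated tasks:", saturated)
```

The two saturated tasks are the ones the table attributes to CA9 and SFTPC: no compound law can pass there, however good it is.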


Figure: Rejection landscape · AUROC distribution across all candidates, coloured by pass/fail verdict. Survivors cluster at Δbase ≥0.05.

E2 cross-model ablation · 180 API calls

Why Opus 4.7 for the Skeptic

Same prompt, same candidates, same gate metrics. Three models. Pre-registered specificity predictions were falsified (all models cite ≥2 metrics in 100% of cases — honest null). Verdict calibration is where Opus 4.7 diverges.

Model	PASS	NEEDS_MORE	FAIL	Dissent on gate-PASS (%)	Note
claude-opus-4-7	10 / 60	20	30	66.7%	Base calibration — no thinking (HTTP 400 on "enabled"); wins anyway
claude-haiku-4-5	14 / 60	16	30	53.3%	WITH extended thinking (23s latency)
claude-sonnet-4-6	0 / 60	30	30	100%	WITH extended thinking — collapses to permanent rejection regardless of evidence

The Skeptic role is model-sensitive. In this six-candidate audit, Sonnet 4.6 dissents on 100% of gate-PASS candidates, while Opus 4.7 issues PASS only when gate metrics warrant it (10 calls) and abstains when the margin is thin. This shows model identity changes gate-alignment behavior under the available inference modes; it does not identify a causal mechanism.
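
The dissent column can be reproduced from the verdict counts under one assumption that the page does not state explicitly: 30 of each model's 60 calls concern gate-PASS candidates, and every FAIL verdict falls on the gate-FAIL half. Dissent is then the share of gate-PASS calls that did not receive a PASS.

```python
# Verdict counts copied from the ablation table (60 calls per model).
VERDICTS = {
    "claude-opus-4-7": {"PASS": 10, "NEEDS_MORE": 20, "FAIL": 30},
    "claude-haiku-4-5": {"PASS": 14, "NEEDS_MORE": 16, "FAIL": 30},
    "claude-sonnet-4-6": {"PASS": 0, "NEEDS_MORE": 30, "FAIL": 30},
}

GATE_PASS_CALLS = 30  # ASSUMPTION: half of each model's calls are gate-PASS candidates


def dissent_on_gate_pass(counts: dict) -> float:
    """Percentage of gate-PASS calls where the Skeptic withheld a PASS."""
    return 100.0 * (GATE_PASS_CALLS - counts["PASS"]) / GATE_PASS_CALLS


total_calls = sum(sum(c.values()) for c in VERDICTS.values())
dissent = {model: round(dissent_on_gate_pass(c), 1) for model, c in VERDICTS.items()}

print("total API calls:", total_calls)
for model, pct in dissent.items():
    print(f"{model}: {pct}% dissent on gate-PASS")
```

The computed values match the table's 66.7%, 53.3%, and 100%, which supports the assumption without proving it.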

PhI-1 · meta-calibration

Opus knows when it's wrong — before testing

Opus 4.7 wrote ex-ante kill tests for 4 skeleton hypotheses on ccRCC metastasis. The gate confirmed every prediction, for example Skeleton #1 (VEGFA redundant: FAIL) and Skeleton #4 (CCNB1 alone insufficient: FAIL). Zero of 4 passed, so the kill tests were accurate: the model anticipated its own failures.

PhL-19 · Interpreter role quality · 3-model ablation

Interpretation quality: Opus vs alternatives

For each surviving law, the Interpreter role produces a mechanism hypothesis, a testable downstream prediction, and an explicit "what this is NOT" paragraph. The ablation measures whether this depth is Opus-specific or model-agnostic.

Model	Caveat rate	Prediction rate	Avg citations	Thinking
claude-opus-4-7	100%	100%	12	adaptive (effort:high)
claude-sonnet-4-6	0%	0%	—	with thinking
claude-haiku-4-5	0%	0%	—	with thinking

The "what this is NOT" paragraph is load-bearing. Sonnet and Haiku both produce interpretations — they describe mechanism and pathway biology. Neither produces explicit caveats ("not a diagnostic biomarker", "not novel biology", "not superior to an engineered LR baseline") or testable downstream predictions. Opus 4.7 achieves 100% on both dimensions without prompting. This is the difference between an interpretation that sounds like the literature and one that could be sent to a reviewer.

Claude Design · 6 slides · 4K UHD

Visual evidence

Six slides carry the judge-facing visual evidence for the loop, the gate, the survivor, and the replication chain.

01 · Hook loop (rejection and acceptance cycle)
02 · Gate verdict
03 · Landscape × IMmotion150
04 · KM close
05 · Architecture
06 · DIPG trajectory

Managed Agents + Routines · Opus 4.7

How it works

Four LLM roles. Three Anthropic products composed. One deterministic gate that no LLM can renegotiate.

1 · Proposer (Opus 4.7): given a DatasetCard + gene panel, emits 3–5 compact law families with ex-ante skeptic tests per family. Required to include at least one negative control it expects to FAIL. No data seen yet.

2 · PySR Searcher (Sonnet 4.6): runs symbolic regression locally with Opus's law families as guesses. Returns equations in gene-name form (e.g. TOP2A − EPAS1). No LLM judgment here.

3 · Falsification Gate (deterministic Python): five pre-registered tests (permutation null · bootstrap CI lower bound · sign-invariant Δbaseline · incremental confound · decoy-feature null). BH-FDR across candidates. Commits before any fit; thresholds cannot change. Decides pass/fail; no LLM involved.

4 · Skeptic (Opus 4.7): reviews the gate's JSON output, not the Proposer's rationale. Emits verdict + one additional test. Isolation from Proposer context is load-bearing to the rigor claim.

5 · Interpreter (Opus 4.7): for survivors only, produces a mechanism hypothesis, a testable downstream prediction, and an explicit "what this is NOT" paragraph. Reads gate metrics; cannot reverse a rejection.

6 · Path C Routine: Claude Code Routine (experimental-cc-routine-2026-04-01) watches for new DatasetCard or GitHub PR/release events and fires the full loop automatically. Nightly cron available. The gate runs without a human pressing a button.
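
Two of the gate's five tests, the permutation null and the bootstrap CI lower bound, can be sketched in the standard library alone. This is an illustrative formulation, not the project's gate code: the resample counts, the 95% interval, the fixed seeds, and the function names are all assumptions, and the 20-sample data set is a deliberately separable toy.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p(scores, labels, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled labels match or beat the observed AUROC."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def bootstrap_ci_lower(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Lower bound of the percentile bootstrap interval for AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples containing both classes
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int((alpha / 2) * n_boot)]


# Toy data: scores 0..19, the ten highest are positives (perfect separation).
scores = [float(i) for i in range(20)]
labels = [0] * 10 + [1] * 10

perm_p = permutation_p(scores, labels)
ci_lower = bootstrap_ci_lower(scores, labels)
print("perm p:", perm_p, "· CI lower:", ci_lower)
```

Fixing the seeds and thresholds in code before any fit runs is what makes such tests deterministic: rerunning the gate on the same candidate yields the same verdict.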

Agent	Model	Role	Can it change the gate?
Proposer	Opus 4.7	Law families + ex-ante kill tests	No
Searcher	Sonnet 4.6	PySR symbolic regression	No
Gate	—	Deterministic Python · 5 tests	The gate is the authority
Skeptic	Opus 4.7	Post-gate review · proposes 1 extra test	No
Interpreter	Opus 4.7	Survivor mechanism + prediction	No
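
The BH-FDR step applied by the gate across candidates is the standard Benjamini-Hochberg procedure. A minimal sketch follows; the q = 0.05 level and the five p-values are hypothetical and are not taken from the project's runs.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return one flag per candidate: True if it survives FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank

    # Every candidate ranked at or below k_max is kept.
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            keep[i] = True
    return keep


# Hypothetical permutation p-values for five candidate laws.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
kept = benjamini_hochberg(p_values, q=0.05)
print(kept)
```

With five candidates the per-rank thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05, so only the first two p-values clear their thresholds. The correction matters because many candidates are tested in the same run, so some small p-values are expected by chance.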

Resources

Full artifact index

Demo video (YouTube): 3-minute walkthrough · KIRC reject/accept cycle · DIPG + IPF generalization · Opus 4.7 roles
GitHub · lacuna-falsification: Full source · 115/115 tests · make audit OK · .devcontainer/ for judge setup
Interactive rejection log: Initial rejects, repaired survivors, and external replay · every failed law with its fail reason
Methodology doc: Gate design · accept/reject table · G1/G2 rigor extensions · anchor regression
Survivor narrative: TOP2A−EPAS1 deep dive · Rashomon set · clinical translation · what it is NOT
Why Opus 4.7: Ablation · thinking mechanism · adversarial critique · gate-as-authority-substrate
Discovery story (interactive): 6-verdict replication chain · full narrative · every node clickable with metric detail