Built with Claude Opus 4.7 · hackathon submission 2026

The gate rejected 194 of 203 KIRC evaluations.
The 9 survivors that emerged are led by TOP2A − EPAS1.

A pre-registered, deterministic falsification gate running under Opus 4.7. It cannot be negotiated. It rejects its own proposed laws. Then it interprets what remains.

KIRC candidate evaluations rejected: 194 / 203
Survivor AUROC on TCGA-KIRC M0/M1: 0.726
Hazard ratio on IMmotion150 PFS: 1.36
Rashomon rank among all 2-gene pairs: 1 / 990
Rediscovery, not novelty. The survivor TOP2A − EPAS1 reproduces the published ccA/ccB ccRCC subtype axis (Brannon 2010, PMID 20871783; ClearCode34, DOI 10.1016/j.eururo.2014.02.035). Unconstrained symbolic regression on 45 genes arrived at the published biology without being seeded with it. The gate accepted it on pre-registered criteria written before any fit ran.
Flagship survivor · ccRCC metastasis

The accepted law

Of 30 PySR candidates on the 45-gene expanded panel × M0/M1 endpoint, 9 pass the pre-registered gate. The simplest survivor:

Survivor — metastasis M0 vs M1 · TCGA-KIRC
score = TOP2A − EPAS1
TOP2A: topoisomerase IIα — direct proliferation marker, HPA-annotated prognostic-unfavorable in renal cancer.
EPAS1: HIF-2α — canonical well-differentiated hypoxic ccRCC driver, HPA-annotated prognostic-favorable.
When proliferation runs ahead of HIF-2α differentiation, the tumor is more likely metastatic.
AUROC 0.726 · CI lower 0.665 · perm p <0.001
Δ +0.069 over best sign-invariant single gene (MKI67 0.657)
AUPRC 0.321 · 2.05× over 0.156 prevalence baseline
Honest caveat. A logistic regression on (TOP2A, EPAS1, TOP2A×EPAS1) reaches AUROC 0.722 on the same cohort — the compound law's distinctive contribution is interpretable compactness + pre-registered falsification, not an AUROC ceiling unreachable by an engineered baseline (Δ = +0.004 only). Research use only; not a diagnostic biomarker.
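The score is a single subtraction, and AUROC has a plain rank reading: the probability that a randomly chosen M1 sample scores higher than a randomly chosen M0 sample. The sketch below illustrates both under stated assumptions. The function names and the toy expression values are hypothetical, expression is assumed to be on a log scale, and none of this is the project's own code or data.

```python
from typing import Sequence


def law_score(top2a: float, epas1: float) -> float:
    """Compound score from the text: TOP2A minus EPAS1 (log-scale expression assumed)."""
    return top2a - epas1


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mann-Whitney AUROC: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy samples: (TOP2A, EPAS1, label), where label 1 = M1 and 0 = M0.
toy = [
    (5.0, 9.0, 0),
    (6.0, 9.5, 0),
    (6.5, 8.5, 0),
    (7.5, 8.0, 1),
    (8.0, 8.2, 1),
    (6.0, 7.0, 1),
]
toy_scores = [law_score(t, e) for t, e, _ in toy]
toy_labels = [y for _, _, y in toy]
toy_auroc = auroc(toy_scores, toy_labels)
```

In the toy data every M1 sample has a higher TOP2A − EPAS1 than every M0 sample, so the toy AUROC is at its ceiling; the real cohort value reported above is 0.726.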
6-verdict replication chain

Where the law holds — and where it should not

The same equation is tested across mixed endpoints and platforms. 3 PASS · 2 pre-registered FAIL · 1 honest FAIL. The negative controls (saturation-expected + cross-cancer) are the specificity story.

✓ TCGA-KIRC metastasis ✓ IMmotion150 PFS ✓ GSE53757 stage ∅ GSE53757 T-vs-N (informative FAIL) ∅ TCGA-BRCA T-vs-N (pre-reg FAIL) ∅ CPTAC-3 metastasis (honest FAIL)
| Cohort / endpoint | n | Key metric | Verdict | Notes |
| --- | --- | --- | --- | --- |
| TCGA-KIRC · metastasis M0/M1 | 505 | AUROC 0.726 · perm p <0.001 | PASS | Pre-registered TCGA 5-leg gate (confound leg null for this task) |
| IMmotion150 · PFS (survival) | 263 | HR 1.36 · p=0.0003 · C=0.601 | PASS | Independently pre-registered survival kill tests (log-rank, Cox HR, Harrell C). Not same-endpoint replay. |
| GSE53757 · stage I-II vs III-IV | ~70 | AUROC 0.714 [0.584, 0.832] | PASS | Platform-shift support (Affymetrix). Not M0/M1 replay. |
| GSE53757 · tumor vs normal | ~144 | AUROC 0.995 (saturation) | INFORMATIVE FAIL | Single-gene saturation (CA9 AUROC 0.995); Δbase mathematically unreachable. Confirms specificity. |
| TCGA-BRCA · tumor vs normal | | Δbase +0.009 | PRE-REG FAIL | Cross-cancer negative control. Law is ccRCC-specific, not pan-cancer. |
| CPTAC-3 · metastasis M0/M1 | 155 | ci_lower=0.542 · Δbase=−0.007 · direction p=0.006 | HONEST FAIL | Direction preserved (p=0.006) but gate refuses: ci_lower < 0.60 and Δbase negative. Cross-platform replication not confirmed. |
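The CPTAC-3 row shows why a preserved direction is not enough: the verdict is a conjunction of fixed thresholds. A minimal sketch of that logic follows, covering only the two legs named in the row. The 0.60 floor on `ci_lower` comes from the table; the 0.05 margin on Δbase is an assumption inferred from the 1.015 figure quoted for a 0.965 baseline elsewhere on this page. `GateMetrics` and `gate_verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

CI_LOWER_MIN = 0.60    # stated in the CPTAC-3 row
DELTA_BASE_MIN = 0.05  # assumption: inferred, not quoted as a threshold in the text


@dataclass(frozen=True)
class GateMetrics:
    ci_lower: float    # bootstrap CI lower bound on AUROC
    delta_base: float  # AUROC gain over the best sign-invariant single gene


def gate_verdict(m: GateMetrics) -> Tuple[str, List[str]]:
    """Return PASS only if every leg clears its threshold; list the failing legs."""
    failed: List[str] = []
    if m.ci_lower < CI_LOWER_MIN:
        failed.append("ci_lower")
    if m.delta_base < DELTA_BASE_MIN:
        failed.append("delta_baseline")
    return ("PASS" if not failed else "FAIL", failed)


# Metrics copied from the page: TCGA-KIRC survivor and CPTAC-3 replay.
kirc = GateMetrics(ci_lower=0.665, delta_base=0.069)
cptac3 = GateMetrics(ci_lower=0.542, delta_base=-0.007)

kirc_verdict = gate_verdict(kirc)
cptac3_verdict = gate_verdict(cptac3)
```

Because the thresholds are module-level constants rather than arguments supplied by a caller, no downstream role can loosen them for a single candidate, which is the property the page attributes to the gate.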
Own-output kill (PhL-1). The H1-loop SLC22A8 three-gene extension TOP2A − (EPAS1 + SLC22A8) was also tested on IMmotion150 under the same survival kill tests. Verdict: FAIL (log-rank p=0.117; Cox HR 1.16 CI 0.99–1.37; C-index 0.566 vs 2-gene 0.601). Our own best guess, refused by our own data. This is the gate biting on its own outputs — the strongest pre-emptive response to the Sakana-v2-style circularity critique.
Context isolation audit — IPF (Run #1, lock SHA 88eaca3). The same Managed Agents isolation architecture applied to idiopathic pulmonary fibrosis: a Skeptic session running in a separate context window — never seeing the Advocate's reasoning tokens — caught 2 fabricated prior-trial claims in the Advocate's output. The Advocate stated RAINIER and Raghu 2017 had never tested specific patient stratifiers. Both statements were false. A single-context pipeline cannot catch this: the Skeptic would rationalise the Advocate's framing. Cost: $58.28 · 32 minutes. Context isolation is a live audit layer, not a logging convenience.
G1/G2/G4/I2/I3/I4 rigor package

Beyond AUROC

Six pre-registered extensions probe compactness, calibration, and clinical meaning. 12 of 13 predictions PASS.

| Extension | Key result | Verdict |
| --- | --- | --- |
| G2 · AUPRC (imbalance-aware) | 0.321 vs 0.156 baseline → 2.05× lift | PASS |
| G2 · Calibration | slope 0.979, intercept −0.032; well-calibrated per TRIPOD+AI 2024 | PASS |
| G1 · Knockoff v2 (individual-gene FDR) | 0 / 45 genes selected; EPAS1 rank 1, TOP2A rank 2 by mean W | Honest discordance: compound gate and univariate-FDR test different objects. Signal is genuinely compound. |
| I2 · Rashomon set | Rank 1 / 990 two-gene pairs. Tight set (±0.02): only 3 pairs, all (proliferation − HIF-2α). | PASS |
| I3 · Clinical translation | Cohen's d = 0.856; OR per 1-SD = 2.07 [1.65–2.59]; risk stratification utility confirmed | Screening grade (sens ≥0.50 @ spec ≥0.85) FAILS by 0.044 (honest) |
| I4 · Information theory | Joint MI 1.82× individual max; linear form captures 0.92–0.98 of bivariate MI | PASS |
| G4 · Anchor regression | Cochran Q p=0.238 (TOP2A) · p=0.410 (EPAS1); coefficients stable γ=0→100 | PASS |
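The 990 in the Rashomon row is the number of unordered gene pairs in a 45-gene panel, C(45, 2). A sketch of how such a ranking could be built is below. It assumes each pair is scored as a difference and evaluated with sign-invariant AUROC, and that the tight set means pairs within 0.02 AUROC of the best. Gene names and values in the toy panel are hypothetical, as are the function names.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mann-Whitney AUROC with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_pairs(
    expr: Dict[str, Sequence[float]], labels: Sequence[int]
) -> List[Tuple[float, str, str]]:
    """Score every unordered pair (a, b) as a - b and sort by sign-invariant AUROC."""
    ranked = []
    for a, b in combinations(sorted(expr), 2):
        diff = [x - y for x, y in zip(expr[a], expr[b])]
        auc = auroc(diff, labels)
        ranked.append((max(auc, 1.0 - auc), a, b))  # sign-invariant
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked


def tight_set(
    ranked: List[Tuple[float, str, str]], epsilon: float = 0.02
) -> List[Tuple[float, str, str]]:
    """Pairs whose AUROC is within epsilon of the top-ranked pair (assumed definition)."""
    best = ranked[0][0]
    return [row for row in ranked if best - row[0] <= epsilon]


# Hypothetical 4-gene toy panel, 4 samples; a 45-gene panel gives comb(45, 2) pairs.
toy_expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [1.0, 1.0, 1.0, 1.0],
    "GENE_D": [2.0, 1.0, 2.0, 1.0],
}
toy_labels = [0, 0, 1, 1]
ranked = rank_pairs(toy_expr, toy_labels)
n_pairs_45 = comb(45, 2)
```

The toy panel yields six pairs; the same enumeration over 45 genes yields the 990 pairs against which TOP2A − EPAS1 is reported to rank first.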
The rejection surface · 194 of 203 KIRC rejected

What the gate killed

The original ccRCC task layer yielded zero survivors, including CA9-dominated HIF/tubule-identity contrasts with high AUROC but too little gain over single-gene baselines. The gate is not sparing its own side.

194 of 203 KIRC evaluations rejected. The 9 survivors all came from the 45-gene metastasis_expanded sub-layer; the 11-gene initial layer rejected 100% before panel-absence repair.

| Task (panel) | Dominant single gene | Candidates | Survivors | Gate fail reason |
| --- | --- | --- | --- | --- |
| Tumor vs Normal · KIRC 11-gene | CA9 AUROC 0.965 | 33 | 0 | delta_baseline: CA9 saturates; compound AUROC >1.015 would be required (0.965 + 0.05), unreachable |
| Stage I-II vs III-IV · KIRC 11-gene | CUBN 0.610 | 34 | 0 | delta_baseline: tubule marker dominates |
| 5-yr Survival · KIRC 11-gene | CUBN 0.696 | 36 | 0 | delta_baseline / perm_p: same single-gene ceiling |
| Metastasis · KIRC 11-gene | MKI67 0.645 | 37 | 0 | delta_baseline: panel lacks proliferation/HIF-2α expansion genes |
| Metastasis · KIRC 45-gene | MKI67 0.657 | 30 | 9 | |
| Tumor vs Normal · LUAD | SFTPC 0.998 | 4 | 0 | delta_baseline: SFTPC saturates; same structure as CA9 in KIRC |
| BRCA + other cohorts | | 29 | 0 | cross-cancer negative controls, gate robustness, external replay lanes |
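The saturation rows follow from simple arithmetic: if a compound law must beat the best single gene by a fixed margin, a near-perfect single gene pushes the required AUROC above 1.0. The sketch below works through the numbers in the table, assuming a margin of 0.05 (inferred from the 1.015 figure for CA9, not quoted as a threshold). The function names are hypothetical.

```python
DELTA_BASE_MIN = 0.05  # assumption: margin inferred from the CA9 row


def required_auroc(single_gene_auroc: float, margin: float = DELTA_BASE_MIN) -> float:
    """AUROC a compound law must reach to clear the delta_baseline leg."""
    return single_gene_auroc + margin


def is_reachable(single_gene_auroc: float, margin: float = DELTA_BASE_MIN) -> bool:
    """AUROC cannot exceed 1.0, so a higher requirement can never be met."""
    return required_auroc(single_gene_auroc, margin) <= 1.0


# Single-gene baselines copied from the table.
baselines = {
    "CA9 (KIRC tumor vs normal)": 0.965,
    "SFTPC (LUAD tumor vs normal)": 0.998,
    "MKI67 (KIRC metastasis, 45-gene)": 0.657,
}
reachability = {name: is_reachable(auc) for name, auc in baselines.items()}

# The survivor's reported AUROC against the MKI67 requirement.
survivor_clears = 0.726 >= required_auroc(0.657)
```

Under this margin the two tumor-vs-normal tasks are unwinnable for any compound law, while the metastasis task leaves room that the survivor's reported 0.726 fills.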

Figure: Rejection landscape. AUROC distribution across all candidates, coloured by pass/fail verdict. Survivors cluster at Δbase ≥0.05.

E2 cross-model ablation · 180 API calls

Why Opus 4.7 for the Skeptic

Same prompt, same candidates, same gate metrics. Three models. Pre-registered specificity predictions were falsified (all models cite ≥2 metrics in 100% of cases — honest null). Verdict calibration is where Opus 4.7 diverges.

| Model | PASS | NEEDS_MORE | FAIL | Dissent on gate-PASS (%) | Note |
| --- | --- | --- | --- | --- | --- |
| claude-opus-4-7 | 10 / 60 | 20 | 30 | 66.7% | Base calibration, no thinking (HTTP 400 on "enabled"); wins anyway |
| claude-haiku-4-5 | 14 / 60 | 16 | 30 | 53.3% | WITH extended thinking (23s latency) |
| claude-sonnet-4-6 | 0 / 60 | 30 | 30 | 100% | WITH extended thinking; collapses to permanent rejection regardless of evidence |
The Skeptic role is model-sensitive. In this six-candidate audit, Sonnet 4.6 dissents on 100% of gate-PASS candidates, while Opus 4.7 issues PASS only when gate metrics warrant it (10 calls) and abstains when the margin is thin. This shows model identity changes gate-alignment behavior under the available inference modes; it does not identify a causal mechanism.
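The dissent column can be reproduced from the PASS counts. The sketch below assumes that 30 of each model's 60 calls were made on gate-PASS candidates; that denominator is inferred from the percentages in the table and is not stated in the text. `dissent_rate` is a hypothetical helper.

```python
GATE_PASS_CALLS = 30  # assumption: inferred denominator, half of the 60 calls per model

# PASS counts copied from the ablation table.
pass_counts = {
    "claude-opus-4-7": 10,
    "claude-haiku-4-5": 14,
    "claude-sonnet-4-6": 0,
}


def dissent_rate(n_pass: int, n_gate_pass_calls: int = GATE_PASS_CALLS) -> float:
    """Percentage of gate-PASS calls on which the Skeptic did not return PASS."""
    return 100.0 * (1.0 - n_pass / n_gate_pass_calls)


dissent = {model: round(dissent_rate(n), 1) for model, n in pass_counts.items()}
```

With that denominator the computed values match the table's 66.7%, 53.3% and 100%.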
PhI-1 · meta-calibration

Opus knows when it's wrong — before testing

Opus 4.7 wrote ex-ante kill tests for 4 skeleton hypotheses on ccRCC metastasis. The gate confirmed every prediction, including Skeleton #1 (VEGFA redundant: FAIL) and Skeleton #4 (CCNB1 alone insufficient: FAIL). Zero of 4 passed. The kill tests were accurate: the model anticipated its own failures.

PhL-19 · Interpreter role quality · 3-model ablation

Interpretation quality: Opus vs alternatives

For each surviving law, the Interpreter role produces a mechanism hypothesis, a testable downstream prediction, and an explicit "what this is NOT" paragraph. The ablation measures whether this depth is Opus-specific or model-agnostic.

| Model | Caveat rate | Prediction rate | Avg citations | Thinking |
| --- | --- | --- | --- | --- |
| claude-opus-4-7 | 100% | 100% | 12 | adaptive (effort:high) |
| claude-sonnet-4-6 | 0% | 0% | | with thinking |
| claude-haiku-4-5 | 0% | 0% | | with thinking |
The "what this is NOT" paragraph is load-bearing. Sonnet and Haiku both produce interpretations — they describe mechanism and pathway biology. Neither produces explicit caveats ("not a diagnostic biomarker", "not novel biology", "not superior to an engineered LR baseline") or testable downstream predictions. Opus 4.7 achieves 100% on both dimensions without prompting. This is the difference between an interpretation that sounds like the literature and one that could be sent to a reviewer.
Claude Design · 6 slides · 4K UHD

Visual evidence

Judge-facing visual evidence for the loop, the gate, the survivor, and the replication chain.

01 · Hook loop: rejection and acceptance cycle
02 · Gate verdict: compose verdict, 5-test gate
03 · Landscape × IMmotion150: rejection landscape + IMmotion150
04 · KM curve close
05 · Architecture
06 · DIPG trajectory
Managed Agents + Routines · Opus 4.7

How it works

Four roles. Three Anthropic products composed. One deterministic gate that no LLM can renegotiate.

1. Proposer (Opus 4.7): given a DatasetCard + gene panel, emits 3–5 compact law families with ex-ante skeptic tests per family. Required to include at least one negative control it expects to FAIL. No data seen yet.
2. PySR Searcher (Sonnet 4.6): runs symbolic regression locally with Opus's law families as guesses. Returns equations in gene-name form (e.g. TOP2A − EPAS1). No LLM judgment here.
3. Falsification Gate (deterministic Python): five pre-registered tests: permutation null · bootstrap CI lower bound · sign-invariant Δbaseline · incremental confound · decoy-feature null. BH-FDR across candidates. Commits before any fit; thresholds cannot change. Decides pass/fail; no LLM involved.
4. Skeptic (Opus 4.7): reviews the gate's JSON output, not the Proposer's rationale. Emits verdict + one additional test. Isolation from Proposer context is load-bearing to the rigor claim.
5. Interpreter (Opus 4.7): for survivors only: mechanism hypothesis, testable downstream prediction, explicit "what this is NOT" paragraph. Reads gate metrics; cannot reverse a rejection.
6. Path C Routine: Claude Code Routine (experimental-cc-routine-2026-04-01) watches for new DatasetCard or GitHub PR/release events and fires the full loop automatically. Nightly cron available. The gate runs without a human pressing a button.
| Agent | Model | Role | Can it change the gate? |
| --- | --- | --- | --- |
| Proposer | Opus 4.7 | Law families + ex-ante kill tests | No |
| Searcher | Sonnet 4.6 | PySR symbolic regression | No |
| Gate | Deterministic Python | 5 tests | The gate is the authority |
| Skeptic | Opus 4.7 | Post-gate review · proposes 1 extra test | No |
| Interpreter | Opus 4.7 | Survivor mechanism + prediction | No |
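Three of the statistics named in the gate step (permutation null, bootstrap CI lower bound, BH-FDR across candidates) can be sketched in plain Python. This is an illustration under assumptions, not the project's gate: the function names, the resample counts, the 0.05 level and the percentile method are all choices made here, and the confound and decoy-feature legs are omitted.

```python
import random
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mann-Whitney AUROC with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p(scores: Sequence[float], labels: Sequence[int],
                  n_perm: int = 200, seed: int = 0) -> float:
    """One-sided permutation p-value: how often shuffled labels match the observed AUROC."""
    rng = random.Random(seed)  # fixed seed keeps the result deterministic
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


def bootstrap_ci_lower(scores: Sequence[float], labels: Sequence[int],
                       n_boot: int = 200, alpha: float = 0.05, seed: int = 0) -> float:
    """Percentile-bootstrap lower bound on AUROC; single-class resamples are skipped."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes present (labels are 0/1)
            stats.append(auroc([scores[i] for i in idx], ys))
    if not stats:
        raise ValueError("no resample contained both classes")
    stats.sort()
    return stats[int((alpha / 2) * len(stats))]


def bh_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg: reject the k smallest p-values, k = max rank with p <= rank/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical, perfectly separated toy data: 10 negatives then 10 positives.
demo_scores = [0.1 * i for i in range(20)]
demo_labels = [0] * 10 + [1] * 10
demo_p = permutation_p(demo_scores, demo_labels)
demo_lower = bootstrap_ci_lower(demo_scores, demo_labels)
demo_bh = bh_reject([0.01, 0.04, 0.03, 0.9])
```

Seeding each resampler is one way to keep such a gate deterministic, so that rerunning it on the same candidate returns the same verdict.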
Lacuna