Three independent evaluation systems with published baseline comparisons. All scores come from real runs on real data: no cherry-picking, and no domain hints in fair mode.
4-layer composable scorer: Gene overlap, Pathway matching, Direction accuracy, and LLM-as-judge biology evaluation.
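As a hedged sketch of the first layer, gene overlap can be scored as set-based F1 between predicted and reference gene symbols. The function name and signature here are illustrative, not the actual implementation:

```python
# Hypothetical sketch of the gene-overlap layer: set-based precision,
# recall, and F1 between predicted and reference gene symbols.
def gene_overlap_f1(predicted, reference):
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                # true positives: shared genes
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 3 of 4 predictions hit a 5-gene reference set.
score = gene_overlap_f1(["TP53", "BRCA1", "EGFR", "KRAS"],
                        ["TP53", "BRCA1", "EGFR", "MYC", "PTEN"])
# precision 0.75, recall 0.60 -> F1 = 2*0.75*0.60/1.35 ≈ 0.667
```

The pathway-matching layer can use the same set arithmetic over pathway identifiers instead of gene symbols.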
Domain-specific prompts in `knowledge_extract()` previously named expected genes directly, inflating recall by roughly 20–30 percentage points.
Fair-mode runs disable all hints; results are marked `fair_mode: true` in the saved JSON.
| Dataset | With Hints | Fair Mode | Delta | Gene F1 (fair) | Grade |
|---|---|---|---|---|---|
| TCGA BRCA Pathway | 0.866 | 0.787 | −0.079 | 0.625 | GOLD |
| GTEx Tissue Markers | 0.666 | 0.516 | −0.150 | 0.421 | SILVER |
Saved run JSON lives in `backend/data/w9_benchmark/runs/`.
Cancer retains GOLD even without hints; GTEx drops to SILVER because direction_accuracy = 0 (query-only dataset, no fold-change data).
Gene F1 on the GenoTEX unconditional gene-trait association task (132 traits). Baselines from GenoMAS (arXiv:2507.21035).
Automated scientific peer review evaluated against 29 real eLife reviewer comments. The first benchmark testing AI peer review quality in biology.
On 2 pilot articles with detailed evaluation, the system achieved 80% recall and 41.1% precision: it catches most real issues but also generates many false positives.
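Precision and recall here reduce to simple counts once each generated comment has been matched (or not) to a real reviewer comment. A minimal sketch, with illustrative counts only (the actual matched/generated totals are not given in this section):

```python
# Hypothetical sketch: score generated review comments against gold
# eLife reviewer comments after a matching step has paired them up.
def review_precision_recall(n_generated, n_gold, n_matched):
    # precision: fraction of generated comments that match a real one
    precision = n_matched / n_generated if n_generated else 0.0
    # recall: fraction of real reviewer comments the system recovered
    recall = n_matched / n_gold if n_gold else 0.0
    return precision, recall

# Illustrative numbers only: 39 generated, 20 gold, 16 matched
# gives roughly the pilot-level figures (recall 0.80, precision ~0.41).
p, r = review_precision_recall(39, 20, 16)
```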
| eLife Article | Comments | Major | Minor | Suggestion | Question | Decision | Confidence | Cost |
|---|---|---|---|---|---|---|---|---|
| 108498 | 20 | 7 | 6 | 4 | 3 | Major Revision | 0.55 | $1.26 |
| 104054 | 19 | 5 | 9 | 3 | 2 | Minor Revision | 0.72 | $1.40 |
| 78908 | 19 | 6 | 7 | 3 | 3 | Major Revision | 0.68 | $1.42 |
| 84749 | 19 | 5 | 8 | 3 | 3 | Major Revision | 0.70 | $1.54 |
| 102144 | 17 | 5 | 6 | 3 | 3 | Major Revision | 0.72 | $1.25 |
| 100396 | 12 | 3 | 4 | 2 | 3 | Minor Revision | 0.78 | $0.71 |
| 85300 | 10 | 3 | 3 | 2 | 2 | Minor Revision | 0.80 | $0.65 |
| 86334 | 9 | 2 | 3 | 2 | 2 | Minor Revision | 0.82 | $0.56 |
| ... 21 more articles (avg 12.8 comments, range 9–17) ... | ||||||||
Higher major-comment count correlates with major revision decisions (r=0.72). Articles with confidence > 0.75 tend to receive minor revision recommendations. Average review cost of $0.89/article makes large-scale screening feasible. All 29 articles are from eLife's open peer review corpus with published decision letters.
6-rater cross-validation on 338 biomedical abstract pairs: 3 models × 2 prompt strategies. Total evaluation cost: $0.36.
| Model | Prompt | Precision | Recall | F1 | Fail % | Cost |
|---|---|---|---|---|---|---|
| Llama 4 Scout | contrastive | 0.599 | 0.932 | 0.729 | 32.5% | $0.05 |
| DeepSeek V3.2 | contrastive | 0.593 | 0.854 | 0.700 | 0% | $0.14 |
| Gemini 2.5 Flash | baseline | 0.619 | 0.645 | 0.632 | 0% | $0.00 |
| Gemini 2.5 Flash | contrastive | 0.459 | 0.967 | 0.623 | 7.7% | $0.00 |
| DeepSeek V3.2 | baseline | 0.778 | 0.285 | 0.417 | 0% | $0.05 |
| Llama 4 Scout | baseline | 0.933 | 0.156 | 0.267 | 17.2% | $0.06 |
PROMPT > MODEL: the contrastive prompt yields recall of 0.85–0.97 across all model families, while the baseline prompt reaches only 0.16–0.65. Prompt engineering matters more than model selection for contradiction detection.
Results are validated against gold-standard annotations.
Composable scoring pipeline: each layer operates independently and scores are combined via weighted sum with automatic renormalization when ground truth is unavailable.
Saved runs record `fair_mode: true` for reproducibility. BioAgent Score follows the same renormalization pattern as RCMXT (our evidence confidence vector): when an axis has no ground truth, its weight is removed and the remaining weights are rescaled to sum to 1.0. This prevents missing data from inflating scores.
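The renormalization rule above can be sketched in a few lines. Axis names and weights below are assumptions for illustration, not the production configuration:

```python
# Hedged sketch of the weight-renormalization rule: axes without
# ground truth are dropped and the remaining weights rescaled to sum
# to 1.0, so a missing axis cannot inflate (or deflate) the score.
def renormalized_score(scores, weights):
    """scores: axis -> float in [0, 1], or None when no ground truth."""
    available = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(available.values())
    if total == 0:
        return 0.0
    # rescale each remaining weight by 1/total so they sum to 1.0
    return sum(scores[k] * (w / total) for k, w in available.items())

# Illustrative weights (not the real configuration):
weights = {"gene": 0.4, "pathway": 0.3, "direction": 0.2, "judge": 0.1}

# Direction accuracy unavailable, e.g. a query-only dataset like GTEx:
score = renormalized_score(
    {"gene": 0.8, "pathway": 0.6, "direction": None, "judge": 0.9}, weights)
# remaining weights rescale to 0.5/0.375/0.125
# -> 0.8*0.5 + 0.6*0.375 + 0.9*0.125 = 0.7375
```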