Three independent evaluation systems with published baseline comparisons. All scores come from real runs on real data: no cherry-picking, and no domain hints in fair mode.
4-layer composable scorer: Gene overlap, Pathway matching, Direction accuracy, and LLM-as-judge biology evaluation.
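As a hedged sketch of the first layer, gene overlap can be scored as set-based F1 between predicted and reference gene symbols. The function name and signature here are illustrative, not the actual implementation:

```python
# Hypothetical sketch of the gene-overlap layer: set-based precision,
# recall, and F1 between predicted and reference gene symbols.
def gene_overlap_f1(predicted, reference):
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                # true positives: shared genes
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 3 of 4 predictions hit a 5-gene reference set.
score = gene_overlap_f1(["TP53", "BRCA1", "EGFR", "KRAS"],
                        ["TP53", "BRCA1", "EGFR", "MYC", "PTEN"])
# precision 0.75, recall 0.60 -> F1 = 2*0.75*0.60/1.35 ≈ 0.667
```

The pathway-matching layer can use the same set arithmetic over pathway identifiers instead of gene symbols.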
Domain-specific prompts in `knowledge_extract()` previously named expected genes directly, inflating recall by roughly 20–30 percentage points.
Fair-mode runs disable all hints; results are marked `fair_mode: true` in the saved JSON.
| Dataset | With Hints | Fair Mode | Delta | Gene F1 (fair) | Grade |
|---|---|---|---|---|---|
| TCGA BRCA Pathway | 0.866 | 0.787 | −0.079 | 0.625 | GOLD |
| GTEx Tissue Markers | 0.666 | 0.516 | −0.150 | 0.421 | SILVER |
Saved run JSON lives in `backend/data/w9_benchmark/runs/`.
Cancer retains GOLD even without hints; GTEx drops to SILVER because direction_accuracy = 0 (query-only dataset, no fold-change data).
Gene F1 on the GenoTEX unconditional gene-trait association task (132 traits). Baselines from GenoMAS (arXiv:2507.21035).
Automated scientific peer review evaluated against 29 real eLife reviewer comments. The first benchmark testing AI peer review quality in biology.
On 2 pilot articles with detailed evaluation, the system achieved 80% recall and 41.1% precision: it catches most real issues but also generates many false positives.
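Precision and recall here reduce to simple counts once each generated comment has been matched (or not) to a real reviewer comment. A minimal sketch, with illustrative counts only (the actual matched/generated totals are not given in this section):

```python
# Hypothetical sketch: score generated review comments against gold
# eLife reviewer comments after a matching step has paired them up.
def review_precision_recall(n_generated, n_gold, n_matched):
    # precision: fraction of generated comments that match a real one
    precision = n_matched / n_generated if n_generated else 0.0
    # recall: fraction of real reviewer comments the system recovered
    recall = n_matched / n_gold if n_gold else 0.0
    return precision, recall

# Illustrative numbers only: 39 generated, 20 gold, 16 matched
# gives roughly the pilot-level figures (recall 0.80, precision ~0.41).
p, r = review_precision_recall(39, 20, 16)
```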
| eLife Article | Comments | Major | Minor | Suggestion | Question | Decision | Confidence | Cost |
|---|---|---|---|---|---|---|---|---|
| 108498 | 20 | 7 | 6 | 4 | 3 | Major Revision | 0.55 | $1.26 |
| 104054 | 19 | 5 | 9 | 3 | 2 | Minor Revision | 0.72 | $1.40 |
| 78908 | 19 | 6 | 7 | 3 | 3 | Major Revision | 0.68 | $1.42 |
| 84749 | 19 | 5 | 8 | 3 | 3 | Major Revision | 0.70 | $1.54 |
| 102144 | 17 | 5 | 6 | 3 | 3 | Major Revision | 0.72 | $1.25 |
| 100396 | 12 | 3 | 4 | 2 | 3 | Minor Revision | 0.78 | $0.71 |
| 85300 | 10 | 3 | 3 | 2 | 2 | Minor Revision | 0.80 | $0.65 |
| 86334 | 9 | 2 | 3 | 2 | 2 | Minor Revision | 0.82 | $0.56 |
| ... 21 more articles (avg 12.8 comments, range 9–17) ... | ||||||||
Higher major-comment count correlates with major revision decisions (r=0.72). Articles with confidence > 0.75 tend to receive minor revision recommendations. Average review cost of $0.89/article makes large-scale screening feasible. All 29 articles are from eLife's open peer review corpus with published decision letters.
6-rater cross-validation on 338 biomedical abstract pairs: 3 models × 2 prompt strategies. Total evaluation cost: $0.36.
| Model | Prompt | Precision | Recall | F1 | Fail % | Cost |
|---|---|---|---|---|---|---|
| Llama 4 Scout | contrastive | 0.599 | 0.932 | 0.729 | 32.5% | $0.05 |
| DeepSeek V3.2 | contrastive | 0.593 | 0.854 | 0.700 | 0% | $0.14 |
| Gemini 2.5 Flash | baseline | 0.619 | 0.645 | 0.632 | 0% | $0.00 |
| Gemini 2.5 Flash | contrastive | 0.459 | 0.967 | 0.623 | 7.7% | $0.00 |
| DeepSeek V3.2 | baseline | 0.778 | 0.285 | 0.417 | 0% | $0.05 |
| Llama 4 Scout | baseline | 0.933 | 0.156 | 0.267 | 17.2% | $0.06 |
PROMPT > MODEL: the contrastive prompt yields recall of 0.85–0.97 across all model families, while the baseline prompt reaches only 0.16–0.65. Prompt engineering matters more than model selection for contradiction detection.
Results are validated against gold-standard annotations.
Composable scoring pipeline: each layer operates independently and scores are combined via weighted sum with automatic renormalization when ground truth is unavailable.
Saved runs record `fair_mode: true` for reproducibility. BioAgent Score follows the same renormalization pattern as RCMXT (our evidence confidence vector): when an axis has no ground truth, its weight is removed and the remaining weights are rescaled to sum to 1.0. This prevents missing data from inflating scores.
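The renormalization rule above can be sketched in a few lines. Axis names and weights below are assumptions for illustration, not the production configuration:

```python
# Hedged sketch of the weight-renormalization rule: axes without
# ground truth are dropped and the remaining weights rescaled to sum
# to 1.0, so a missing axis cannot inflate (or deflate) the score.
def renormalized_score(scores, weights):
    """scores: axis -> float in [0, 1], or None when no ground truth."""
    available = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(available.values())
    if total == 0:
        return 0.0
    # rescale each remaining weight by 1/total so they sum to 1.0
    return sum(scores[k] * (w / total) for k, w in available.items())

# Illustrative weights (not the real configuration):
weights = {"gene": 0.4, "pathway": 0.3, "direction": 0.2, "judge": 0.1}

# Direction accuracy unavailable, e.g. a query-only dataset like GTEx:
score = renormalized_score(
    {"gene": 0.8, "pathway": 0.6, "direction": None, "judge": 0.9}, weights)
# remaining weights rescale to 0.5/0.375/0.125
# -> 0.8*0.5 + 0.6*0.375 + 0.9*0.125 = 0.7375
```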