SpaceOmicsBench: 100 questions across 9 spaceflight omics modalities and 4 difficulty tiers
Top across all difficulty tiers with exceptional Reasoning (4.97) and Completeness (4.77). 93 novel insights flagged.
4.34/5.00, surpassing Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash. Especially strong on Hard and Expert questions.
With max_tokens=8192, Gemini 2.5 Flash scores 4.00, uniform across difficulties (Easy 3.52 to Expert 4.04). The previous Expert score of 2.74 was a truncation artifact.
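The truncation effect described above can be illustrated with a minimal sketch that averages scores per difficulty tier with and without responses cut off at the token limit. The record fields (`tier`, `score`, `finish_reason`), the `"length"` marker for a truncated response, and the sample values are all assumptions for illustration; they are not taken from the benchmark's own pipeline or results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tier(records, drop_truncated=True):
    """Average scores per difficulty tier.

    Each record is a dict with hypothetical keys:
      tier          - difficulty label, e.g. "Easy" or "Expert"
      score         - judge score on a 1-5 scale
      finish_reason - "length" is assumed to mean the response hit max_tokens
    """
    by_tier = defaultdict(list)
    for rec in records:
        if drop_truncated and rec["finish_reason"] == "length":
            continue  # skip answers that were cut off mid-response
        by_tier[rec["tier"]].append(rec["score"])
    return {tier: round(mean(scores), 2) for tier, scores in by_tier.items()}


# Placeholder data, not benchmark results.
records = [
    {"tier": "Easy", "score": 4.0, "finish_reason": "stop"},
    {"tier": "Expert", "score": 4.0, "finish_reason": "stop"},
    {"tier": "Expert", "score": 1.0, "finish_reason": "length"},
]

with_truncated = mean_score_by_tier(records, drop_truncated=False)
without_truncated = mean_score_by_tier(records, drop_truncated=True)
```

Comparing the two dictionaries shows how a few truncated long-form answers can pull down a single tier's average while leaving the others unchanged.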
GPT-4o Mini (3.32) edges ahead; GPT-4o and Llama-70B both score ~3.30. Scale provides negligible benefit in this specialized domain.
| # | Model | Score | Easy | Med | Hard | Expert | Factual | Reason | Complete | Uncert | Domain | Halluc | Novel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|