LLM Evaluation Leaderboard

SpaceOmicsBench — 100 questions across 9 spaceflight omics modalities & 4 difficulty tiers

Judge: Claude Sonnet 4.6 · 100 Questions · 1–5 Scale · 9 Models · 5 Dimensions
πŸ† Overall Leaderboard
Weighted score across 5 dimensions (Factual 25%, Reasoning 25%, Completeness 20%, Uncertainty 15%, Domain 15%). πŸ”’ proprietary API Β· πŸ”“ open-weights
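The weighting above can be sketched as a small helper. This is an illustrative sketch of the scoring arithmetic only; the dimension keys and the example scores below are made up, not values from the table.

```python
# Dimension weights from the leaderboard description:
# Factual 25%, Reasoning 25%, Completeness 20%, Uncertainty 15%, Domain 15%.
WEIGHTS = {
    "factual": 0.25,
    "reasoning": 0.25,
    "completeness": 0.20,
    "uncertainty": 0.15,
    "domain": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Overall score: weighted sum of per-dimension scores (each on a 1-5 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Illustrative per-dimension scores -- NOT actual leaderboard values.
example = {"factual": 4.5, "reasoning": 4.97, "completeness": 4.77,
           "uncertainty": 4.2, "domain": 4.6}
print(f"{weighted_score(example):.2f}")
```

Because the weights sum to 1.0, the overall score stays on the same 1–5 scale as the individual dimensions.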
🌟 Claude Sonnet 4.6 leads at 4.62

Top across all difficulty tiers with exceptional Reasoning (4.97) and Completeness (4.77). 93 novel insights flagged.

🔓 DeepSeek-V3: best open-weights

4.34/5.00 — surpasses Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash. Especially strong on Hard & Expert questions.

⚠️ Gemini: thinking mode matters

With max_tokens=8192, Gemini 2.5 Flash scores 4.00 — uniform across difficulties (Easy 3.52 → Expert 4.04). The previous 2.74 Expert score was a truncation artifact.

≈ Bottom tier clusters at ~3.30

GPT-4o Mini (3.32) edges ahead; GPT-4o and Llama-70B both sit at ~3.30. Scale provides negligible benefit in this specialized domain.

📈 Difficulty Profile
Score trajectory from Easy → Expert questions. Reveals which models degrade gracefully vs. collapse under complexity.
πŸ•ΈοΈ 5-Dimension Breakdown
Select up to 3 models to compare across the 5 scoring dimensions. Uncertainty Calibration is the universal weak spot.
🧬 Performance by Modality
Score breakdown across 9 omics modalities. Color: ■ low → ■ mid → ■ high (scale 1–5).
📋 Full Data Table
Click column headers to sort. All scores on 1–5 scale, judged by Claude Sonnet 4.6. v2.1: Q27/Q28/Q64 ground truth corrected; Gemini re-evaluated with max_tokens=8192 (thinking mode).
| # | Model | Score | Easy | Med | Hard | Expert | Factual | Reason | Complete | Uncert | Domain | Halluc | Novel |
|---|-------|-------|------|-----|------|--------|---------|--------|----------|--------|--------|--------|-------|