LLM Evaluation Leaderboard

SpaceOmicsBench — 100 questions across 9 spaceflight omics modalities & 4 difficulty tiers

Judge: Claude Sonnet 4.6 · 100 Questions · 1–5 Scale · 9 Models · 5 Dimensions
πŸ† Overall Leaderboard
Weighted score across 5 dimensions (Factual 25%, Reasoning 25%, Completeness 20%, Uncertainty 15%, Domain 15%). πŸ”’ proprietary API Β· πŸ”“ open-weights
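The weighting above can be sketched as a small helper. This is an illustrative sketch of the scoring arithmetic only; the dimension keys and the example scores below are made up, not values from the table.

```python
# Dimension weights from the leaderboard description:
# Factual 25%, Reasoning 25%, Completeness 20%, Uncertainty 15%, Domain 15%.
WEIGHTS = {
    "factual": 0.25,
    "reasoning": 0.25,
    "completeness": 0.20,
    "uncertainty": 0.15,
    "domain": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Overall score: weighted sum of per-dimension scores (each on a 1-5 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Illustrative per-dimension scores -- NOT actual leaderboard values.
example = {"factual": 4.5, "reasoning": 4.97, "completeness": 4.77,
           "uncertainty": 4.2, "domain": 4.6}
print(f"{weighted_score(example):.2f}")
```

Because the weights sum to 1.0, the overall score stays on the same 1–5 scale as the individual dimensions.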
🌟 Claude Sonnet 4.6 leads at 4.62

Top across all difficulty tiers with exceptional Reasoning (4.97) and Completeness (4.77). 93 novel insights flagged.

🔓 DeepSeek-V3: best open-weights

4.34/5.00 — surpasses Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash. Especially strong on Hard & Expert questions.

⚠️ Gemini: thinking mode matters

With max_tokens=8192, Gemini 2.5 Flash scores 4.00 — uniform across difficulties (Easy 3.52 → Expert 4.04). The previous 2.74 Expert score was a truncation artifact.

≈ Bottom tier clusters at ~3.30

GPT-4o Mini (3.32) edges ahead; GPT-4o and Llama-70B both sit at ~3.30. Scale provides negligible benefit in this specialized domain.

📈 Difficulty Profile
Score trajectory from Easy → Expert questions. Reveals which models degrade gracefully vs. collapse under complexity.
πŸ•ΈοΈ 5-Dimension Breakdown
Select up to 3 models to compare across the 5 scoring dimensions. Uncertainty Calibration is the universal weak spot.
🧬 Performance by Modality
Score breakdown across 9 omics modalities. Color: ■ low → ■ mid → ■ high (scale 1–5).
📋 Full Data Table
Click column headers to sort. All scores on 1–5 scale, judged by Claude Sonnet 4.6. v2.1: Q27/Q28/Q64 ground truth corrected; Gemini re-evaluated with max_tokens=8192 (thinking mode).
| # | Model | Score | Easy | Med | Hard | Expert | Factual | Reason | Complete | Uncert | Domain | Halluc | Novel |
|---|-------|-------|------|-----|------|--------|---------|--------|----------|--------|--------|--------|-------|