SpaceOmicsBench

A multi-omics AI benchmark for spaceflight biomedical data — 21 ML tasks across 9 modalities, plus a 100-question LLM evaluation, built from the Inspiration4, NASA Twins, and JAXA CFE missions.

At a glance: 21 ML tasks · 9 modalities · 100 LLM questions · 7 baselines · 152K+ total samples
Baseline Results
Performance across all 21 tasks; the best score in each row is highlighted. Tasks E2 and E3 are supplementary due to extreme class imbalance.
Columns: Task · Name · Category · Tier · Metric · Random · Majority · LogReg · RF · MLP · XGB · LGBM
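The Random and Majority columns are chance-level reference rows. A majority-class baseline simply predicts the most frequent training label for every input; a minimal stdlib sketch (illustrative, not the benchmark's actual code):

```python
from collections import Counter

def majority_baseline(y_train):
    """Return a predictor that always outputs the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda X: [label for _ in X]

predict = majority_baseline([0, 0, 0, 1, 1])
print(predict(["a", "b", "c"]))  # → [0, 0, 0]
```

Any learned model (LogReg, RF, MLP, XGB, LGBM) must clear this bar to demonstrate real signal.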
Performance Analysis
Normalized composite scores, category radar, and difficulty distribution.
Normalized Composite Score — (score − random) / (1 − random)
RF Category Performance — radar view
Difficulty Tier Distribution — 21 tasks
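The composite-score normalization above rescales each metric so that 0 is chance level and 1 is a perfect score, making tasks with different random baselines comparable; a minimal sketch:

```python
def normalized_composite(score, random_baseline):
    """Rescale a raw metric: 0 = random-guess level, 1 = perfect score."""
    return (score - random_baseline) / (1.0 - random_baseline)

# e.g. a score of 0.86 on a task where random guessing scores 0.50:
print(normalized_composite(0.86, 0.50))  # ≈ 0.72
```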
B1 Feature Ablation — AUPRC by feature set
Insight: Removing effect-size features (fold-changes) preserves RF/MLP performance (0.86 AUPRC), confirming the task tests genuine distributional signal rather than simple effect-size thresholding.
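AUPRC in the ablation above is average precision: the mean of precision@k taken at each true-positive rank. A self-contained pure-Python version of the metric (a sketch, not the benchmark's evaluation code):

```python
def average_precision(y_true, scores):
    """Average precision (AUPRC): mean of precision@k at each true-positive rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # rank by score, descending
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at this rank
    return ap / hits if hits else 0.0

print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # ≈ 0.833
```

Re-running this metric with each feature group dropped is what the ablation chart reports.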
LLM Evaluation
100-question benchmark scored along 5 dimensions. Five models evaluated: Claude Sonnet 4.6, Claude Haiku 4.5, Claude Sonnet 4, GPT-4o, and GPT-4o Mini.
5-Model Ranking — Sonnet 4.6 Judge, weighted score (1–5)
Finding: Claude models rank 1–3. Haiku 4.5 notably outperforms Sonnet 4 (+0.36). GPT-4o ≈ GPT-4o Mini (3.30 each) — scale provides negligible benefit here.
Dimension Breakdown — Sonnet 4.6 Judge
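Per-dimension judge scores collapse into the single weighted score via a weighted mean; a sketch in which the dimension names and weights are illustrative assumptions, not the benchmark's actual configuration:

```python
# Hypothetical dimensions and weights — placeholders, not SpaceOmicsBench's real config.
WEIGHTS = {
    "accuracy": 0.35,
    "completeness": 0.25,
    "reasoning": 0.20,
    "grounding": 0.10,
    "clarity": 0.10,
}

def weighted_score(dim_scores):
    """Collapse per-dimension judge scores (each on a 1-5 scale) into one number."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * dim_scores[d] for d, w in WEIGHTS.items())

print(weighted_score({d: 4 for d in WEIGHTS}))  # ≈ 4.0 when every dimension scores 4
```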