SpaceOmicsBench

A multi-omics AI benchmark for spaceflight biomedical data — 21 ML tasks across 9 modalities, plus a 100-question LLM evaluation, built from the Inspiration4, NASA Twins, and JAXA CFE missions.

At a glance: 21 ML tasks · 9 modalities · 100 LLM questions · 7 baselines · 152K+ total samples
Baseline Results
Performance across all 21 tasks; the best score in each row is highlighted. Tasks E2 and E3 are supplementary due to extreme class imbalance.
Columns: Task · Name · Category · Tier · Metric · Random · Majority · LogReg · RF · MLP · XGB · LGBM
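The Random and Majority columns are chance-level reference rows. A majority-class baseline simply predicts the most frequent training label for every input; a minimal stdlib sketch (illustrative, not the benchmark's actual code):

```python
from collections import Counter

def majority_baseline(y_train):
    """Return a predictor that always outputs the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda X: [label for _ in X]

predict = majority_baseline([0, 0, 0, 1, 1])
print(predict(["a", "b", "c"]))  # → [0, 0, 0]
```

Any learned model (LogReg, RF, MLP, XGB, LGBM) must clear this bar to demonstrate real signal.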
Performance Analysis
Normalized composite scores, category radar, and difficulty distribution.
Normalized Composite Score — (score − random) / (1 − random)
RF Category Performance — radar view
Difficulty Tier Distribution — 21 tasks
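The composite-score normalization above rescales each metric so that 0 is chance level and 1 is a perfect score, making tasks with different random baselines comparable; a minimal sketch:

```python
def normalized_composite(score, random_baseline):
    """Rescale a raw metric: 0 = random-guess level, 1 = perfect score."""
    return (score - random_baseline) / (1.0 - random_baseline)

# e.g. a score of 0.86 on a task where random guessing scores 0.50:
print(normalized_composite(0.86, 0.50))  # ≈ 0.72
```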
B1 Feature Ablation — AUPRC by feature set
Insight: Removing effect-size features (fold-changes) preserves RF/MLP performance (0.86 AUPRC), confirming the task tests genuine distributional signal rather than simple effect-size thresholding.
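AUPRC in the ablation above is average precision: the mean of precision@k taken at each true-positive rank. A self-contained pure-Python version of the metric (a sketch, not the benchmark's evaluation code):

```python
def average_precision(y_true, scores):
    """Average precision (AUPRC): mean of precision@k at each true-positive rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # rank by score, descending
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at this rank
    return ap / hits if hits else 0.0

print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # ≈ 0.833
```

Re-running this metric with each feature group dropped is what the ablation chart reports.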
LLM Evaluation
100-question benchmark scored along 5 dimensions. Five models evaluated: Claude Sonnet 4.6, Claude Haiku 4.5, Claude Sonnet 4, GPT-4o, and GPT-4o Mini.
5-Model Ranking — Sonnet 4.6 Judge, weighted score (1–5)
Finding: Claude models rank 1–3. Haiku 4.5 notably outperforms Sonnet 4 (+0.36). GPT-4o ≈ GPT-4o Mini (3.30 each) — scale provides negligible benefit here.
Dimension Breakdown — Sonnet 4.6 Judge
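Per-dimension judge scores collapse into the single weighted score via a weighted mean; a sketch in which the dimension names and weights are illustrative assumptions, not the benchmark's actual configuration:

```python
# Hypothetical dimensions and weights — placeholders, not SpaceOmicsBench's real config.
WEIGHTS = {
    "accuracy": 0.35,
    "completeness": 0.25,
    "reasoning": 0.20,
    "grounding": 0.10,
    "clarity": 0.10,
}

def weighted_score(dim_scores):
    """Collapse per-dimension judge scores (each on a 1-5 scale) into one number."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * dim_scores[d] for d, w in WEIGHTS.items())

print(weighted_score({d: 4 for d in WEIGHTS}))  # ≈ 4.0 when every dimension scores 4
```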