Current image generation models produce visually compelling but scientifically implausible images, exposing a fundamental gap between visual fidelity and physical realism. We introduce Science-T2I, an expert-annotated dataset comprising over 20k adversarial image pairs and 9k prompts across 16 scientific domains, along with an isolated test set of 454 challenging prompts. Using this benchmark, we evaluate 18 recent image generation models and find that none scores above 50 out of 100 under implicit scientific prompts, while explicit prompts yield scores roughly 35 points higher, confirming a systematic composition–reasoning disconnect. To address this, we develop SciScore, a reward model fine-tuned from CLIP-H that surpasses GPT-4o and human evaluators by roughly 5 points. We further propose a two-stage alignment framework combining supervised fine-tuning with masked online fine-tuning. Applying this framework to FLUX.1[dev] yields a relative improvement exceeding 50%, demonstrating that scientific reasoning in image generation can be substantially improved through targeted data and alignment.

Science-T2I: Bridging Visual Imagination and Scientific Realism

Current image generation models produce visually plausible but scientifically incorrect outputs when prompted with descriptions that require physical reasoning. We identify two root causes: (1) existing training data rarely pairs scientific concepts with their correct visual manifestations, and (2) standard evaluation protocols do not test whether a model understands the science behind a prompt. To address both issues, we introduce Science-T2I, a dataset that challenges models to perform implicit scientific reasoning.

Task overview. As shown below, Science-T2I consists of 16 tasks spanning physics, chemistry, and biology. Each task requires the model to infer or visualize a concept not explicitly stated in the prompt but rooted in an underlying scientific principle.

**Fig 2:** (Left) Science-T2I is organized into three scientific fields, each divided into specific categories. (Right) Word cloud of structured prompts in Science-T2I.

Task classification. Beyond scientific discipline, the tasks naturally fall into two categories that reveal distinct reasoning demands:

1. Subject-oriented Tasks (ST) require reasoning about how inherent differences between subjects lead to varying visual features under identical conditions. For example, different metals produce different flame colors.

2. Condition-oriented Tasks (CT) focus on how a single condition affects the visual appearance of various subjects. Here, scientific reasoning centers on the applied condition rather than the subject's individual properties. For example, all objects float in the absence of gravity.

**Fig 3:** Subject-oriented vs. condition-oriented task classification. Ten tasks are condition-oriented and six are subject-oriented.

Prompt design. A central design choice is the three-tier prompt structure, which disentangles a model's scientific reasoning ability from its compositional rendering ability. For each task, we construct a tuple of three prompts:

1. Implicit Prompt (IP) contains terms that imply certain visual characteristics requiring interpretative reasoning based on scientific knowledge. For example, "an unripe apple" suggests greenness without explicitly stating it.

2. Explicit Prompt (EP) reformulates the implicit prompt into a clear, descriptive statement that directly conveys the intended visual outcome. For instance, "a green apple" makes the expected appearance explicit.

3. Superficial Prompt (SP) provides a plausible but scientifically incorrect interpretation, focusing only on surface-level associations. For example, "a red apple" interprets "unripe" based on the default visual prototype rather than the scientific implication.

Together, the three prompt types serve complementary roles: the IP tests implicit reasoning, the EP establishes an upper bound on what the model can render when given direct instructions, and the SP provides a hard negative for preference-based training.

**Fig 4:** Data curation pipeline. GPT-4o generates structured templates, then expands each implicit prompt into explicit and superficial counterparts, guiding the synthesis of corresponding images.

Data curation. We leverage GPT-4o to generate structured templates and corresponding prompts for each task. The pipeline first produces implicit prompts from task-specific templates, then expands each into its explicit and superficial counterparts. These prompts are used to generate image pairs via image generation models. All generated data undergoes manual verification by domain experts, who cross-reference each tuple against established scientific knowledge.

Science-T2I Examples

Benchmarking Scientific Image Synthesis

Test Splits. We construct two manually annotated test sets covering the same scientific domains as the training data but containing no overlapping samples:

1. Science-T2I-S (Simple), containing 227 tuples. This split uses minimalist compositions with simple backgrounds. By isolating the scientific subject from environmental distractions, it provides a controlled test of whether the model can generate the correct scientific visual features.

2. Science-T2I-C (Complex), containing 227 tuples. This split embeds the same scientific tasks within diverse real-world scenes, adding contextual elements such as "in a bedroom" or "on the street." It tests whether the model can maintain scientific accuracy when the background is visually complex.

Science-T2I-S&C Leaderboard

🤩 Submit your LMM/VLM scores now and watch the leaderboard refresh with your achievements! Email us at .

Click the button below to view different results!

#	Model	Overall	Physics										Chemistry				Biology
#	Model	Acc	GR	SO	ME	AB	BU	DI	EL	EV	LI	Avg	RU	IM	FR	Avg	LR	WR	SC	RI	Avg
1	CLIP-H VLM	54.7	78.3	40.5	25.0	57.1	63.9	71.4	47.6	26.7	77.8	55.1	16.7	54.2	77.8	52.4	62.2	31.1	81.5	34.6	55.9
2	BLIPScore VLM	55.0	47.5	44.1	56.9	38.1	50.0	50.0	52.4	20.0	33.3	50.4	42.9	53.1	76.7	43.1	76.7	38.9	58.3	38.5	59.9
3	SigLIP VLM	57.2	78.3	45.2	44.4	57.1	58.3	83.3	47.6	63.3	58.3	59.6	23.8	60.4	62.2	53.2	46.7	33.3	83.3	53.9	55.9
4	Qwen2-VL-7B LMM	63.8	83.3	42.9	26.4	40.5	47.2	70.2	77.4	68.3	84.7	60.0	73.8	66.7	52.2	67.0	57.8	67.8	95.4	34.6	68.8
5	LLaVA-OV-7B LMM	65.1	92.5	56.0	36.1	38.1	45.8	75.0	77.4	100	95.8	68.2	59.5	55.2	48.9	57.8	51.1	72.2	78.7	46.2	64.7
6	Human Eval	87.0	93.0	86.1	98.2	66.7	74.6	65.9	95.6	100	82.1	87.7	92.9	77.8	81.0	75.9	96.9	99.6	90.7	94.6	95.3
7	InternVL2.5-8B LMM	70.8	96.7	52.4	41.7	47.6	55.6	63.1	72.6	91.7	90.3	67.8	69.1	56.3	52.2	62.2	84.4	90.0	84.3	75.0	84.4
8	GPT-4o mini LMM	70.8	71.3	35.7	36.1	33.3	56.9	77.4	82.1	100	76.4	62.0	95.2	65.6	58.9	73.8	96.7	83.3	97.2	53.9	86.8
9	SciScore VLM	93.1	98.3	90.5	100	71.4	66.7	97.6	100	100	100	94.9	100	68.8	97.8	81.0	100	100	100	100	100

#	Model	Overall	Physics	Chemistry	Biology
1	CLIP-H VLM	59.5	75.0	57.1	66.7	64.3	58.3	78.6	21.4	0.0	66.7	56.6	35.7	50.0	46.7	44.4	80.0	60.0	88.9	75.0	76.7
2	BLIPScore VLM	51.5	40.0	42.9	58.3	50.0	62.5	50.0	28.6	50.0	29.2	49.8	57.1	62.5	60.0	60.0	53.3	46.7	75.0	54.2	58.3
3	SigLIP VLM	61.7	95.0	42.9	58.3	64.3	66.7	85.7	42.9	30.0	41.7	61.5	28.6	62.5	60.0	51.1	46.7	53.3	94.4	83.3	70.0
4	Qwen2-VL-7B LMM	69.8	100	57.1	41.7	57.1	66.7	64.3	67.9	55.0	70.8	66.8	60.7	53.1	36.7	50.0	93.3	96.7	94.4	75.0	90.8
5	LLaVA-OV-7B LMM	70.3	100	71.4	45.8	57.1	75.0	71.4	64.3	100	79.2	74.6	53.6	59.4	36.7	50.0	63.3	86.7	77.8	79.2	76.7
6	Human Eval	86.0	98.6	77.6	91.0	78.6	83.3	66.8	90.9	95.7	76.8	84.7	92.9	86.6	77.1	85.4	88.6	84.8	96.8	83.8	89.1
7	InternVL2.5-8B LMM	75.3	97.5	53.6	58.3	57.1	75.0	92.9	57.1	95.0	70.8	73.8	75.0	71.9	50.0	65.6	90.0	93.3	75.0	87.5	85.8
8	GPT-4o mini LMM	74.8	97.5	50.0	67.7	50.0	54.2	67.9	64.3	90.0	75.0	69.3	89.3	68.8	53.3	70.0	100	83.3	88.9	87.5	90.0
9	SciScore VLM	91.2	100	92.9	100	71.4	41.7	85.7	85.7	100	100	86.9	92.9	81.3	100	91.1	100	100	100	100	100

Image Generation Benchmark

Evaluation Setup. We evaluate 18 image generation models on the Science-T2I test set using implicit prompts. Each model generates two images per prompt, scored by Qwen3.5-27B using per-tuple grading criteria. We report the normalized Reality Score (0–100) across Physics, Chemistry, and Biology.

#	Model	Overall	Physics	Chemistry	Biology
1	SDXL	19.60	16.11	20.92	25.56
2	FLUX.1[schnell]	23.72	23.19	38.30	13.33
3	LongCat-Image	23.86	21.94	39.01	15.83
4	GLM-Image	24.23	22.22	41.84	14.44
5	SANA	26.14	23.89	46.45	14.72
6	Lumina-Image-2	26.51	20.28	47.87	22.22
7	SD 3.5 Medium	26.73	23.06	39.01	24.44
8	Z-Image	26.73	26.53	32.98	22.22
9	FLUX.1[dev]	26.87	22.64	50.00	17.22
10	SD 3.5 Large	28.71	27.64	44.68	18.33
11	Bagel	29.00	25.28	45.75	23.33
12	FLUX.2[klein-4B-Base]	29.08	27.22	43.97	21.11
13	Z-Image-Turbo	29.22	26.81	36.53	28.33
14	FLUX.2[klein-4B]	30.84	28.75	50.36	19.72
15	FLUX.2[klein-9B-Base]	31.57	30.28	45.75	23.06
16	FLUX.2[klein-9B]	32.60	29.44	57.45	19.44
17	Qwen-Image	33.99	34.58	46.45	23.06
18	FLUX.2[dev]	47.80	53.19	53.55	32.50

Current models lack scientific reasoning. Even the best model (FLUX.2[dev]) scores only 47.80 out of 100, and the majority cluster between 20 and 35. Biology poses the greatest challenge: no model exceeds 33%. When evaluated with explicit prompts that directly describe the intended scientific outcome, scores increase by roughly 35 points on average, confirming that the bottleneck is not in visual rendering but in scientific reasoning.

SciScore: A Reward Model for Scientific Image Assessment

CLIP aligns textual and visual data effectively for general purposes, but struggles with implicit scientific prompts: it tends to embed implicit prompts closer to their superficial counterparts than their explicit counterparts. We introduce SciScore, a reward model fine-tuned on the Science-T2I training set that extends CLIP-H to assess whether a generated image reflects the scientific principles implied by a prompt.

**Tab 1:** (Left) SciScore surpasses all VLMs, LMMs, and human evaluators on both test splits. (Right) Nearly all failures concentrate in subject-oriented tasks.

SciScore surpasses human evaluators. SciScore achieves 93.14 on Science-T2I-S and 91.19 on Science-T2I-C, surpassing human evaluators (87.01 and 86.02) by roughly 6 and 5 points, respectively. The strong performance on Science-T2I-C is particularly notable, as this split embeds scientific reasoning tasks within diverse environmental contexts not present during training.

Subject-oriented tasks remain the primary challenge. Nearly all failure cases concentrate in subject-oriented tasks, which demand subject-specific knowledge (e.g., which metals produce which flame colors). Condition-oriented tasks, by contrast, involve generalizable visual patterns that transfer more easily across subjects.

Reality Alignment: Two-stage Fine-tuning Framework

We combine the Science-T2I training set with SciScore as a reward signal to close the reasoning gap. Our framework proceeds in two stages: SFT exposes the model to scientific visual phenomena it has never encountered during pretraining, and OFT then optimizes for the implicit reasoning ability measured by SciScore.

Supervised Fine-tuning. Pre-trained models have never been exposed to images depicting scientific phenomena, so no amount of preference optimization over existing outputs can teach them what these phenomena look like. We therefore begin with supervised fine-tuning on the Science-T2I training set using FLUX.1[dev] as our base model.

Masked Online Fine-tuning. SFT teaches the model what scientific images look like, but does not directly optimize for the ability to infer the correct visual outcome from an implicit prompt. We apply online fine-tuning with SciScore as the reward signal and a subject-based masking strategy that restricts gradients to the scientifically relevant region.

**Fig 5:** Online fine-tuning uses SciScore and subject-based masking to align generation with scientific principles. For each prompt, two images are generated and scored by SciScore to determine preference. GroundingDINO extracts subject masks from each image, restricting gradient propagation to the scientifically relevant regions.

Results and Ablation

Relative improvement metric. Raw SciScore values are useful for comparing models but do not reveal how close a fine-tuned model is to its own ceiling. We observe that SciScore under explicit prompts consistently surpasses that under implicit prompts, providing a natural upper bound. To quantify progress toward this ceiling, we define the Relative Improvement (RI) metric: \[ \begin{equation} RI = \frac{\text{SciScore}_F^{IP}-\text{SciScore}_B^{IP}}{\text{SciScore}_B^{EP}-\text{SciScore}_B^{IP}} \end{equation} \] An RI of 100% would indicate that fine-tuning has fully closed the gap between implicit and explicit prompts.

SFT and OFT together exceed the baseline by over 50%. The combined two-stage framework increases the score from 23.56 to 28.52 on Science-T2I-S (RI = 53.39%) and from 27.26 to 30.11 on Science-T2I-C (RI = 38.31%). SFT contributes the larger share of the improvement: it provides the foundational knowledge of what scientific images look like, while OFT refines the model's ability to activate that knowledge from implicit prompts.

**Tab 2:** The two-stage framework improves FLUX.1[dev] by over 50% in RI.

Generalization to Complex Scenes. While the training set primarily contains images in straightforward scenarios, the fine-tuned model shows clear improvement on Science-T2I-C. This indicates that the model has internalized underlying scientific principles rather than memorizing training examples.

**Fig 6:** SFT is essential before OFT, and masking stabilizes training. Without prior SFT (purple), OFT fails to improve SciScore. Without masking (yellow/red), performance becomes erratic or stalls.

OFT without SFT fails. Without the scientific knowledge base provided by SFT (purple curve), the model receives two scientifically incorrect images; the preference signal between two poor samples provides insufficient gradient information for meaningful learning.

Masking suppresses noise from irrelevant features. Without masking (yellow curve), the model collapses because it tries to match all visual features of the preferred image. Halving the learning rate (red) prevents collapse but fails to improve SciScore. The masked configuration (blue) produces stable and consistent improvement.

**Fig 7:** The two-stage framework corrects scientific errors while preserving visual quality. Upper row: base FLUX.1[dev]. Lower row: our fine-tuned model. Each pair uses an identical random seed. Displayed prompts are simplified summaries.

Science-T2I

Addressing Scientific Illusions in Image Synthesis