Watch Before You Answer

Learning from Visually Grounded Post-Training

Yuxuan Zhang1,2,3 EunJeong Hwang1,2 Huaisong Zhang4 Penghui Du4 Yiming Jia5 Dongfu Jiang2,6 Xuan He7 Shenhui Zhang4 Ping Nie6 Peter West1 Kelsey R. Allen1,2
1University of British Columbia 2Vector Institute 3Etude AI 4Kolors Team, Kuaishou Technology 5University of Toronto 6University of Waterloo 7University of Illinois at Urbana-Champaign
Teaser: visual-grounding gains

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long-video understanding benchmarks contain 40–60% of questions that can be answered using text cues alone. These issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding.

Guided by this observation, we introduce a simple yet effective solution: using only the actual visually grounded (VG) questions for post-training. Combined with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original data. Data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs.

  • 40–60% TA questions in benchmarks
  • +6.2 points over Full data
  • 69.1% of original data used
  • 3 benchmarks, 16/32/64 frames

Introduction

Video understanding is central to real-world applications such as autonomous driving, assistive robotics, and long-form content analysis. Despite recent advances in VLMs, performance on video understanding has lagged behind text-only reasoning, especially for tasks involving long-context video.

Visual-gain breakdown for Qwen-2.5-VL variants
Even for leading VLMs (Qwen-2.5-VL), most performance on video benchmarks comes from language priors (pink) rather than visual comprehension. Gains as model size increases are driven almost entirely by text-only reasoning improvements, with visual grounding sometimes worsening in the larger variant.

In this work, we show that community progress in video understanding for VLMs is worse than initially thought: a majority of apparent gains come from models’ ability to answer questions without access to the video. This phenomenon, known as linguistic shortcutting, is well established as a serious problem in Visual Question Answering (VQA). Because such text-answerable questions make up a large portion of these benchmarks, the benchmarks are problematic for measuring genuine video understanding.

Prior work on VQA-shortcutting has long documented that language priors can dominate multimodal models, but the phenomenon has been less scrutinized in the video setting. Our study sits alongside a recent wave of RL-based video post-training efforts — Video-R1, LongVILA-R1, TW-GRPO, and Video-RTS — which all train on the same noisy corpora and which we compare against as baselines.

Key Contributions

  • Identifying linguistic shortcutting — pervasive in both video benchmarks (40–60% text-only answerable) and post-training datasets.
  • Visually Grounded filtering — a drop-in post-training data filter that keeps only questions requiring visual reasoning.
  • Data efficiency — using 69.1% of the original post-training data yields consistent gains of up to 6.2 points.
  • Outperforming complex methods — a simple GRPO + VG filter beats several more advanced post-training strategies.

Text-Only Answerability Analysis

To quantify linguistic shortcutting, we probe frontier LLMs with only the question text and answer options — no video input — across three popular video benchmarks. Accuracies far above random chance indicate a large fraction of questions are solvable without visual grounding at all.

We further categorize text-only answerable (TA) questions into four recurring failure modes: textual shortcuts (surface cues in the question wording), external knowledge (answerable from world priors), inferential elimination (implausible distractors), and imagined content (plausible hallucinated scenes). These four types account for nearly all TA cases we observed in Video-R1-260K and in the benchmarks.
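Concretely, a TA flag of this kind can be implemented as a small vote over text-only probes. Below is a minimal sketch; the model names, fields, toy data, and the two-probe threshold are illustrative assumptions, not the paper's exact protocol:

```python
def is_text_only_answerable(probe_answers, gold, min_correct=2):
    """Flag a question as text-only answerable (TA) if at least
    `min_correct` text-only probes (models shown the question and
    options but no video) pick the gold option."""
    correct = sum(1 for a in probe_answers.values() if a == gold)
    return correct >= min_correct

# Toy probe results for two questions (hypothetical data):
# each entry maps probe-model name -> predicted option letter.
probes = {
    "q1": ({"gpt": "B", "gemini": "B", "claude": "A"}, "B"),
    "q2": ({"gpt": "C", "gemini": "A", "claude": "D"}, "B"),
}
ta_flags = {qid: is_text_only_answerable(answers, gold)
            for qid, (answers, gold) in probes.items()}
print(ta_flags)  # q1 is TA (two probes correct), q2 is not
```

Accuracy far above random chance under such a probe, aggregated over a benchmark, gives the TA fractions reported in Table 1.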

Nested pie chart: TA composition across benchmarks
Representative text-only-answerable (TA) cases in training data
Left: breakdown of text-only answerable (TA) vs. visually grounded (VG) questions in Video-R1-260K. Right: representative TA examples from the training set where the answer can be inferred from question text alone.
Representative TA cases from VideoMME
Representative TA examples from VideoMME, confirming that the same four failure modes are present in evaluation data — not only in post-training corpora.
Model VideoMME VideoMMMU MMVU
Random Choice      25.0          9.8           19.8
GPT-4o             47.0 (+22.0)  38.6 (+28.8)  46.6 (+26.8)
GPT-5-mini         45.2 (+20.2)  37.9 (+28.1)  53.3 (+33.5)
GPT-5              48.2 (+23.2)  41.0 (+31.2)  57.1 (+37.3)
Gemini-2.5-Pro     53.3 (+28.3)  52.7 (+42.9)  60.6 (+40.8)
Gemini-3.1-Pro     58.2 (+33.2)  61.1 (+51.3)  63.4 (+43.6)
Claude-Sonnet-4.5  47.7 (+22.7)  44.3 (+34.5)  55.4 (+35.6)
Claude-Opus-4.6    51.3 (+26.3)  52.7 (+42.9)  61.0 (+41.2)
Table 1: Text-only answerability for frontier models across three video benchmarks. Numbers in parentheses indicate improvement over random choice. 40–60% of questions are answerable from text alone.

Method

We introduce a simple post-training recipe that combines reinforcement-learning-based post-training with a visually grounded (VG) data filter. Our RL backbone is GRPO with a token-level policy-gradient loss and asymmetric "clip-higher" clipping (the latter explored as an ablation in Table 4). RL-based post-training is known to improve underlying visual-recognition capabilities while exhibiting less catastrophic forgetting than supervised fine-tuning (SFT). Our contribution is orthogonal: we simply drop questions that are answerable from text alone, leaving only those that genuinely require visual reasoning.
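The asymmetric "clip-higher" objective can be sketched compactly. In the sketch below, advantages are assumed to be already group-normalized (as in GRPO), and the epsilon values are illustrative defaults, not the paper's hyperparameters:

```python
import numpy as np

def grpo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.28):
    """Token-level clipped policy-gradient loss with asymmetric
    'clip-higher' bounds: eps_high > eps_low widens only the upper
    clip range, so positive-advantage tokens can move further
    before being clipped."""
    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Pessimistic PPO-style objective, averaged over tokens.
    return -np.mean(np.minimum(unclipped, clipped))

# A token whose probability doubled: with a positive advantage the
# update is capped at 1 + eps_high rather than 1 + eps_low.
loss = grpo_token_loss(np.log(2.0) * np.ones(3), np.zeros(3), np.ones(3))
```

With symmetric clipping (eps_high = eps_low) this reduces to the standard PPO/GRPO surrogate; the ablation in Table 4 toggles exactly this asymmetry.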

Starting from Video-R1-260K, we partition the corpus into three variants based on text answerability:

Variant         Samples  TA Ratio  Description
Full            263,071  30.9%     Common post-training practice without filtering
TA              81,361   100%      Only text-only answerable questions
VG (VidGround)  181,710  0%        Only visually grounded questions
Table 2: Post-training data variants. Our VG filter keeps only questions that require visual reasoning to answer.
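Given a per-sample boolean TA flag from a text-only answerability probe, the three variants of Table 2 reduce to a simple partition. A minimal sketch with a toy corpus (the `ta` field is an assumed annotation):

```python
def partition_corpus(samples):
    """Split a post-training corpus into the Full / TA / VG variants,
    given a boolean `ta` flag per sample (the flag itself would come
    from a text-only answerability probe over the question text)."""
    return {
        "Full": list(samples),                      # no filtering
        "TA":   [s for s in samples if s["ta"]],    # text-only answerable
        "VG":   [s for s in samples if not s["ta"]] # visually grounded
    }

# Toy corpus: every third question is text-only answerable.
corpus = [{"id": i, "ta": i % 3 == 0} for i in range(9)]
splits = partition_corpus(corpus)
print({k: len(v) for k, v in splits.items()})  # {'Full': 9, 'TA': 3, 'VG': 6}
```

On Video-R1-260K this partition yields the 263,071 / 81,361 / 181,710 split above; only the VG subset is used for post-training.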

Experiments

We evaluate on three video understanding benchmarks — VideoMME, VideoMMMU, and MMVU — at 16, 32, and 64 frames per video. All methods (except LongVILA-R1) are post-trained from Qwen2.5-VL-7B. Averages are reported both on the full benchmarks (Full) and on the VG subsets (VG) containing only visually grounded questions.

Main Results

Frames Method VideoMME VideoMMMU MMVU Avg. (Full) Avg. (VG)
16  Qwen2.5-VL-7B      58.2  45.0  60.5  54.6         42.9
    TW-GRPO            58.2  48.6  61.8  56.2 (+1.6)  44.1 (+1.2)
    LongVILA-R1-7B     55.5  38.8  59.1  51.1 (−3.5)  39.7 (−3.2)
    Video-RTS          58.7  47.1  61.8  55.9 (+1.3)  43.5 (+0.6)
    Qwen2.5-VL-7B-SFT  58.2  43.1  51.3  50.9 (−3.7)  41.1 (−1.8)
    Video-R1           56.9  44.7  54.5  52.0 (−2.6)  41.7 (−1.2)
    VidGround          58.7  47.4  64.2  56.8 (+2.2)  45.2 (+2.3)
32  Qwen2.5-VL-7B      60.7  45.4  62.3  56.1         44.4
    TW-GRPO            61.2  47.9  63.1  57.4 (+1.3)  45.9 (+1.5)
    LongVILA-R1-7B     60.2  40.7  61.5  54.1 (−2.0)  42.9 (−1.5)
    Video-RTS          61.3  47.7  65.0  58.0 (+1.9)  46.3 (+1.9)
    Qwen2.5-VL-7B-SFT  60.7  47.8  51.0  53.2 (−2.9)  44.6 (+0.2)
    Video-R1           60.2  45.4  56.2  53.9 (−2.2)  43.1 (−1.3)
    VidGround          61.5  48.3  65.8  58.5 (+2.4)  47.6 (+3.2)
64  Qwen2.5-VL-7B      62.3  46.6  62.6  57.2         46.3
    TW-GRPO            62.7  48.3  64.2  58.4 (+1.2)  48.2 (+1.9)
    LongVILA-R1-7B     61.6  41.2  58.8  53.9 (−3.3)  42.0 (−4.3)
    Video-RTS          62.9  46.4  63.9  57.7 (+0.5)  46.4 (+0.1)
    Qwen2.5-VL-7B-SFT  62.2  47.6  55.4  55.1 (−2.1)  45.8 (−0.5)
    Video-R1           61.2  45.4  53.2  53.3 (−3.9)  42.9 (−3.4)
    VidGround          63.4  49.4  65.6  59.5 (+2.3)  47.9 (+1.6)
Table 3: 7B-scale post-training methods on three video benchmarks at 16/32/64 frames. Our VG-filter + GRPO recipe consistently achieves the best Full and VG averages across frame counts. Highlighted rows are ours.
Frame-count scaling: VG-trained models scale consistently with frames
Models trained on VG data scale steadily with frame count, while models trained on the Full variant show inconsistent gains — evidence that language shortcutting interferes with temporal grounding.

Ablation

Frames  Method        Data  VideoMME  VideoMMMU  MMVU  Avg. (Full)  Avg. (VG)
16      Base          —     58.2      45.0       60.5  54.6         42.9
        GRPO          Full  56.9      44.7       54.5  52.0 (−2.6)  41.7 (−1.2)
        GRPO          VG    58.7      47.4       64.2  56.8 (+2.2)  45.2 (+2.3)
        +clip-higher  VG    58.2      47.7       63.6  56.5 (+1.9)  45.1 (+2.2)
32      Base          —     60.7      45.4       62.3  56.1         44.4
        GRPO          Full  60.2      45.4       56.2  53.9 (−2.2)  43.1 (−1.3)
        GRPO          VG    61.5      48.3       65.8  58.5 (+2.4)  47.6 (+3.2)
        +clip-higher  VG    61.4      49.2       64.2  58.3 (+2.2)  46.8 (+2.4)
64      Base          —     62.3      46.6       62.6  57.2         46.3
        GRPO          Full  61.2      45.4       53.2  53.3 (−3.9)  42.9 (−3.4)
        GRPO          VG    63.4      49.4       65.6  59.5 (+2.3)  47.9 (+1.6)
        +clip-higher  VG    63.5      48.6       65.3  59.1 (+1.9)  48.5 (+2.2)
Table 4: Ablation on post-training data composition. GRPO on the VG subset consistently outperforms training on the Full variant despite using 31% less data.
Frames  Method                 VideoMME  VideoMMMU  MMVU  Avg. (Full)  Avg. (VG)
16      Qwen2.5-VL-7B          58.2      45.0       60.5  54.6         42.9
        Video-R1 (Full, 263K)  56.9      44.7       54.5  52.0 (−2.6)  41.7 (−1.2)
        VidGround (VG, 181K)   58.7      47.4       64.2  56.8 (+2.2)  45.2 (+2.3)
        VidGround-M1 (161K)    58.5      48.0       64.4  57.0 (+2.4)  45.9 (+3.0)
        VidGround-M2 (148K)    57.7      47.0       62.5  55.7 (+1.1)  43.8 (+0.9)
32      Qwen2.5-VL-7B          60.7      45.4       62.3  56.1         44.4
        Video-R1 (Full, 263K)  60.2      45.4       56.2  53.9 (−2.2)  43.1 (−1.3)
        VidGround (VG, 181K)   61.5      48.3       65.8  58.5 (+2.4)  47.6 (+3.2)
        VidGround-M1 (161K)    62.1      50.9       63.7  58.9 (+2.8)  47.5 (+3.1)
        VidGround-M2 (148K)    61.4      50.4       62.8  58.2 (+2.1)  46.6 (+2.2)
Table 5: Filter-variant robustness. Single-model curation (VidGround) vs. progressively stricter multi-model consensus curation (M1: ≥2 models agree; M2: a stricter agreement criterion). All variants improve over the Full baseline across benchmarks, and the stricter M1/M2 consensus gives small further gains on average; 16- and 32-frame settings are shown here.
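One way the M1/M2 consensus variants could be realized is as a threshold on how many text-only probes answer a question correctly. A minimal sketch under that assumption (the `n_probes_correct` field, the thresholds, and the toy data are all illustrative):

```python
def consensus_filter(samples, max_correct=0):
    """Keep a sample only if at most `max_correct` text-only probes
    answered it correctly. Lowering `max_correct` makes the TA
    criterion stricter, so fewer samples survive."""
    return [s for s in samples if s["n_probes_correct"] <= max_correct]

# Toy corpus: number of probe models that got each question right.
toy = [{"id": i, "n_probes_correct": c}
       for i, c in enumerate([0, 1, 2, 0, 3])]
lenient = consensus_filter(toy, max_correct=1)  # looser criterion
strict = consensus_filter(toy, max_correct=0)   # strictest criterion
print(len(lenient), len(strict))  # 3 2
```

The monotone shrinkage from looser to stricter thresholds mirrors the 181K → 161K → 148K progression across VidGround, M1, and M2.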

Qualitative Results

Qualitative example: VG-trained model references frame-level observations
Our VG-trained model demonstrates stronger visual grounding by explicitly referencing frame-level observations, while baselines rely on general knowledge or linguistic patterns.
Qualitative example 1
Additional qualitative example: the VG-trained model grounds its answer in concrete visual evidence observed across frames, where baselines fall back on language priors.
Qualitative example 3
Additional qualitative example: even on a question that looks linguistically "easy", the VG-trained model still appeals to the video, avoiding the shortcut other baselines take.

Conclusion

We identify the pervasive presence of linguistic shortcutting in both video understanding benchmarks and post-training datasets — some of the most popular benchmarks contain 40–60% of questions answerable from text alone. Filtering those questions out before post-training yields an exceedingly simple yet effective recipe that outperforms several more complex post-training approaches and provides notable training-data efficiency.

Our findings highlight the importance of curating post-training data and evaluation benchmarks that truly require visual grounding, offering a simple yet powerful direction for building more robust and visually grounded VLMs.

Citation

Cite as: Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen. Watch Before You Answer: Learning from Visually Grounded Post-Training. arXiv preprint arXiv:2604.05117, 2026.  arxiv.org/abs/2604.05117

@article{zhang2026watch,
  title   = {Watch Before You Answer: Learning from Visually Grounded Post-Training},
  author  = {Zhang, Yuxuan and Hwang, EunJeong and Zhang, Huaisong and
             Du, Penghui and Jia, Yiming and Jiang, Dongfu and He, Xuan and
             Zhang, Shenhui and Nie, Ping and West, Peter and Allen, Kelsey R.},
  journal = {arXiv preprint arXiv:2604.05117},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05117}
}