Watch Before You Answer

Learning from Visually Grounded Post-Training

Yuxuan Zhang1,2,3 EunJeong Hwang1,2 Huaisong Zhang4 Penghui Du4 Yiming Jia5 Dongfu Jiang2,6 Xuan He7 Shenhui Zhang4 Ping Nie6 Peter West1 Kelsey R. Allen1,2
1University of British Columbia 2Vector Institute 3Etude AI 4Kolors Team, Kuaishou Technology 5University of Toronto 6University of Waterloo 7University of Illinois at Urbana-Champaign
Teaser: visual-grounding gains

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long-video understanding benchmarks contain 40–60% of questions that can be answered using text cues alone. These issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding.

Guided by this observation, we introduce a simple yet effective solution: using only the actual visually grounded (VG) questions for post-training. Combined with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original data. Data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs.

  • 40–60% TA questions in benchmarks
  • +6.2 points over Full data
  • 69.1% of original data used
  • 3 benchmarks, 16/32/64 frames

Introduction

Video understanding is central to real-world applications such as autonomous driving, assistive robotics, and long-form content analysis. Despite recent advances in VLMs, performance on video understanding has lagged behind text-only reasoning, especially for tasks involving long-context video.

Visual-gain breakdown for Qwen-2.5-VL variants
Even for leading VLMs (Qwen-2.5-VL), most performance on video benchmarks comes from language priors (pink) rather than visual comprehension. Gains as model size increases are driven almost entirely by text-only reasoning improvements, with visual grounding sometimes worsening in the larger variant.

In this work, we show that community progress in video understanding for VLMs is worse than initially thought: a majority of apparent gains come from models’ ability to answer questions without access to the video. This phenomenon, known as linguistic shortcutting, is well established as a serious problem in Visual Question Answering (VQA). Because such text-answerable questions make up a large portion of these benchmarks, the benchmarks are problematic for measuring genuine video understanding.

Prior work on VQA-shortcutting has long documented that language priors can dominate multimodal models, but the phenomenon has been less scrutinized in the video setting. Our study sits alongside a recent wave of RL-based video post-training efforts — Video-R1, LongVILA-R1, TW-GRPO, and Video-RTS — which all train on the same noisy corpora and which we compare against as baselines.

Key Contributions

  • Identifying linguistic shortcutting — pervasive in both video benchmarks (40–60% text-only answerable) and post-training datasets.
  • Visually Grounded filtering — a drop-in post-training data filter that keeps only questions requiring visual reasoning.
  • Data efficiency — using 69.1% of the original post-training data yields consistent gains of up to 6.2 points.
  • Outperforming complex methods — a simple GRPO + VG filter beats several more advanced post-training strategies.

Text-Only Answerability Analysis

To quantify linguistic shortcutting, we probe frontier LLMs with only the question text and answer options — no video input — across three popular video benchmarks. Accuracies far above random chance indicate a large fraction of questions are solvable without visual grounding at all.

We further categorize text-only answerable (TA) questions into four recurring failure modes: textual shortcuts (surface cues in the question wording), external knowledge (answerable from world priors), inferential elimination (implausible distractors), and imagined content (plausible hallucinated scenes). These four types account for nearly all TA cases we observed in Video-R1-260K and in the benchmarks.
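Concretely, a TA flag of this kind can be implemented as a small vote over text-only probes. Below is a minimal sketch; the model names, fields, toy data, and the two-probe threshold are illustrative assumptions, not the paper's exact protocol:

```python
def is_text_only_answerable(probe_answers, gold, min_correct=2):
    """Flag a question as text-only answerable (TA) if at least
    `min_correct` text-only probes (models shown the question and
    options but no video) pick the gold option."""
    correct = sum(1 for a in probe_answers.values() if a == gold)
    return correct >= min_correct

# Toy probe results for two questions (hypothetical data):
# each entry maps probe-model name -> predicted option letter.
probes = {
    "q1": ({"gpt": "B", "gemini": "B", "claude": "A"}, "B"),
    "q2": ({"gpt": "C", "gemini": "A", "claude": "D"}, "B"),
}
ta_flags = {qid: is_text_only_answerable(answers, gold)
            for qid, (answers, gold) in probes.items()}
print(ta_flags)  # q1 is TA (two probes correct), q2 is not
```

Accuracy far above random chance under such a probe, aggregated over a benchmark, gives the TA fractions reported in Table 1.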

Nested pie chart: TA composition across benchmarks
Representative text-only-answerable (TA) cases in training data
Left: breakdown of text-only answerable (TA) vs. visually grounded (VG) questions in Video-R1-260K. Right: representative TA examples from the training set where the answer can be inferred from question text alone.
Representative TA cases from VideoMME
Representative TA examples from VideoMME, confirming that the same four failure modes are present in evaluation data — not only in post-training corpora.
Model VideoMME VideoMMMU MMVU
Random Choice      25.0          9.8           19.8
GPT-4o             47.0 (+22.0)  38.6 (+28.8)  46.6 (+26.8)
GPT-5-mini         45.2 (+20.2)  37.9 (+28.1)  53.3 (+33.5)
GPT-5              48.2 (+23.2)  41.0 (+31.2)  57.1 (+37.3)
Gemini-2.5-Pro     53.3 (+28.3)  52.7 (+42.9)  60.6 (+40.8)
Gemini-3.1-Pro     58.2 (+33.2)  61.1 (+51.3)  63.4 (+43.6)
Claude-Sonnet-4.5  47.7 (+22.7)  44.3 (+34.5)  55.4 (+35.6)
Claude-Opus-4.6    51.3 (+26.3)  52.7 (+42.9)  61.0 (+41.2)
Table 1: Text-only answerability for frontier models across three video benchmarks. Numbers in parentheses indicate improvement over random choice. 40–60% of questions are answerable from text alone.

Method

We introduce a simple post-training recipe that combines reinforcement-learning-based post-training with a visually grounded (VG) data filter. Our RL backbone is GRPO with a token-level policy-gradient loss and asymmetric "clip-higher" clipping (the latter explored as an ablation in Table 4). RL-based post-training is known to improve underlying visual-recognition capabilities while exhibiting less catastrophic forgetting than supervised fine-tuning (SFT). Our contribution is orthogonal: we simply drop questions that are answerable from text alone, leaving only those that genuinely require visual reasoning.
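The asymmetric "clip-higher" objective can be sketched compactly. In the sketch below, advantages are assumed to be already group-normalized (as in GRPO), and the epsilon values are illustrative defaults, not the paper's hyperparameters:

```python
import numpy as np

def grpo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.28):
    """Token-level clipped policy-gradient loss with asymmetric
    'clip-higher' bounds: eps_high > eps_low widens only the upper
    clip range, so positive-advantage tokens can move further
    before being clipped."""
    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Pessimistic PPO-style objective, averaged over tokens.
    return -np.mean(np.minimum(unclipped, clipped))

# A token whose probability doubled: with a positive advantage the
# update is capped at 1 + eps_high rather than 1 + eps_low.
loss = grpo_token_loss(np.log(2.0) * np.ones(3), np.zeros(3), np.ones(3))
```

With symmetric clipping (eps_high = eps_low) this reduces to the standard PPO/GRPO surrogate; the ablation in Table 4 toggles exactly this asymmetry.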

Starting from Video-R1-260K, we partition the corpus into three variants based on text answerability:

Variant         Samples  TA Ratio  Description
Full            263,071  30.9%     Common post-training practice without filtering
TA              81,361   100%      Only text-only answerable questions
VG (VidGround)  181,710  0%        Only visually grounded questions
Table 2: Post-training data variants. Our VG filter keeps only questions that require visual reasoning to answer.
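Given a per-sample boolean TA flag from a text-only answerability probe, the three variants of Table 2 reduce to a simple partition. A minimal sketch with a toy corpus (the `ta` field is an assumed annotation):

```python
def partition_corpus(samples):
    """Split a post-training corpus into the Full / TA / VG variants,
    given a boolean `ta` flag per sample (the flag itself would come
    from a text-only answerability probe over the question text)."""
    return {
        "Full": list(samples),                      # no filtering
        "TA":   [s for s in samples if s["ta"]],    # text-only answerable
        "VG":   [s for s in samples if not s["ta"]] # visually grounded
    }

# Toy corpus: every third question is text-only answerable.
corpus = [{"id": i, "ta": i % 3 == 0} for i in range(9)]
splits = partition_corpus(corpus)
print({k: len(v) for k, v in splits.items()})  # {'Full': 9, 'TA': 3, 'VG': 6}
```

On Video-R1-260K this partition yields the 263,071 / 81,361 / 181,710 split above; only the VG subset is used for post-training.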

Experiments

We evaluate on three video understanding benchmarks — VideoMME, VideoMMMU, and MMVU — at 16, 32, and 64 frames per video. All methods (except LongVILA-R1) are post-trained from Qwen2.5-VL-7B. Averages are reported both on the full benchmarks (Full) and on the VG subsets (VG) containing only visually grounded questions.

Main Results

Frames Method VideoMME VideoMMMU MMVU Avg. (Full) Avg. (VG)
16  Qwen2.5-VL-7B      58.2  45.0  60.5  54.6         42.9
    TW-GRPO            58.2  48.6  61.8  56.2 (+1.6)  44.1 (+1.2)
    LongVILA-R1-7B     55.5  38.8  59.1  51.1 (−3.5)  39.7 (−3.2)
    Video-RTS          58.7  47.1  61.8  55.9 (+1.3)  43.5 (+0.6)
    Qwen2.5-VL-7B-SFT  58.2  43.1  51.3  50.9 (−3.7)  41.1 (−1.8)
    Video-R1           56.9  44.7  54.5  52.0 (−2.6)  41.7 (−1.2)
    VidGround          58.7  47.4  64.2  56.8 (+2.2)  45.2 (+2.3)
32  Qwen2.5-VL-7B      60.7  45.4  62.3  56.1         44.4
    TW-GRPO            61.2  47.9  63.1  57.4 (+1.3)  45.9 (+1.5)
    LongVILA-R1-7B     60.2  40.7  61.5  54.1 (−2.0)  42.9 (−1.5)
    Video-RTS          61.3  47.7  65.0  58.0 (+1.9)  46.3 (+1.9)
    Qwen2.5-VL-7B-SFT  60.7  47.8  51.0  53.2 (−2.9)  44.6 (+0.2)
    Video-R1           60.2  45.4  56.2  53.9 (−2.2)  43.1 (−1.3)
    VidGround          61.5  48.3  65.8  58.5 (+2.4)  47.6 (+3.2)
64  Qwen2.5-VL-7B      62.3  46.6  62.6  57.2         46.3
    TW-GRPO            62.7  48.3  64.2  58.4 (+1.2)  48.2 (+1.9)
    LongVILA-R1-7B     61.6  41.2  58.8  53.9 (−3.3)  42.0 (−4.3)
    Video-RTS          62.9  46.4  63.9  57.7 (+0.5)  46.4 (+0.1)
    Qwen2.5-VL-7B-SFT  62.2  47.6  55.4  55.1 (−2.1)  45.8 (−0.5)
    Video-R1           61.2  45.4  53.2  53.3 (−3.9)  42.9 (−3.4)
    VidGround          63.4  49.4  65.6  59.5 (+2.3)  47.9 (+1.6)
Table 3: 7B-scale post-training methods on three video benchmarks at 16/32/64 frames. Our VG-filter + GRPO recipe consistently achieves the best Full and VG averages across frame counts. Highlighted rows are ours.
Frame-count scaling: VG-trained models scale consistently with frames
Models trained on VG data scale steadily with frame count, while models trained on the Full variant show inconsistent gains — evidence that language shortcutting interferes with temporal grounding.

Ablation

Frames  Method        Data  VideoMME  VideoMMMU  MMVU  Avg. (Full)  Avg. (VG)
16      Base          —     58.2      45.0       60.5  54.6         42.9
        GRPO          Full  56.9      44.7       54.5  52.0 (−2.6)  41.7 (−1.2)
        GRPO          VG    58.7      47.4       64.2  56.8 (+2.2)  45.2 (+2.3)
        +clip-higher  VG    58.2      47.7       63.6  56.5 (+1.9)  45.1 (+2.2)
32      Base          —     60.7      45.4       62.3  56.1         44.4
        GRPO          Full  60.2      45.4       56.2  53.9 (−2.2)  43.1 (−1.3)
        GRPO          VG    61.5      48.3       65.8  58.5 (+2.4)  47.6 (+3.2)
        +clip-higher  VG    61.4      49.2       64.2  58.3 (+2.2)  46.8 (+2.4)
64      Base          —     62.3      46.6       62.6  57.2         46.3
        GRPO          Full  61.2      45.4       53.2  53.3 (−3.9)  42.9 (−3.4)
        GRPO          VG    63.4      49.4       65.6  59.5 (+2.3)  47.9 (+1.6)
        +clip-higher  VG    63.5      48.6       65.3  59.1 (+1.9)  48.5 (+2.2)
Table 4: Ablation on post-training data composition. GRPO on the VG subset consistently outperforms training on the Full variant despite using 31% less data.
Frames  Method                 VideoMME  VideoMMMU  MMVU  Avg. (Full)  Avg. (VG)
16      Qwen2.5-VL-7B          58.2      45.0       60.5  54.6         42.9
        Video-R1 (Full, 263K)  56.9      44.7       54.5  52.0 (−2.6)  41.7 (−1.2)
        VidGround (VG, 181K)   58.7      47.4       64.2  56.8 (+2.2)  45.2 (+2.3)
        VidGround-M1 (161K)    58.5      48.0       64.4  57.0 (+2.4)  45.9 (+3.0)
        VidGround-M2 (148K)    57.7      47.0       62.5  55.7 (+1.1)  43.8 (+0.9)
32      Qwen2.5-VL-7B          60.7      45.4       62.3  56.1         44.4
        Video-R1 (Full, 263K)  60.2      45.4       56.2  53.9 (−2.2)  43.1 (−1.3)
        VidGround (VG, 181K)   61.5      48.3       65.8  58.5 (+2.4)  47.6 (+3.2)
        VidGround-M1 (161K)    62.1      50.9       63.7  58.9 (+2.8)  47.5 (+3.1)
        VidGround-M2 (148K)    61.4      50.4       62.8  58.2 (+2.1)  46.6 (+2.2)
Table 5: Filter-variant robustness. Single-model curation (VidGround) vs. progressively stricter multi-model consensus curation (M1: ≥2 models agree; M2: a stricter agreement criterion). All variants improve over the Full baseline across benchmarks, and the stricter M1/M2 consensus gives small further gains on average; 16- and 32-frame settings are shown here.
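One way the M1/M2 consensus variants could be realized is as a threshold on how many text-only probes answer a question correctly. A minimal sketch under that assumption (the `n_probes_correct` field, the thresholds, and the toy data are all illustrative):

```python
def consensus_filter(samples, max_correct=0):
    """Keep a sample only if at most `max_correct` text-only probes
    answered it correctly. Lowering `max_correct` makes the TA
    criterion stricter, so fewer samples survive."""
    return [s for s in samples if s["n_probes_correct"] <= max_correct]

# Toy corpus: number of probe models that got each question right.
toy = [{"id": i, "n_probes_correct": c}
       for i, c in enumerate([0, 1, 2, 0, 3])]
lenient = consensus_filter(toy, max_correct=1)  # looser criterion
strict = consensus_filter(toy, max_correct=0)   # strictest criterion
print(len(lenient), len(strict))  # 3 2
```

The monotone shrinkage from looser to stricter thresholds mirrors the 181K → 161K → 148K progression across VidGround, M1, and M2.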

Qualitative Results

Qualitative example: VG-trained model references frame-level observations
Our VG-trained model demonstrates stronger visual grounding by explicitly referencing frame-level observations, while baselines rely on general knowledge or linguistic patterns.
Qualitative example 1
Additional qualitative example: the VG-trained model grounds its answer in concrete visual evidence observed across frames, where baselines fall back on language priors.
Qualitative example 3
Additional qualitative example: even on a question that looks linguistically "easy", the VG-trained model still appeals to the video, avoiding the shortcut other baselines take.

Conclusion

We identify the pervasive presence of linguistic shortcutting in both video understanding benchmarks and post-training datasets — some of the most popular benchmarks contain 40–60% of questions answerable from text alone. Filtering those questions out before post-training yields an exceedingly simple yet effective recipe that outperforms several more complex post-training approaches and provides notable training-data efficiency.

Our findings highlight the importance of curating post-training data and evaluation benchmarks that truly require visual grounding, offering a simple yet powerful direction for building more robust and visually grounded VLMs.

Citation

Cite as: Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen. Watch Before You Answer: Learning from Visually Grounded Post-Training. arXiv preprint arXiv:2604.05117, 2026.  arxiv.org/abs/2604.05117

@article{zhang2026watch,
  title   = {Watch Before You Answer: Learning from Visually Grounded Post-Training},
  author  = {Zhang, Yuxuan and Hwang, EunJeong and Zhang, Huaisong and
             Du, Penghui and Jia, Yiming and Jiang, Dongfu and He, Xuan and
             Zhang, Shenhui and Nie, Ping and West, Peter and Allen, Kelsey R.},
  journal = {arXiv preprint arXiv:2604.05117},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05117}
}