Introduction
Video understanding is central to real-world applications such as autonomous driving, assistive robotics, and long-form content analysis. Despite recent advances in vision-language models (VLMs), performance on video understanding has lagged behind text-only reasoning, especially for tasks involving long-context video.
In this work, we show that community progress in video understanding for VLMs is overstated: a majority of apparent gains come from models' ability to answer questions without access to the video. This phenomenon, known as linguistic shortcutting, is well established as a serious problem in Visual Question Answering (VQA). Because a large portion of each benchmark can be answered without the video, these benchmarks are problematic for measuring genuine video understanding.
Prior work on VQA shortcutting has long documented that language priors can dominate multimodal models, but the phenomenon has received far less scrutiny in the video setting. Our study sits alongside a recent wave of RL-based video post-training efforts (Video-R1, LongVILA-R1, TW-GRPO, and Video-RTS), all of which train on the same noisy corpora and which we compare against as baselines.
Key Contributions
- Identifying linguistic shortcutting — pervasive in both video benchmarks (40–60% text-only answerable) and post-training datasets.
- Visually Grounded filtering — a drop-in post-training data filter that keeps only questions requiring visual reasoning.
- Data efficiency — using 69.1% of the original post-training data yields consistent gains of up to 6.2 points.
- Outperforming complex methods — a simple GRPO + VG filter beats several more advanced post-training strategies.
Text-Only Answerability Analysis
To quantify linguistic shortcutting, we probe frontier LLMs with only the question text and answer options, with no video input, across three popular video benchmarks. Accuracies far above random chance (parenthesized deltas in the table below) indicate that a large fraction of questions is solvable without any visual grounding.
We further categorize text-only answerable (TA) questions into four recurring failure modes: textual shortcuts (surface cues in the question wording), external knowledge (answerable from world priors), inferential elimination (implausible distractors), and imagined content (plausible hallucinated scenes). These four types account for nearly all TA cases we observed in Video-R1-260K and in the benchmarks.
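The probing protocol above can be sketched in a few lines. This is a minimal illustration, not our evaluation harness: `ask_text_only` is a hypothetical stand-in for a frontier-LLM API call that receives only the question and options and returns a chosen option.

```python
def random_chance(samples):
    """Expected accuracy of uniform guessing over each question's options."""
    return sum(1 / len(s["options"]) for s in samples) / len(samples)

def text_only_accuracy(samples, answer_fn):
    """Accuracy when the model sees only question text and options (no video)."""
    correct = sum(answer_fn(s["question"], s["options"]) == s["answer"]
                  for s in samples)
    return correct / len(samples)

# Hypothetical stub standing in for an LLM call; a real probe would send a
# prompt to a model API and parse out the selected option.
def ask_text_only(question, options):
    return options[0]

samples = [
    {"question": "What color is the car?",
     "options": ["red", "blue", "green", "gray"], "answer": "red"},
    {"question": "How many people appear?",
     "options": ["1", "2", "3", "4"], "answer": "2"},
]
acc = text_only_accuracy(samples, ask_text_only)   # fraction answered right
base = random_chance(samples)                      # chance baseline
```

The gap `acc - base` is the quantity reported as the parenthesized delta in the table below.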


| Model | VideoMME | VideoMMMU | MMVU |
|---|---|---|---|
| Random Choice | 25.0 | 9.8 | 19.8 |
| GPT-4o | 47.0 (+22.0) | 38.6 (+28.8) | 46.6 (+26.8) |
| GPT-5-mini | 45.2 (+20.2) | 37.9 (+28.1) | 53.3 (+33.5) |
| GPT-5 | 48.2 (+23.2) | 41.0 (+31.2) | 57.1 (+37.3) |
| Gemini-2.5-Pro | 53.3 (+28.3) | 52.7 (+42.9) | 60.6 (+40.8) |
| Gemini-3.1-Pro | 58.2 (+33.2) | 61.1 (+51.3) | 63.4 (+43.6) |
| Claude-Sonnet-4.5 | 47.7 (+22.7) | 44.3 (+34.5) | 55.4 (+35.6) |
| Claude-Opus-4.6 | 51.3 (+26.3) | 52.7 (+42.9) | 61.0 (+41.2) |
Method
We introduce a simple post-training recipe that combines reinforcement-learning-based post-training with a visually grounded (VG) data filter. Our RL backbone is GRPO with a token-level policy-gradient loss and asymmetric "clip-higher" clipping (the latter explored in the ablation below). RL-based post-training is known to improve underlying visual-recognition capabilities while exhibiting less catastrophic forgetting than supervised fine-tuning (SFT). Our contribution is orthogonal: we simply drop questions that are answerable from text alone, leaving only those that genuinely require visual reasoning.
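The two RL ingredients can be sketched as follows: group-relative advantages (rewards standardized within a group of rollouts for the same prompt) and a clipped token-level objective whose upper bound exceeds the lower one ("clip-higher"). The epsilon values are illustrative defaults, not the paper's exact hyperparameters.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.28):
    """Token-level clipped policy-gradient loss. eps_high > eps_low gives the
    asymmetric 'clip-higher' variant; eps_high == eps_low recovers standard
    PPO-style clipping. Epsilon values here are illustrative."""
    ratio = np.exp(logp_new - logp_old)               # per-token importance ratio
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) objective, negated so it is minimized.
    return -np.minimum(ratio * advantages, clipped * advantages).mean()

# Toy usage: two one-token rollouts with binary reward, old policy == new policy.
adv = group_advantages([1.0, 0.0])        # roughly [+1, -1]
logp = np.log(np.array([0.5, 0.4]))
loss = grpo_token_loss(logp, logp, adv)   # ratio = 1 everywhere, loss ~ 0
```

Raising only the upper clip bound lets low-probability tokens with positive advantage grow faster, which is the motivation usually given for clip-higher.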
Starting from Video-R1-260K, we partition the corpus into three variants based on text answerability:
| Variant | Samples | TA Ratio | Description |
|---|---|---|---|
| Full | 263,071 | 30.9% | Common post-training practice without filtering |
| TA | 81,361 | 100% | Only text-only answerable questions |
| VG (VidGround) | 181,710 | 0% | Only visually grounded questions |
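The VG filter itself is conceptually one function: drop every sample a text-only answerer can solve. A minimal sketch, where `answer_fn` stands in for a text-only LLM probe and `trials=4` is an illustrative choice (repeating the probe makes the filter conservative toward dropping):

```python
def vg_filter(samples, answer_fn, trials=4):
    """Keep only visually grounded (VG) questions: drop any sample that the
    text-only answerer gets right in any of `trials` attempts."""
    kept = []
    for s in samples:
        text_answerable = any(
            answer_fn(s["question"], s["options"]) == s["answer"]
            for _ in range(trials)
        )
        if not text_answerable:
            kept.append(s)
    return kept

# Hypothetical stub probe: always picks the first option.
first_option = lambda q, opts: opts[0]

corpus = [
    {"question": "Which option sounds most plausible?",
     "options": ["A", "B"], "answer": "A"},          # text-answerable, dropped
    {"question": "What does the chef add first?",
     "options": ["salt", "oil"], "answer": "oil"},   # needs the video, kept
]
vg = vg_filter(corpus, first_option)
```

Applied to Video-R1-260K with a real probe in place of the stub, this yields the VG variant in the table above.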
Experiments
We evaluate on three video understanding benchmarks — VideoMME, VideoMMMU, and MMVU — at 16, 32, and 64 frames per video. All methods (except LongVILA-R1) are post-trained from Qwen2.5-VL-7B. Averages are reported both on the full benchmarks (Full) and on the VG subsets (VG) containing only visually grounded questions.
Main Results
| Frames | Method | VideoMME | VideoMMMU | MMVU | Avg. (Full) | Avg. (VG) |
|---|---|---|---|---|---|---|
| 16 | Qwen2.5-VL-7B | 58.2 | 45.0 | 60.5 | 54.6 | 42.9 |
| | TW-GRPO | 58.2 | 48.6 | 61.8 | 56.2 (+1.6) | 44.1 (+1.2) |
| | LongVILA-R1-7B | 55.5 | 38.8 | 59.1 | 51.1 (−3.5) | 39.7 (−3.2) |
| | Video-RTS | 58.7 | 47.1 | 61.8 | 55.9 (+1.3) | 43.5 (+0.6) |
| | Qwen2.5-VL-7B-SFT | 58.2 | 43.1 | 51.3 | 50.9 (−3.7) | 41.1 (−1.8) |
| | Video-R1 | 56.9 | 44.7 | 54.5 | 52.0 (−2.6) | 41.7 (−1.2) |
| | VidGround | 58.7 | 47.4 | 64.2 | 56.8 (+2.2) | 45.2 (+2.3) |
| 32 | Qwen2.5-VL-7B | 60.7 | 45.4 | 62.3 | 56.1 | 44.4 |
| | TW-GRPO | 61.2 | 47.9 | 63.1 | 57.4 (+1.3) | 45.9 (+1.5) |
| | LongVILA-R1-7B | 60.2 | 40.7 | 61.5 | 54.1 (−2.0) | 42.9 (−1.5) |
| | Video-RTS | 61.3 | 47.7 | 65.0 | 58.0 (+1.9) | 46.3 (+1.9) |
| | Qwen2.5-VL-7B-SFT | 60.7 | 47.8 | 51.0 | 53.2 (−2.9) | 44.6 (+0.2) |
| | Video-R1 | 60.2 | 45.4 | 56.2 | 53.9 (−2.2) | 43.1 (−1.3) |
| | VidGround | 61.5 | 48.3 | 65.8 | 58.5 (+2.4) | 47.6 (+3.2) |
| 64 | Qwen2.5-VL-7B | 62.3 | 46.6 | 62.6 | 57.2 | 46.3 |
| | TW-GRPO | 62.7 | 48.3 | 64.2 | 58.4 (+1.2) | 48.2 (+1.9) |
| | LongVILA-R1-7B | 61.6 | 41.2 | 58.8 | 53.9 (−3.3) | 42.0 (−4.3) |
| | Video-RTS | 62.9 | 46.4 | 63.9 | 57.7 (+0.5) | 46.4 (+0.1) |
| | Qwen2.5-VL-7B-SFT | 62.2 | 47.6 | 55.4 | 55.1 (−2.1) | 45.8 (−0.5) |
| | Video-R1 | 61.2 | 45.4 | 53.2 | 53.3 (−3.9) | 42.9 (−3.4) |
| | VidGround | 63.4 | 49.4 | 65.6 | 59.5 (+2.3) | 47.9 (+1.6) |
Ablation
| Frames | Method | Data | VideoMME | VideoMMMU | MMVU | Avg. (Full) | Avg. (VG) |
|---|---|---|---|---|---|---|---|
| 16 | Base | — | 58.2 | 45.0 | 60.5 | 54.6 | 42.9 |
| | GRPO | Full | 56.9 | 44.7 | 54.5 | 52.0 (−2.6) | 41.7 (−1.2) |
| | GRPO | VG | 58.7 | 47.4 | 64.2 | 56.8 (+2.2) | 45.2 (+2.3) |
| | +clip-higher | VG | 58.2 | 47.7 | 63.6 | 56.5 (+1.9) | 45.1 (+2.2) |
| 32 | Base | — | 60.7 | 45.4 | 62.3 | 56.1 | 44.4 |
| | GRPO | Full | 60.2 | 45.4 | 56.2 | 53.9 (−2.2) | 43.1 (−1.3) |
| | GRPO | VG | 61.5 | 48.3 | 65.8 | 58.5 (+2.4) | 47.6 (+3.2) |
| | +clip-higher | VG | 61.4 | 49.2 | 64.2 | 58.3 (+2.2) | 46.8 (+2.4) |
| 64 | Base | — | 62.3 | 46.6 | 62.6 | 57.2 | 46.3 |
| | GRPO | Full | 61.2 | 45.4 | 53.2 | 53.3 (−3.9) | 42.9 (−3.4) |
| | GRPO | VG | 63.4 | 49.4 | 65.6 | 59.5 (+2.3) | 47.9 (+1.6) |
| | +clip-higher | VG | 63.5 | 48.6 | 65.3 | 59.1 (+1.9) | 48.5 (+2.2) |

| Frames | Method | VideoMME | VideoMMMU | MMVU | Avg. (Full) | Avg. (VG) |
|---|---|---|---|---|---|---|
| 16 | Qwen2.5-VL-7B | 58.2 | 45.0 | 60.5 | 54.6 | 42.9 |
| | Video-R1 (Full, 263K) | 56.9 | 44.7 | 54.5 | 52.0 (−2.6) | 41.7 (−1.2) |
| | VidGround (VG, 181K) | 58.7 | 47.4 | 64.2 | 56.8 (+2.2) | 45.2 (+2.3) |
| | VidGround-M1 (161K) | 58.5 | 48.0 | 64.4 | 57.0 (+2.4) | 45.9 (+3.0) |
| | VidGround-M2 (148K) | 57.7 | 47.0 | 62.5 | 55.7 (+1.1) | 43.8 (+0.9) |
| 32 | Qwen2.5-VL-7B | 60.7 | 45.4 | 62.3 | 56.1 | 44.4 |
| | Video-R1 (Full, 263K) | 60.2 | 45.4 | 56.2 | 53.9 (−2.2) | 43.1 (−1.3) |
| | VidGround (VG, 181K) | 61.5 | 48.3 | 65.8 | 58.5 (+2.4) | 47.6 (+3.2) |
| | VidGround-M1 (161K) | 62.1 | 50.9 | 63.7 | 58.9 (+2.8) | 47.5 (+3.1) |
| | VidGround-M2 (148K) | 61.4 | 50.4 | 62.8 | 58.2 (+2.1) | 46.6 (+2.2) |
Qualitative Results



Conclusion
We identify the pervasive presence of linguistic shortcutting in both video understanding benchmarks and post-training datasets: some of the most popular benchmarks contain 40–60% questions answerable from text alone. Filtering those questions out before post-training yields an exceedingly simple yet effective recipe that outperforms several more complex post-training approaches and provides notable training-data efficiency.
Our findings highlight the importance of curating post-training data and evaluation benchmarks that truly require visual grounding, offering a simple yet powerful direction for building more robust and visually grounded VLMs.