# Watch Before You Answer: Learning from Visually Grounded Post-Training
Many "video" benchmarks and post-training datasets can be solved without watching the video. We show that 40–60% of questions in commonly used long-video benchmarks are answerable from text alone, and that the same bias pervades widely used post-training datasets. Filtering down to visually grounded questions only — what we call VidGround — improves RL post-training by up to +6.2 points while using only 69.1% of the original data.
## Highlights

- 📉 40–60% of questions in popular long-video benchmarks are answerable from text alone.
- 🧹 Filtering for visually grounded questions yields a smaller (69.1% of original) but cleaner training set.
- 🚀 Combined with vanilla RL post-training, VidGround improves accuracy by up to +6.2 points.
- 🏆 Simple data curation beats several more sophisticated post-training techniques.
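The core filtering idea is to ask a judge model each question *without* the video: if it consistently answers correctly from text alone, the question is not visually grounded and is dropped. The sketch below illustrates that decision rule; `query_text_only_judge`, the vote count, and the threshold are illustrative assumptions, not the exact recipe in `src/vidground/filter.py`.

```python
# Hedged sketch of text-only-solvability filtering. A question is kept
# (treated as visually grounded) only if a judge model that never sees the
# video usually gets it WRONG. `query_text_only_judge` is a hypothetical
# stand-in for a real LLM call.
from collections import Counter


def query_text_only_judge(question: str, options: list[str], seed: int) -> str:
    """Placeholder for an LLM call that sees only the text, never the video."""
    raise NotImplementedError  # a real implementation would call a judge model


def is_visually_grounded(question, options, answer, judge,
                         n_votes: int = 5, threshold: float = 0.6) -> bool:
    """Keep the question only if the text-only judge's accuracy is low."""
    votes = [judge(question, options, seed=i) for i in range(n_votes)]
    text_only_accuracy = Counter(votes)[answer] / n_votes
    # High text-only accuracy means the video is unnecessary -> drop.
    return text_only_accuracy < threshold
```

Sampling several votes with different seeds hedges against a single lucky or unlucky judge response; the threshold trades recall of grounded questions against contamination by text-solvable ones.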
## Repository Structure

```text
vidground/
├── README.md
├── LICENSE
├── pyproject.toml
├── citation.bib
├── data/                # dataset preparation instructions
│   └── README.md
├── eval/                # evaluation protocol notes
│   └── README.md
├── scripts/
│   ├── filter_visually_grounded.py
│   └── run_rl_posttrain.sh
└── src/vidground/
    ├── __init__.py
    ├── filter.py        # text-only-solvability filtering
    └── eval.py          # eval helpers
```
## Installation

```bash
git clone https://github.com/TODO/vidground.git
cd vidground
pip install -e .
```

See `data/README.md` for dataset download and formatting instructions.
## Usage

Filter a dataset down to its visually grounded questions:

```bash
python scripts/filter_visually_grounded.py \
    --input data/raw/your_dataset.jsonl \
    --output data/filtered/your_dataset.vidground.jsonl
```

Then run RL post-training on the filtered data:

```bash
bash scripts/run_rl_posttrain.sh
```

For the evaluation protocol, see `eval/README.md`.
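Internally, the filtering script is a single pass over a JSONL file: read each record, apply the groundedness predicate, and write the survivors. A minimal sketch of that pass, under an *assumed* record schema (the real fields in `filter_visually_grounded.py` may differ):

```python
# Minimal JSONL filtering pass: stream records from in_path to out_path,
# keeping only those for which keep(record) is True. The predicate is
# supplied by the caller (e.g. a visual-groundedness check).
import json


def filter_jsonl(in_path: str, out_path: str, keep) -> tuple[int, int]:
    """Return (kept, total) counts after writing the filtered file."""
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            total += 1
            if keep(record):
                fout.write(json.dumps(record) + "\n")
                kept += 1
    return kept, total
```

Streaming line by line keeps memory flat even for large post-training sets, and returning the counts makes it easy to report the retained fraction (69.1% in the paper's setting).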
## Results

| Model | Data | VideoMME | LongVideoBench | Notes |
|---|---|---|---|---|
| Baseline (full data) | 100% | — | — | TODO |
| + VidGround filter | 69.1% | +6.2 | TODO | TODO |
## Citation

If you find this work useful, please cite:

```bibtex
@article{zhang2025vidground,
  title   = {Watch Before You Answer: Learning from Visually Grounded Post-Training},
  author  = {Zhang, Yuxuan and Hwang, EunJeong and Zhang, Huaisong and Du, Penghui
             and Jia, Yiming and Jiang, Dongfu and He, Xuan and Zhang, Shenhui
             and Nie, Ping and West, Peter and Allen, Kelsey R.},
  journal = {arXiv preprint arXiv:2604.05117},
  year    = {2026}
}
```

## Acknowledgements

We thank our collaborators at UBC, Vector Institute, Etude AI, Kuaishou (Kolors Team), University of Toronto, University of Waterloo, and UIUC.
## License

Released under the MIT License.