SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models. [NeurIPS 2025]

🔥 Code for SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models.

🚀 Updates

[2025/12/12] We are preparing the data for release. Please stay tuned.
[2025/9/21] SAMA has been accepted to NeurIPS 2025🔥! See you in San Diego!😉

Citation

If you find SAMA useful for your work, please cite it using the following BibTeX entry 🙏🙏🙏:

@inproceedings{sun2025sama,
  title={SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models},
  author={Sun, Ye and Zhang, Hao and Ding, Henghui and Zhang, Tiehua and Ma, Xingjun and Jiang, Yu-Gang},
  booktitle={NeurIPS},
  year={2025}
}

Installation

Please install Python and PyTorch first:

> conda create -n vlm python=3.10
> conda activate vlm
> conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"

Install mmcv:

> pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html

Install other dependencies:

> pip install -r requirements.txt
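
To confirm that the environment matches the versions pinned above, a small check like the following can help. It is a minimal sketch, not part of the SAMA codebase: it uses only the Python standard library and assumes the installed distributions are registered under the names `torch`, `torchvision`, and `mmcv`.

```python
from importlib import metadata

# Versions pinned by the install commands above. The distribution names
# ("torch", "torchvision", "mmcv") are assumptions about how pip/conda
# register these packages.
EXPECTED = {"torch": "2.3.1", "torchvision": "0.18.1", "mmcv": "2.2.0"}


def check_versions(expected):
    """Return {package: status}: 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as "2.3.1+cu121" still count as a match.
        base = found.split("+")[0]
        report[name] = "ok" if base == wanted else f"found {found}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `vlm` environment; any line that does not end in `ok` points at a package to reinstall.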

Model Weights

Training Data Preparation

Data Preparation

Please first download the Sa2VA training datasets and place them in the data directory. The download link is here.
To support training on SAMA239K, download the LVVIS dataset and the RefYoutube-VOS dataset into the sama239k_data folder.
In the sama239k_data folder, create symbolic links to the mevis dataset and the sav_train dataset (sam_v_full). Both datasets are included in the Sa2VA training data.
The VidSTG dataset requires frame extraction. Please download the dataset first, then extract frames with our provided script /tools/vidstg_process.py.
Download our JSON files here and put them into the sama239k_data folder.
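
The symbolic-link step can be scripted. The sketch below is illustrative only and is not shipped with the repository: the helper name `link_dataset` is hypothetical, and the source locations (`data/video_datas/mevis` and `data/video_datas/sam_v_full`) are assumptions based on the directory layout shown in this README, so adjust them to wherever your Sa2VA data actually lives.

```python
import os
from pathlib import Path


def link_dataset(source, link):
    """Create a symlink at `link` pointing to `source`; skip if `link` exists."""
    source, link = Path(source), Path(link)
    if not source.exists():
        raise FileNotFoundError(f"source dataset not found: {source}")
    if link.is_symlink() or link.exists():
        return False  # leave existing data or links untouched
    link.parent.mkdir(parents=True, exist_ok=True)
    # Link to the absolute path so the link works from any working directory.
    os.symlink(source.resolve(), link, target_is_directory=source.is_dir())
    return True


if __name__ == "__main__":
    data_root = Path("data")
    # Assumed source locations inside the Sa2VA training data.
    pairs = {
        data_root / "video_datas" / "mevis": data_root / "sama239k_data" / "mevis",
        data_root / "video_datas" / "sam_v_full": data_root / "sama239k_data" / "sav_train",
    }
    for source, link in pairs.items():
        try:
            created = link_dataset(source, link)
            print(f"{link}: {'linked' if created else 'already present'}")
        except FileNotFoundError as error:
            print(f"skipped: {error}")
```

Linking avoids storing a second copy of the two datasets, which matters for a dataset the size of sam_v_full.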

The final data structure should look like this:

data/
├── sama239k_data
|   ├── mevis
|   |   └── train
|   ├── lvvis
|   |   └── train
|   ├── ref_youtube_vos
|   |   └── train
|   ├── sav_train
|   |   ├── sav_000
|   |   └── .....
|   └── VidSTG
|       └── train
|           └── 2399224635
|               ├── frame_0.jpg
|               ├── frame_4.jpg
|               └── .....
├── video_datas
|   ├── revos
|   ├── mevis
|   ├── davis17
|   ├── chat_univi
|   ├── sam_v_full # [!important] please download this from sam-2 directly.
|   └── Ref-SAV.json
├── ref_seg
|   ├── refclef
|   ├── refcoco
|   ├── refcoco+
|   └── refcocog
├── glamm_data
|   ├── images
|   └── annotations
├── osprey-724k
|   ├── Osprey-724K
|   └── coco
└── llava_data
    ├── llava_images
    ├── LLaVA-Instruct-150K
    └── LLaVA-Pretrain
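
Before launching training, it can be worth checking that the layout above is in place. The following is a minimal sketch, not a script from this repository: the list of required paths is copied from the tree, and the function name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
REQUIRED_DIRS = [
    "sama239k_data/mevis/train",
    "sama239k_data/lvvis/train",
    "sama239k_data/ref_youtube_vos/train",
    "sama239k_data/sav_train",
    "sama239k_data/VidSTG/train",
    "video_datas",
    "ref_seg",
    "glamm_data",
    "osprey-724k",
    "llava_data",
]


def missing_dirs(data_root, required=REQUIRED_DIRS):
    """Return the required sub-directories that do not exist under data_root."""
    root = Path(data_root)
    # is_dir() follows symbolic links, so linked datasets count as present.
    return [rel for rel in required if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("All expected directories are present.")
```

The check only confirms that the directories exist; it does not verify that their contents are complete.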
