🔥 Code for SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models.
- [2025/12/12] We are in the process of preparing the data. Please wait a moment.
- [2025/9/21] SAMA is accepted to NeurIPS 2025 🔥! See you in San Diego!
If you find SAMA useful for your work, please kindly cite using the BibTeX:
@inproceedings{sun2025sama,
title={SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models},
author={Sun, Ye and Zhang, Hao and Ding, Henghui and Zhang, Tiehua and Ma, Xingjun and Jiang, Yu-Gang},
booktitle={NeurIPS},
year={2025}
}
- Installation
- Model Weights
- Training Data preparation
- Training
- Evaluation & Benchmark
- Acknowledgments
Installation
- Please install Python and PyTorch first:
> conda create -n vlm python=3.10
> conda activate vlm
> conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
- Install mmcv:
> pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
- Install other dependencies:
> pip install -r requirements.txt
Data Preparation
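Before downloading any data, it can be worth confirming that the environment resolved to the versions pinned in the installation commands. The following is a minimal sketch, not part of the SAMA repository: the helper name `check_versions` is hypothetical, and it assumes the packages register under the distribution names `torch`, `torchvision`, and `mmcv`.

```python
from importlib import metadata

# Pins copied from the installation commands; the distribution names are an assumption.
EXPECTED = {"torch": "2.3.1", "torchvision": "0.18.1", "mmcv": "2.2.0"}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None, ok)} (hypothetical helper)."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # A local build tag such as "2.3.1+cu121" still counts as a match.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (want, have, ok)
    return report
```

Calling `check_versions()` inside the `vlm` environment returns one tuple per package; any entry whose last field is `False` points at a missing or mismatched install.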
- Please first download the Sa2VA training datasets and place them in the data directory. The download link is here.
- To support training on SAMA239K, please first download the LVVIS dataset and the RefYoutube-VOS dataset into the sama239k_data folder.
- Create symbolic links in the sama239k_data folder for the mevis dataset and the sav_train dataset (sam_v_full). These two datasets can be obtained from the Sa2VA training data.
- For the VidSTG dataset, frames must be extracted from the videos. Please download this dataset first and run frame extraction using our provided /tools/vidstg_process.py.
- Download our JSON files here and put them into the sama239k_data folder.
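The symbolic-link step can be scripted. The sketch below is an illustration under stated assumptions rather than a script shipped with the repository: the helper `link_sama239k_sources` is hypothetical, and it assumes the Sa2VA copies live at `data/video_datas/mevis` and `data/video_datas/sam_v_full`, with `sav_train` pointing at `sam_v_full` as the parenthetical in the list suggests. Adjust the mapping if your Sa2VA data is laid out differently.

```python
from pathlib import Path

# Assumed mapping: link name under sama239k_data -> source folder under video_datas.
LINKS = {"mevis": "mevis", "sav_train": "sam_v_full"}


def link_sama239k_sources(data_root, links=LINKS):
    """Symlink shared datasets into sama239k_data; return {link_name: created}."""
    data_root = Path(data_root).resolve()
    target_dir = data_root / "sama239k_data"
    target_dir.mkdir(parents=True, exist_ok=True)
    created = {}
    for link_name, source_name in links.items():
        source = data_root / "video_datas" / source_name
        link = target_dir / link_name
        if not source.is_dir():
            raise FileNotFoundError(f"missing source dataset: {source}")
        if link.is_symlink() or link.exists():
            created[link_name] = False  # never overwrite an existing entry
            continue
        link.symlink_to(source, target_is_directory=True)
        created[link_name] = True
    return created
```

Running `link_sama239k_sources("data")` from the repository root creates both links on the first call and leaves them untouched on later calls.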
The final data structure should look like this:
data/
├── sama239k_data
|   ├── mevis
|   |   └── train
|   ├── lvvis
|   |   └── train
|   ├── ref_youtube_vos
|   |   └── train
|   ├── sav_train
|   |   ├── sav_000
|   |   └── .....
|   └── VidSTG
|       └── train
|           └── 2399224635
|               ├── frame_0.jpg
|               ├── frame_4.jpg
|               └── .....
├── video_datas
|   ├── revos
|   ├── mevis
|   ├── davis17
|   ├── chat_univi
|   ├── sam_v_full # [!important] please download this from sam-2 directly.
|   └── Ref-SAV.json
├── ref_seg
|   ├── refclef
|   ├── refcoco
|   ├── refcoco+
|   └── refcocog
├── glamm_data
|   ├── images
|   └── annotations
├── osprey-724k
|   ├── Osprey-724K
|   └── coco
└── llava_data
    ├── llava_images
    ├── LLaVA-Instruct-150K
    └── LLaVA-Pretrain
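A quick way to catch a misplaced download before launching training is to compare the folder against the tree above. The following is a minimal sketch with a hypothetical helper, `missing_entries`. The path lists are transcribed from the tree, and the nesting of each entry is an assumption read off that listing.

```python
from pathlib import Path

# Directories transcribed from the tree above, relative to data/.
EXPECTED_DIRS = [
    "sama239k_data/mevis/train",
    "sama239k_data/lvvis/train",
    "sama239k_data/ref_youtube_vos/train",
    "sama239k_data/sav_train",
    "sama239k_data/VidSTG/train",
    "video_datas/revos",
    "video_datas/mevis",
    "video_datas/davis17",
    "video_datas/chat_univi",
    "video_datas/sam_v_full",
    "ref_seg/refclef",
    "ref_seg/refcoco",
    "ref_seg/refcoco+",
    "ref_seg/refcocog",
    "glamm_data/images",
    "glamm_data/annotations",
    "osprey-724k/Osprey-724K",
    "osprey-724k/coco",
    "llava_data/llava_images",
    "llava_data/LLaVA-Instruct-150K",
    "llava_data/LLaVA-Pretrain",
]
EXPECTED_FILES = ["video_datas/Ref-SAV.json"]


def missing_entries(data_root="data"):
    """Return the expected relative paths that are absent under data_root."""
    root = Path(data_root)
    # is_dir() follows symbolic links, so linked datasets count as present.
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

An empty list from `missing_entries("data")` means every listed folder and file was found. It does not verify that the contents of each dataset are complete.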