Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.
An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.

Follow these steps to set up and run AVoCaDO on your machine.
First, clone the project and navigate into the directory:
```bash
git clone https://github.com/AVoCaDO-Captioner/AVoCaDO.git
cd AVoCaDO
```

Create and activate the Conda environment using the provided `environment.yml` file:

```bash
conda env create -f environment.yml
conda activate AVoCaDO
```

Then run inference on the provided example video:

```bash
python inference.py assets/case_1.mp4
```

We provide evaluation scripts for all benchmarks evaluated in our paper.
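To caption many videos instead of one, the `inference.py` invocation above can be driven from a small wrapper. This is a sketch under one assumption: `inference.py` accepts a single video path as its only argument, exactly as in `python inference.py assets/case_1.mp4`; any other flags are not part of this sketch.

```python
# Sketch: batch-caption every .mp4 in a directory by invoking the repo's
# inference.py once per file. Assumes inference.py takes just a video path,
# as in the example command above.
import sys
from pathlib import Path

def build_commands(video_dir: str) -> list[list[str]]:
    """Return one `python inference.py <video>` command per .mp4 file."""
    videos = sorted(Path(video_dir).glob("*.mp4"))
    return [[sys.executable, "inference.py", str(v)] for v in videos]

if __name__ == "__main__":
    for cmd in build_commands("assets"):
        print(" ".join(cmd))
        # To actually run each captioning job:
        # import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first (a dry run) makes it easy to check the file list before committing GPU time.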
- video-SALMONN2-testset:

  ```bash
  bash eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh <your_save_directory>
  ```
- UGC-VideoCap:

  ```bash
  bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh <your_save_directory>
  ```
- Daily-Omni:

  ```bash
  bash eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh <your_save_directory>
  ```
- WorldSense:

  ```bash
  bash eval_scripts/WorldSense/WorldSense_pipeline.sh <your_save_directory>
  ```
- VDC: First, generate captions for the videos in the VDC benchmark.

  ```bash
  python eval_scripts/VDC/generate_caption.py \
      --model_path <path_to_AVoCaDO> \
      --fout_path <your_save_path>
  ```

  Next, set up the judge server. This requires installing SGLang to deploy Llama-3.1-8B as the judge model.

  ```bash
  # Deploy the judge model using SGLang
  python -m sglang.launch_server \
      --model-path path_to_Meta-Llama-3.1-8B-Instruct \
      --port 30000 \
      --dp 2 --tp 4
  ```

  Once the judge model is deployed and running, start the evaluation:

  ```bash
  bash AVoCaDO/eval_scripts/VDC/evaluation.sh <your_save_path>
  ```
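Before launching the full VDC evaluation, it can help to confirm the judge server actually answers. SGLang's `launch_server` exposes an OpenAI-compatible chat-completions endpoint, so a minimal request against the port used above should succeed. This is a sketch: the URL, model identifier, and prompt are assumptions for illustration, and the response shape follows the OpenAI chat-completions format.

```python
# Sketch: build and (optionally) send a minimal chat-completions request
# to the SGLang judge server started above on port 30000. The model name
# and prompt are placeholders, not values mandated by the eval scripts.
import json
from urllib import request

JUDGE_URL = "http://localhost:30000/v1/chat/completions"  # assumed endpoint

def build_judge_request(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload for the judge."""
    return {
        "model": "path_to_Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
    }

def query_judge(prompt: str) -> str:
    """POST the payload and return the judge's reply text."""
    payload = json.dumps(build_judge_request(prompt)).encode()
    req = request.Request(
        JUDGE_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Only call query_judge() once the SGLang server is up; here we just
    # show the payload that would be sent.
    print(json.dumps(build_judge_request("Score this caption."), indent=2))
```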
- DREAM-1K:

  ```bash
  bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh <your_save_directory>
  ```
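The single-command benchmarks above can also be chained in one driver script. This is a sketch: the script paths are copied from the per-benchmark commands, the default save directory `results` is an assumption, and VDC is deliberately omitted because it first needs the SGLang judge server.

```shell
#!/usr/bin/env bash
# Sketch: run every non-VDC evaluation script listed above against one
# save directory. Paths mirror the per-benchmark commands in this README.
set -euo pipefail

SAVE_DIR="${1:-results}"  # assumed default; pass your own directory

SCRIPTS=(
  eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh
  eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh
  eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh
  eval_scripts/WorldSense/WorldSense_pipeline.sh
  eval_scripts/DREAM-1K/eval_DREAM-1K.sh
)

for script in "${SCRIPTS[@]}"; do
  if [ -f "${script}" ]; then
    bash "${script}" "${SAVE_DIR}"
  else
    echo "skipping ${script}: not found in this checkout"
  fi
done
```

Run it from the repository root so the relative `eval_scripts/` paths resolve.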
If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!
@article{chen2025avocado,
title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
journal={arXiv preprint arXiv:2510.10395},
year={2025}
}