Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.
An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.

Follow these steps to set up and run AVoCaDO on your machine.
First, clone the project and navigate into the directory:
```bash
git clone https://github.com/AVoCaDO-Captioner/AVoCaDO.git
cd AVoCaDO
```

Create and activate the Conda environment using the provided `environment.yml` file:

```bash
conda env create -f environment.yml
conda activate AVoCaDO
```

Then run inference on the provided example video:

```bash
python inference.py assets/case_1.mp4
```

We provide evaluation scripts for all benchmarks evaluated in our paper.
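To caption many videos instead of one, the `inference.py` invocation above can be driven from a small wrapper. This is a sketch under one assumption: `inference.py` accepts a single video path as its only argument, exactly as in `python inference.py assets/case_1.mp4`; any other flags are not part of this sketch.

```python
# Sketch: batch-caption every .mp4 in a directory by invoking the repo's
# inference.py once per file. Assumes inference.py takes just a video path,
# as in the example command above.
import sys
from pathlib import Path

def build_commands(video_dir: str) -> list[list[str]]:
    """Return one `python inference.py <video>` command per .mp4 file."""
    videos = sorted(Path(video_dir).glob("*.mp4"))
    return [[sys.executable, "inference.py", str(v)] for v in videos]

if __name__ == "__main__":
    for cmd in build_commands("assets"):
        print(" ".join(cmd))
        # To actually run each captioning job:
        # import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first (a dry run) makes it easy to check the file list before committing GPU time.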
- video-SALMONN2-testset:

  ```bash
  bash eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh <your_save_directory>
  ```
- UGC-VideoCap:

  ```bash
  bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh <your_save_directory>
  ```
- Daily-Omni:

  ```bash
  bash eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh <your_save_directory>
  ```
- WorldSense:

  ```bash
  bash eval_scripts/WorldSense/WorldSense_pipeline.sh <your_save_directory>
  ```
- VDC: First, generate captions for the videos in the VDC benchmark.

  ```bash
  python eval_scripts/VDC/generate_caption.py \
      --model_path <path_to_AVoCaDO> \
      --fout_path <your_save_path>
  ```

  Next, set up the judge server. This requires installing SGLang to deploy Llama-3.1-8B as the judge model.

  ```bash
  # Deploy the judge model using SGLang
  python -m sglang.launch_server \
      --model-path path_to_Meta-Llama-3.1-8B-Instruct \
      --port 30000 \
      --dp 2 --tp 4
  ```

  Once the judge model is deployed and running, start the evaluation:

  ```bash
  bash AVoCaDO/eval_scripts/VDC/evaluation.sh <your_save_path>
  ```
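Before launching the full VDC evaluation, it can help to confirm the judge server actually answers. SGLang's `launch_server` exposes an OpenAI-compatible chat-completions endpoint, so a minimal request against the port used above should succeed. This is a sketch: the URL, model identifier, and prompt are assumptions for illustration, and the response shape follows the OpenAI chat-completions format.

```python
# Sketch: build and (optionally) send a minimal chat-completions request
# to the SGLang judge server started above on port 30000. The model name
# and prompt are placeholders, not values mandated by the eval scripts.
import json
from urllib import request

JUDGE_URL = "http://localhost:30000/v1/chat/completions"  # assumed endpoint

def build_judge_request(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload for the judge."""
    return {
        "model": "path_to_Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
    }

def query_judge(prompt: str) -> str:
    """POST the payload and return the judge's reply text."""
    payload = json.dumps(build_judge_request(prompt)).encode()
    req = request.Request(
        JUDGE_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Only call query_judge() once the SGLang server is up; here we just
    # show the payload that would be sent.
    print(json.dumps(build_judge_request("Score this caption."), indent=2))
```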
- DREAM-1K:

  ```bash
  bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh <your_save_directory>
  ```
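The single-command benchmarks above can also be chained in one driver script. This is a sketch: the script paths are copied from the per-benchmark commands, the default save directory `results` is an assumption, and VDC is deliberately omitted because it first needs the SGLang judge server.

```shell
#!/usr/bin/env bash
# Sketch: run every non-VDC evaluation script listed above against one
# save directory. Paths mirror the per-benchmark commands in this README.
set -euo pipefail

SAVE_DIR="${1:-results}"  # assumed default; pass your own directory

SCRIPTS=(
  eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh
  eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh
  eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh
  eval_scripts/WorldSense/WorldSense_pipeline.sh
  eval_scripts/DREAM-1K/eval_DREAM-1K.sh
)

for script in "${SCRIPTS[@]}"; do
  if [ -f "${script}" ]; then
    bash "${script}" "${SAVE_DIR}"
  else
    echo "skipping ${script}: not found in this checkout"
  fi
done
```

Run it from the repository root so the relative `eval_scripts/` paths resolve.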
If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!
@article{chen2025avocado,
title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
journal={arXiv preprint arXiv:2510.10395},
year={2025}
}