
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration


✨ Overview

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.

🎬 Captioning Case of AVoCaDO


An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.

🚀 Getting Started

Follow these simple steps to set up and run AVoCaDO on your machine.

1. Clone the repository

First, clone the project and navigate into the directory:

git clone https://github.com/AVoCaDO-Captioner/AVoCaDO.git
cd AVoCaDO

2. Set Up the Environment

Create and activate the Conda environment using the provided environment.yml file.

conda env create -f environment.yml
conda activate AVoCaDO

3. Quick Usage

Run inference on one of the bundled sample videos:

python inference.py assets/case_1.mp4
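To caption a batch of clips rather than a single file, a minimal wrapper script can invoke `inference.py` once per video. This is a sketch, assuming `inference.py` accepts a single video path as its only positional argument (as in the quick-usage command above); the helper names `collect_videos` and `caption_all` are illustrative, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

def collect_videos(directory, exts=(".mp4", ".mov", ".mkv")):
    """Return sorted video paths under `directory` with matching extensions."""
    root = Path(directory)
    return sorted(p for p in root.iterdir() if p.suffix.lower() in exts)

def caption_all(directory):
    """Run inference.py on each video in `directory`, stopping on any failure."""
    for video in collect_videos(directory):
        print(f"Captioning {video} ...")
        subprocess.run([sys.executable, "inference.py", str(video)], check=True)

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "assets"
    if Path(target).is_dir():  # only run when the directory actually exists
        caption_all(target)
```

Run it from the repository root, e.g. `python caption_dir.py assets`, after activating the Conda environment.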

📈 Benchmark Evaluation

We provide evaluation scripts for all benchmarks reported in our paper.

Direct Audiovisual Caption Evaluation

  1. video-SALMONN2-testset:

    bash eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh <your_save_directory>
  2. UGC-VideoCap:

    bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh <your_save_directory>

QA-based Audiovisual Caption Evaluation

  1. Daily-Omni:

    bash eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh <your_save_directory>
  2. WorldSense:

    bash eval_scripts/WorldSense/WorldSense_pipeline.sh <your_save_directory>

Visual-only Caption Evaluation

  1. VDC: First, generate captions for the videos in the VDC benchmark.

    python eval_scripts/VDC/generate_caption.py \
        --model_path <path_to_AVoCaDO> \
        --fout_path <your_save_path>

    Next, set up the judge server. This requires installing SGLang to deploy Llama-3.1-8B-Instruct as the judge model.

    # Deploy the judge model using SGLang
    python -m sglang.launch_server \
        --model-path path_to_Meta-Llama-3.1-8B-Instruct \
        --port 30000 \
        --dp 2 --tp 4  # adjust data/tensor parallelism to your GPU count

    Once the judge model is successfully deployed and running, you can start the evaluation.

    bash AVoCaDO/eval_scripts/VDC/evaluation.sh <your_save_path>
  2. DREAM-1K:

    bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh <your_save_directory>

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{chen2025avocado,
  title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2510.10395},
  year={2025}
}
