Detailed Events
TimeChat-Captioner generates script-like, temporally dense, and structurally rich audio-visual narratives for multi-scene videos. At its core is a six-dimensional structural schema that describes video content as comprehensively as a professional film script.
Together, these six structural dimensions make the generated captions both temporally dense and description-dense.
Evaluating continuous captions faces the "boundary ambiguity" challenge: scene boundaries are subjective, and model predictions may be more fine-grained than the ground truth. We propose SodaM, a unified metric that addresses these challenges through a two-stage alignment strategy.
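The exact SodaM formulation is not reproduced in this summary. As a rough sketch of what a two-stage alignment could look like, the toy code below first matches each ground-truth segment to its best-IoU prediction (stage 1), then scores matched pairs by IoU-weighted caption quality (stage 2). The greedy matching and the token-overlap F1 scorer are assumptions, stand-ins for the real alignment and caption-quality components:

```python
def iou(a, b):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def caption_f1(pred, ref):
    """Toy caption-quality score: token-overlap F1 (stand-in for the real scorer)."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec) if overlap else 0.0

def two_stage_score(preds, refs):
    """Stage 1: greedily match each reference segment to its best-IoU prediction.
    Stage 2: average IoU-weighted caption quality over all references."""
    used, total = set(), 0.0
    for r_seg, r_cap in refs:
        best_j, best_iou = None, 0.0
        for j, (p_seg, _) in enumerate(preds):
            if j in used:
                continue
            v = iou(r_seg, p_seg)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            used.add(best_j)
            total += best_iou * caption_f1(preds[best_j][1], r_cap)
    return total / len(refs) if refs else 0.0
```

Because unmatched references contribute zero, over-segmented predictions are penalized without requiring exact boundary agreement, which is one way to soften the boundary-ambiguity problem.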
TimeChat-Captioner-7B is built on Qwen2.5-Omni with key architectural innovations for precise temporal localization and synchronous audio-visual perception.
Interleaved audio and visual tokens in a single sequence, enabling holistic audio-visual understanding without separate processing streams.
Multimodal Rotary Position Embedding that encodes precise absolute temporal information, crucial for generating accurate MM:SS timestamps in captions.
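Neither the exact token layout nor the M-RoPE parameterization is spelled out in this summary. The minimal sketch below illustrates the two ideas in simplified form: merging per-chunk audio and visual tokens into one time-ordered sequence, and deriving rotary angles from absolute timestamps (in seconds) rather than token indices, so the model can emit MM:SS timestamps. All function names and the dimension/base values are hypothetical:

```python
def interleave_by_time(visual_chunks, audio_chunks):
    """Merge (timestamp, tokens) chunks from both streams into a single
    time-ordered sequence of (timestamp, modality, token) triples."""
    tagged = [(t, "vis", tok) for t, toks in visual_chunks for tok in toks]
    tagged += [(t, "aud", tok) for t, toks in audio_chunks for tok in toks]
    return sorted(tagged, key=lambda x: x[0])

def rotary_angles(timestamp_s, dim=8, base=10000.0):
    """Rotary angles computed from the absolute timestamp (seconds),
    so positions encode real time instead of sequence index."""
    return [timestamp_s / (base ** (2 * i / dim)) for i in range(dim // 2)]

def mmss(t):
    """Format a time in seconds as the MM:SS timestamps used in captions."""
    return f"{int(t) // 60:02d}:{int(t) % 60:02d}"
```

For example, `interleave_by_time([(0.0, ["v0"]), (1.0, ["v1"])], [(0.5, ["a0"])])` yields the order `v0, a0, v1`, and `mmss(75)` returns `"01:15"`.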
Stage 1: Supervised Fine-Tuning on the TimeChatCap-42K dataset teaches the complex "script" format. Stage 2: Group Relative Policy Optimization (GRPO) jointly optimizes temporal accuracy and caption quality through a composite reward function combining multiple reward signals.
The architecture of TimeChat-Captioner-7B with synchronous audio-visual perception and M-RoPE for temporal localization.
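The Stage 2 composite reward is described only as combining temporal accuracy and caption quality, so the sketch below is an assumed instantiation: a weighted sum of mean segment IoU and an external caption-quality score, plus the group-relative advantage normalization that gives GRPO its name. The equal weights and the segment pairing are illustrative assumptions:

```python
def composite_reward(pred_segs, ref_segs, quality_score, w_time=0.5, w_quality=0.5):
    """Hypothetical composite reward: mean temporal IoU over paired segments,
    combined with a caption-quality score (weights are assumptions)."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    n = min(len(pred_segs), len(ref_segs))
    t = sum(iou(p, r) for p, r in zip(pred_segs, ref_segs)) / n if n else 0.0
    return w_time * t + w_quality * quality_score

def group_relative_advantages(rewards):
    """GRPO's group baseline: normalize each sampled response's reward by the
    mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

The group-relative baseline removes the need for a learned value model: advantages are computed purely from the rewards of responses sampled for the same video.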
We construct two complementary datasets to support the dense video captioning task: OmniDCBench as a high-quality human-annotated benchmark and TimeChatCap-42K as a large-scale training set.
OmniDCBench is entirely human-annotated by experts with cinematography knowledge, ensuring high-quality annotations that capture the nuances of professional video production.
TimeChatCap-42K is synthesized using Gemini-2.5-Pro via a three-stage pipeline: (1) Boundary Segmentation for establishing rough timestamps, (2) Detailed Caption Generation for expanding into the six-dimensional structural schema, and (3) Quality Filtering to ensure high-quality annotations.
Overview of the three-stage synthetic data construction pipeline for the TimeChatCap-42K training set.
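The paper's actual quality-filtering criteria are not listed in this summary. As a sketch of the kind of structural checks such a filter could apply to generated "MM:SS - MM:SS caption" lines, the code below verifies parseability, monotonic non-overlapping segments, and a minimum coverage of the video; the line format and the coverage threshold are assumptions:

```python
import re

LINE_RE = re.compile(r"^(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s+(.+)$")

def parse_script(text):
    """Parse 'MM:SS - MM:SS caption' lines into (start_s, end_s, caption)."""
    events = []
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            return None  # malformed line -> reject the whole sample
        s = int(m[1]) * 60 + int(m[2])
        e = int(m[3]) * 60 + int(m[4])
        events.append((s, e, m[5]))
    return events

def passes_filter(text, video_len_s, min_coverage=0.5):
    """Structural quality filter: well-formed, ordered, non-overlapping
    segments covering at least `min_coverage` of the video (assumed threshold)."""
    events = parse_script(text)
    if not events:
        return False
    prev_end, covered = 0, 0
    for s, e, _ in events:
        if s < prev_end or e <= s or e > video_len_s:
            return False
        covered += e - s
        prev_end = e
    return covered / video_len_s >= min_coverage
```

Rejecting any sample with a single malformed or overlapping segment trades dataset size for annotation consistency, which matters when the model must learn a strict script format.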
TimeChat-Captioner-7B achieves a SodaM score of 35.0, surpassing Gemini-2.5-Pro (33.7) and Gemini-2.5-Flash (30.0), establishing state-of-the-art performance. The model significantly outperforms previous open-source models like Qwen3-Omni (14.3) and MiniCPM-o-2.6 (5.4).
Main experimental results on OmniDCBench. TimeChat-Captioner-7B shows particular strength in Camera (12.4), Acoustics (38.2), and Dialogue (54.3) dimensions, which are traditionally difficult for MLLMs.
TimeChat-Captioner-7B demonstrates strong generalization to audio-visual reasoning tasks, achieving 52.8 on DailyOmni and 22.6 on WorldSense (best among open-source models). This shows that the fine-grained audio-visual understanding learned from dense video captioning transfers effectively to other multimodal tasks.
Performance on audio-visual reasoning benchmarks (DailyOmni and WorldSense).
On the temporal grounding task Charades-STA, TimeChat-Captioner-7B achieves an [email protected] of 48.3, outperforming specialized expert models like TimeSuite (43.0). This demonstrates that the precise temporal localization capabilities learned through M-RoPE and GRPO training generalize effectively to temporal grounding tasks.
Results on Charades-STA temporal grounding benchmark, showing superior performance compared to specialized models.
@article{yao2026timechat,
title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions},
author={Yao, Linli and Wei, Yuancheng and Zhang, Yaojie and Li, Lei and Chen, Xinlong and Song, Feifan and Wang, Ziyue and Ouyang, Kun and Liu, Yuanxin and Kong, Lingpeng and Liu, Qi and Wan, Pengfei and Gai, Kun and Zhang, Yuanxing and Sun, Xu},
journal={arXiv preprint arXiv:2602.08711},
year={2026}
}
Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of the respective datasets and models used in this work.
Related Projects: Qwen2.5-Omni, Gemini