TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao1,2, Yuancheng Wei3, Yaojie Zhang4, Lei Li5, Xinlong Chen6,2, Feifan Song1, Ziyue Wang1, Kun Ouyang1, Yuanxin Liu1, Lingpeng Kong5, Qi Liu5, Pengfei Wan2, Kun Gai2, Yuanxing Zhang2, Xu Sun1
1Peking University, 2Kling Team, Kuaishou Technology, 3South China University of Technology,
4University of Electronic Science and Technology of China, 5The University of Hong Kong, 6Institute of Automation, Chinese Academy of Sciences

Interactive Demo

(Interactive widget: six selectable video segments, 00:00-00:10, 00:11-00:20, 00:21-00:28, 00:29-00:36, 00:37-00:49, and 00:50-00:59, each revealing its detailed event annotations.)

Abstract

We introduce TimeChat-Captioner, a model that generates script-like, temporally dense, and structurally rich audio-visual narratives. Unlike traditional video captioning approaches that produce single-sentence summaries, TimeChat-Captioner provides temporally-dense scene segmentation with explicit MM:SS timestamps and description-dense structured captions spanning six dimensions: Detailed Events (narrating audiovisual content and actions), Visual Background (depicting setting, location, and atmosphere), Camera State (describing camera movements, angles, and framing), Shot Editing Style (analyzing post-production editing), Dialogue Content (transcription and summary with speaker ID), and Acoustics Content (portraying background sounds, music, and speaking tones).

To address the challenge of evaluating continuous captions with subjective boundary ambiguity, we propose SodaM, a unified metric that employs a two-stage alignment strategy combining IoU-based Dynamic Programming with many-to-one merging, and measures semantic completeness through LLM-based CheckList scoring.

We construct two complementary datasets: OmniDCBench (1,122 videos with human annotations, averaging 995 words per video) as a high-quality benchmark, and TimeChatCap-42K (42,000 synthesized video-caption pairs) for training. Our model, TimeChat-Captioner-7B, built on Qwen2.5-Omni with multimodal RoPE for precise temporal localization, employs a two-stage training strategy: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) that jointly optimizes temporal accuracy and caption quality through a composite reward function.

TimeChat-Captioner-7B achieves a SodaM score of 35.0, surpassing Gemini-2.5-Pro (33.7) and Gemini-2.5-Flash (30.0), while significantly outperforming previous open-source models. The model shows particular strength in Camera (12.4), Acoustics (38.2), and Dialogue (54.3) dimensions and generalizes effectively to downstream tasks, achieving 52.8 on DailyOmni, 22.6 on WorldSense, and 48.3 [email protected] on Charades-STA, outperforming specialized expert models.

Time-Aware and Structural Audio-Visual Captions

TimeChat-Captioner generates script-like, temporally dense, and structurally rich audio-visual narratives for multi-scene videos. At its core is a six-dimensional structural schema that describes video content comprehensively, like a professional film script.

The six structural dimensions include:

  • Detailed Events: narrating audiovisual content and actions
  • Visual Background: depicting setting, location, and atmosphere
  • Camera State: describing camera movements, angles, and framing
  • Shot Editing Style: analyzing post-production editing
  • Dialogue Content: transcription and summary with speaker ID
  • Acoustics Content: portraying background sounds, music, and speaking tones
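For concreteness, one captioned segment under this schema could be represented as a Python dict like the one below. All field names and values are hypothetical illustrations of the six dimensions, not actual model output:

```python
# A hypothetical single-scene entry following the six-dimensional schema.
# Keys and content are illustrative; the model's exact output format may differ.
scene = {
    "timestamp": "00:00-00:10",
    "detailed_events": "A chef plates a dish while narrating each step.",
    "visual_background": "A bright professional kitchen with steel counters.",
    "camera_state": "Slow push-in from a medium shot to a close-up.",
    "shot_editing_style": "Single continuous take, no cuts.",
    "dialogue_content": "Speaker 1 (chef): explains the sauce reduction.",
    "acoustics_content": "Soft jazz underscore; a pan sizzles in the background.",
}

assert len(scene) == 7  # one timestamp plus the six structural dimensions
```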

TimeChat-Captioner Task Overview

TimeChat-Captioner generates temporally-dense and description-dense captions with six structural dimensions.

SodaM: A Unified Evaluation Metric

Evaluating continuous captions faces the "boundary ambiguity" challenge: scene boundaries are subjective, and predictions may be more fine-grained than the ground truth. We propose SodaM, a unified metric that addresses both issues through a two-stage alignment strategy.

How SodaM Works

  1. IoU-based Dynamic Programming Alignment: Uses Dynamic Programming to find the optimal alignment path through the (M, N) scene grid based on temporal Intersection over Union (IoU), ensuring the best match between predicted and ground-truth scene boundaries.
  2. Many-to-One Merging: Gracefully handles cases where models generate finer-grained segments than human references by concatenating predicted captions for a single ground-truth match, avoiding unfair penalties for detailed predictions.
  3. CheckList Score for Semantic Completeness: Uses an LLM (Gemini-2.5-Flash) to judge if predicted captions cover atomic elements (keypoints) from the ground truth, providing a holistic measure of semantic coverage across all six dimensions.
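The two alignment stages can be sketched as follows. This is a simplified reconstruction under stated assumptions (segments as (start, end) pairs in seconds, a time-monotonic many-to-one mapping chosen to maximize total IoU), not the paper's implementation; the LLM-based CheckList scoring of step 3 is omitted:

```python
def iou(a, b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def align(pred, gt):
    """Stages 1-2 sketch: dynamic programming over the (M, N) scene grid.

    Each predicted segment is assigned to exactly one ground-truth segment;
    assignments are monotonic in time, so several consecutive predictions may
    share one ground-truth segment (many-to-one merging). Returns, for each
    prediction, the index of its matched ground-truth segment.
    """
    M, N = len(pred), len(gt)
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(M)]     # best total IoU so far
    back = [[None] * N for _ in range(M)]  # previous GT index on the best path
    for j in range(N):
        dp[0][j] = iou(pred[0], gt[j])
    for i in range(1, M):
        for j in range(N):
            for k in range(j + 1):  # the GT index must not decrease over time
                if dp[i - 1][k] != NEG:
                    cand = dp[i - 1][k] + iou(pred[i], gt[j])
                    if cand > dp[i][j]:
                        dp[i][j] = cand
                        back[i][j] = k
    # Backtrack to recover the many-to-one matching.
    j = max(range(N), key=lambda c: dp[M - 1][c])
    match = [0] * M
    for i in range(M - 1, -1, -1):
        match[i] = j
        if back[i][j] is not None:
            j = back[i][j]
    return match
```

For example, predictions [(0, 5), (5, 10), (10, 20)] against ground truth [(0, 10), (10, 20)] yield the matching [0, 0, 1]: the first two fine-grained predictions are merged into the first reference scene, so their captions can be concatenated before semantic scoring rather than penalized as spurious segments.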

TimeChat-Captioner-7B Architecture

TimeChat-Captioner-7B is built on Qwen2.5-Omni with key architectural innovations for precise temporal localization and synchronous audio-visual perception.

Key Architectural Features

  • Synchronous Perception:

    Interleaved audio and visual tokens in a single sequence, enabling holistic audio-visual understanding without separate processing streams.

  • Multimodal RoPE (M-RoPE):

    Multimodal Rotary Position Embedding that encodes precise absolute temporal information, crucial for generating accurate MM:SS timestamps in captions.

  • Two-Stage Training (SFT + GRPO):

    Stage 1: Supervised Fine-Tuning on the TimeChatCap-42K dataset teaches the complex "script" format. Stage 2: Group Relative Policy Optimization (GRPO) jointly optimizes temporal accuracy and caption quality through a composite reward function combining multiple reward signals.

Composite Reward Function

  • Format Reward: Binary reward for valid JSON-formatted output (1 if parseable, 0 otherwise).
  • Length Reward: Regularizes output length to prevent hallucinations and repetitive content.
  • Timestamp Reward: Average F1 score at IoU thresholds {0.3, 0.5, 0.7, 0.9} for temporal accuracy.
  • Time-aware Caption Reward: Uses the SodaM metric to measure semantic completeness and temporal alignment across all six dimensions.
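The timestamp reward and the composite combination can be sketched as below. The F1-at-multiple-IoU-thresholds definition follows the bullet above, but the greedy matching rule and the equal weights in composite_reward are illustrative assumptions; the paper's exact matching and weighting are not specified here:

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def timestamp_reward(pred, gt, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Average F1 over IoU thresholds, with greedy one-to-one matching."""
    f1s = []
    for t in thresholds:
        used, tp = set(), 0
        for p in pred:
            j_best, iou_best = None, 0.0
            for j, g in enumerate(gt):
                if j not in used and temporal_iou(p, g) > iou_best:
                    j_best, iou_best = j, temporal_iou(p, g)
            if j_best is not None and iou_best >= t:
                used.add(j_best)
                tp += 1
        prec = tp / len(pred) if pred else 0.0
        rec = tp / len(gt) if gt else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def composite_reward(fmt, length, ts, caption, w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four reward signals; the weights are placeholders."""
    return w[0] * fmt + w[1] * length + w[2] * ts + w[3] * caption
```

As a sanity check, a prediction (0, 6) against ground truth (0, 10) has IoU 0.6, so it counts as correct at thresholds 0.3 and 0.5 but not at 0.7 and 0.9, giving a timestamp reward of 0.5.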

TimeChat-Captioner-7B Architecture

The architecture of TimeChat-Captioner-7B with synchronous audio-visual perception and M-RoPE for temporal localization.

Dataset Construction

We construct two complementary datasets to support the dense video captioning task: OmniDCBench as a high-quality human-annotated benchmark and TimeChatCap-42K as a large-scale training set.

OmniDCBench: Human Benchmark

OmniDCBench is entirely human-annotated by experts with cinematography knowledge, ensuring high-quality annotations that capture the nuances of professional video production.

OmniDCBench Statistics:

  • Scale: 1,122 high-resolution video clips
  • Annotation Depth: Averaging 995 words per video
  • Expert Annotation: Entirely human-annotated by experts with cinematography knowledge
  • Coverage: All six structural dimensions (Detailed Events, Visual Background, Camera State, Shot Editing Style, Dialogue Content, Acoustics Content)

Statistics of human-annotated OmniDCBench.

TimeChatCap-42K: Training Set

TimeChatCap-42K is synthesized using Gemini-2.5-Pro via a three-stage pipeline: (1) Boundary Segmentation for establishing rough timestamps, (2) Detailed Caption Generation for expanding into the six-dimensional structural schema, and (3) Quality Filtering to ensure high-quality annotations.

TimeChatCap-42K Statistics:

  • Scale: 42,000 high-quality video-script pairs
  • Synthesis Pipeline: Three-stage process using Gemini-2.5-Pro (Boundary Segmentation → Detailed Caption Generation → Quality Filtering)
  • Quality: Structured captions following the six-dimensional schema
  • Purpose: Large-scale training for teaching the model the complex "script" format

Statistics of the training dataset TimeChatCap-42K.

Overview of the synthetic training data construction pipeline for the training dataset TimeChatCap-42K.

Experimental Results

Main Results on OmniDCBench

TimeChat-Captioner-7B achieves a SodaM score of 35.0, surpassing Gemini-2.5-Pro (33.7) and Gemini-2.5-Flash (30.0), establishing state-of-the-art performance. The model significantly outperforms previous open-source models like Qwen3-Omni (14.3) and MiniCPM-o-2.6 (5.4).

Main Results on OmniDCBench (Table 1)

Main experimental results on OmniDCBench. TimeChat-Captioner-7B shows particular strength in Camera (12.4), Acoustics (38.2), and Dialogue (54.3) dimensions, which are traditionally difficult for MLLMs.

Downstream Generalization: Audio-Visual Reasoning

TimeChat-Captioner-7B demonstrates strong generalization to audio-visual reasoning tasks, achieving 52.8 on DailyOmni and 22.6 on WorldSense (best among open-source models), showing that the fine-grained audio-visual understanding learned from dense video captioning transfers effectively to other multimodal tasks.

Downstream Generalization Results (Table 2)

Performance on audio-visual reasoning benchmarks (DailyOmni and WorldSense).

Temporal Grounding on Charades-STA

On the temporal grounding task Charades-STA, TimeChat-Captioner-7B achieves an [email protected] of 48.3, outperforming specialized expert models like TimeSuite (43.0). This demonstrates that the precise temporal localization capabilities learned through M-RoPE and GRPO training generalize effectively to temporal grounding tasks.
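[email protected] here follows the standard temporal-grounding definition (not code from the paper): the fraction of queries whose single top-ranked predicted span overlaps the ground-truth span with IoU of at least 0.5. A minimal sketch:

```python
def temporal_iou(a, b):
    """IoU of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, thr=0.5):
    """[email protected]=thr: fraction of queries whose top-1 predicted span
    reaches IoU >= thr against its ground-truth span."""
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(top1_preds, gts))
    return hits / len(gts)
```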

Temporal Grounding Results (Table 3)

Results on Charades-STA temporal grounding benchmark, showing superior performance compared to specialized models.

BibTeX

@article{yao2026timechat,
  title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions},
  author={Yao, Linli and Wei, Yuancheng and Zhang, Yaojie and Li, Lei and Chen, Xinlong and Song, Feifan and Wang, Ziyue and Ouyang, Kun and Liu, Yuanxin and Kong, Lingpeng and Liu, Qi and Wan, Pengfei and Gai, Kun and Zhang, Yuanxing and Sun, Xu},
  journal={arXiv preprint arXiv:2602.08711},
  year={2026}
}

Acknowledgement

Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of the respective datasets and models used in this work.

Related Projects: Qwen2.5-Omni, Gemini