Yiduo Jia*1 · Muzhi Zhu*1 · Hao Zhong1 · Mingyu Liu1 · Yuling Xi1
Hao Chen†1 · Bin Qin2 · Yongjie Yang2 · Zhenbo Luo2 · Chunhua Shen†1
1 🎓 Zhejiang University 2 🏢 Xiaomi Inc.
* Equal contribution † Corresponding authors
A self-supervised RL post-training framework that enhances omni-modal reasoning through modality-orchestrated temporal reordering.
We propose OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task, which extends the reinforcement learning post-training paradigm to omni-modal models and jointly strengthens video-audio understanding and collaborative reasoning.
Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration (JMI), Sample-level Modality Selection (SMS), and Clip-level Modality Masking (CMM).
Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, facilitating the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning.
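The core proxy task can be sketched as follows. This is a minimal illustration, not the paper's implementation: `make_jigsaw_puzzle` shuffles chronologically ordered audio-visual clips, and `reorder_reward` scores a predicted ordering by positional accuracy; clip representations and the reward shape are assumptions for clarity.

```python
import random

def make_jigsaw_puzzle(clips, seed=None):
    """Build a temporal-reordering puzzle from ordered audio-visual clips.

    `clips` is a list of (video_segment, audio_segment) tuples in
    chronological order; the tuple representation is illustrative.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    # Ground truth: each shuffled position's original chronological index.
    return shuffled, order

def reorder_reward(predicted, answer):
    """Fraction of positions where the predicted order matches the truth."""
    assert len(predicted) == len(answer)
    correct = sum(p == a for p, a in zip(predicted, answer))
    return correct / len(answer)
```

The model receives the shuffled clips and must emit the chronological permutation; because the answer is derived from the shuffle itself, no manual annotation is needed.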
- 🧩 Self-Supervised Proxy Task: Pioneers jigsaw-based RL post-training in the omni-modal domain using temporal reordering of shuffled audio-visual clips—requiring zero manual annotation.
- 🎯 Modality Orchestration: Three strategies (JMI, SMS, CMM) that govern cross-modal information flow, investigating the bi-modal shortcut phenomenon and compelling deep multi-modal reasoning.
- 🛠️ Scalable Data Pipeline: A two-stage coarse-to-fine filtering pipeline (signal-based + semantic CoT screening) that transforms massive unannotated data into high-quality training puzzles.
- 📈 15 Benchmark Gains: CMM achieves +4.38 on MLVU-Test, +2.50 on MMAR, and +1.70 on OmniVideoBench over a strong Qwen3-Omni baseline. (Full quantitative tables are available on our Project Page)
- JMI (Baseline): Retains complete synchronized visual and acoustic information for all clips.
- SMS (Intermediate): Deploys the model as a global dominance analyzer to identify the primary modality per sample.
- CMM (Advanced & Best): Evaluates semantic density per clip and selectively masks the less salient modality, creating a cross-modal information bottleneck.
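A minimal sketch of the CMM idea, under the assumption that an external saliency estimator supplies per-clip semantic-density scores for each modality (the scoring mechanism and clip representation here are hypothetical, not the paper's exact procedure):

```python
def clip_level_modality_mask(clips, video_scores, audio_scores):
    """Clip-level Modality Masking (CMM), illustrative version.

    For each clip, keep only the modality judged more semantically dense
    and mask the other (replaced with None), creating an information
    bottleneck that forces cross-modal reasoning across clips.
    """
    masked = []
    for (video, audio), v, a in zip(clips, video_scores, audio_scores):
        if v >= a:
            masked.append((video, None))   # audio masked for this clip
        else:
            masked.append((None, audio))   # video masked for this clip
    return masked
```

Because the surviving modality varies clip by clip, no single modality carries enough signal to solve the reordering puzzle alone.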
High-quality puzzles are critical for our proxy task. We design a two-stage pipeline: signal-based heuristic filtering ensures omni-modal integrity, followed by semantic-based Chain-of-Thought (CoT) screening for narrative logic and state transitions.
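The two stages might be composed as below. The heuristic thresholds, field names, and the judge interface are illustrative assumptions, not the paper's concrete criteria:

```python
def signal_based_filter(sample):
    """Stage 1 (coarse): heuristic checks for omni-modal integrity.
    Field names and thresholds are illustrative assumptions."""
    return (
        sample.get("duration_s", 0) >= 10                  # long enough to segment
        and sample.get("has_audio", False)                 # audio track present
        and sample.get("audio_activity_ratio", 0) > 0.3    # audio not mostly silent
    )

def semantic_cot_filter(sample, judge):
    """Stage 2 (fine): Chain-of-Thought screening for narrative logic and
    state transitions. `judge` is any callable returning bool; a real
    system would prompt an LLM here."""
    return judge(sample)

def build_puzzle_pool(samples, judge):
    """Compose both stages: cheap signal filtering first, semantic last."""
    coarse = [s for s in samples if signal_based_filter(s)]
    return [s for s in coarse if semantic_cot_filter(s, judge)]
```

Running the cheap signal-based stage first keeps the expensive semantic screening off the bulk of the raw data.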
Our extensive analysis reveals several critical insights regarding cross-modal learning dynamics:
- Bi-Modal Shortcut Phenomenon: Under JMI, redundant audio-visual cues allow the model to rely on the dominant modality alone, bypassing deep cross-modal reasoning.
- Clip-level > Sample-level: CMM consistently outperforms SMS, aligning with the dynamic flow of audio-visual information and maximizing local information entropy.
- Data Quality is Critical: Training without the filtering pipeline leads to significant degradation.
- Discount Factor as Catalyst: The accuracy-dependent discount factor suppresses sub-optimal solutions, preventing premature convergence.
(Specific quantitative ablation results are available on our Project Page)
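One plausible form of an accuracy-dependent discount (an illustrative sketch, not the paper's exact formula): partially correct orderings earn their positional accuracy scaled down by a factor `gamma`, while only a fully correct reconstruction receives the undiscounted reward, so the policy is not satisfied by near-misses.

```python
def discounted_reorder_reward(predicted, answer, gamma=0.5):
    """Accuracy-dependent discounted reward (hypothetical form).

    Sub-optimal (partially correct) solutions are suppressed by `gamma`,
    keeping a reward gap between near-misses and the exact ordering and
    thus discouraging premature convergence.
    """
    n = len(answer)
    accuracy = sum(p == a for p, a in zip(predicted, answer)) / n
    if accuracy == 1.0:
        return 1.0          # full reward only for an exact reconstruction
    return gamma * accuracy  # discounted credit for partial orderings
```

With `gamma < 1`, improving from "mostly right" to "exactly right" always yields a strictly larger reward jump than the last incremental position fix alone.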
Figures (from top to bottom): Radar comparison of strategies, Sub-capability comparison, Task reward dynamics, and Optimization dynamics with/without discount factor.
CMM compels the model to jointly analyze visual and auditory cues by masking less salient modalities, while JMI exhibits a bi-modal shortcut by "solely relying on linguistic cues."
For more details, please visit our Project Page.
If you find OmniJigsaw useful for your research, please cite:
@misc{jia2026omnijigsawenhancingomnimodalreasoning,
title={OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering},
author={Yiduo Jia and Muzhi Zhu and Hao Zhong and Mingyu Liu and Yuling Xi and Hao Chen and Bin Qin and Yongjie Yang and Zhenbo Luo and Chunhua Shen},
year={2026},
eprint={2604.08209},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.08209},
}