
πŸŽ›οΈ Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

arXiv · Project Page · 🤗 Model · 🤗 Dataset · 🤗 Demo


Official repository for "Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing" (SIGGRAPH 2026).

✨ Teaser

An overview of the Audio-Omni framework and its capabilities.


✨ Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored.

We introduce Audio-Omni, the first end-to-end framework to unify understanding, generation, and editing across general sound, music, and speech domains. Our architecture synergizes a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.

To overcome the critical data scarcity in audio editing, we construct a large-scale, high-quality dataset for audio editing tasks. Audio-Omni demonstrates remarkable emergent abilities inherited from the MLLM, enabling sophisticated audio manipulation through natural language instructions.

✨ Method

The Audio-Omni Framework.


πŸ› οΈ Installation

Prerequisites

  • Python 3.11+
  • CUDA-capable GPU
  • FFmpeg and libsndfile

Install

git clone https://github.com/Audio-Omni/Audio-Omni.git
cd Audio-Omni

conda create -n audio-omni python=3.11 -y
conda activate audio-omni

pip install -e .
conda install -c conda-forge ffmpeg libsndfile

Additional packages (install separately if needed):

pip install flash-attn --no-build-isolation   # optional, faster attention
pip install "qwen-omni-utils[decord]"          # Qwen2.5-Omni utilities
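Before moving on, it can help to confirm the prerequisites are actually available. This is a minimal sketch using only the standard library; the tool names mirror the prerequisites listed above (it is not part of the Audio-Omni package):

```python
import shutil
import sys

def python_ok(min_version=(3, 11)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version

def missing_prereqs(tools=("ffmpeg",)):
    """Return the subset of `tools` not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    if not python_ok():
        print("Audio-Omni requires Python 3.11+")
    missing = missing_prereqs()
    if missing:
        print(f"Missing tools: {missing} -- install them before continuing.")
    else:
        print("All prerequisites found.")
```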

Download Model

huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/

Or via Python:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="HKUSTAudio/Audio-Omni", local_dir="model/")

After downloading:

model/
├── Audio-Omni.json              # Model configuration
├── model.ckpt                   # Model checkpoint
└── synchformer_state_dict.pth   # Synchformer checkpoint (for V2A/V2M)
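A quick sanity check that the download produced the layout above (an illustrative helper, not part of the package; the expected filenames are taken from the tree shown here):

```python
from pathlib import Path

# Filenames from the directory layout documented above.
EXPECTED = ("Audio-Omni.json", "model.ckpt", "synchformer_state_dict.pth")

def missing_files(model_dir, expected=EXPECTED):
    """Return the expected checkpoint files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```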

🤗 Gradio Demo

bash infer_demo.sh
# or directly:
CUDA_VISIBLE_DEVICES=0 python3 run_gradio.py \
    --model-config model/Audio-Omni.json \
    --ckpt-path model/model.ckpt \
    --server-port 7777

The demo will be available at http://localhost:7777.
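To script against the demo, you can first confirm the server is reachable. A small standard-library sketch (the URL assumes the default port 7777 used above):

```python
import urllib.error
import urllib.request

def demo_is_up(url="http://localhost:7777", timeout=2.0):
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False
```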


🎯 Supported Tasks

Audio-Omni supports understanding, generation, and editing in a single model:

| Task | Type | Text Prompt | Audio Input | Video Input | Voice Prompt |
|---|---|---|---|---|---|
| Understanding | Understanding | Question about the audio/video | Optional | Optional | — |
| Text-to-Audio (T2A) | Generation | "A clock ticking." | — | — | — |
| Text-to-Music (T2M) | Generation | "Compose a bright jazz swing instrumental..." | — | — | — |
| Video-to-Audio (V2A) | Generation | — | — | example.mp4 | — |
| Video-to-Music (V2M) | Generation | — | — | example.mp4 | — |
| Text-to-Speech (TTS) | Generation | "Hello, welcome to Audio-Omni." | — | — | Optional |
| Voice Conversion (VC) | Generation | Transcript of target speech | — | — | ref_voice.wav |
| Add | Editing | — | source.wav | — | — |
| Remove | Editing | — | source.wav | — | — |
| Extract | Editing | — | source.wav | — | — |
| Style Transfer | Editing | — | source.wav | — | — |
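The input requirements above can be checked up front before dispatching to the model. This is an illustrative guard transcribed from the table, not part of the shipped package (short task keys like `"T2A"` follow the abbreviations in the table):

```python
# Required / accepted inputs per task, transcribed from the table above.
TASK_INPUTS = {
    "Understanding":  {"required": {"prompt"}, "optional": {"audio", "video"}},
    "T2A":            {"required": {"prompt"}, "optional": set()},
    "T2M":            {"required": {"prompt"}, "optional": set()},
    "V2A":            {"required": {"video"},  "optional": set()},
    "V2M":            {"required": {"video"},  "optional": set()},
    "TTS":            {"required": {"prompt"}, "optional": {"voice_prompt"}},
    "VC":             {"required": {"prompt", "voice_prompt"}, "optional": set()},
    "Add":            {"required": {"audio"},  "optional": set()},
    "Remove":         {"required": {"audio"},  "optional": set()},
    "Extract":        {"required": {"audio"},  "optional": set()},
    "Style Transfer": {"required": {"audio"},  "optional": set()},
}

def validate_inputs(task, **inputs):
    """Raise ValueError if `inputs` do not satisfy the task's requirements."""
    spec = TASK_INPUTS[task]
    given = {k for k, v in inputs.items() if v is not None}
    missing = spec["required"] - given
    extra = given - spec["required"] - spec["optional"]
    if missing or extra:
        raise ValueError(
            f"{task}: missing={sorted(missing)}, unexpected={sorted(extra)}"
        )
```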

🖥️ Python API

Load Model

from audio_omni import AudioOmni
import torchaudio

model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")

Understanding

Freely combine text, audio, and video inputs — omit any that are not needed:

# Audio understanding
response = model.understand("Describe the sounds in this audio.", audio="example/example.wav")

# Video understanding
response = model.understand("What is happening in this video?", video="example/example.mp4")

# Audio + Video
response = model.understand("Does the audio match the video?", audio="example/example.wav", video="example/example.mp4")

# Text-only
response = model.understand("What instruments are commonly used in jazz music?")

Text-to-Audio

audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)

# Text-to-Music
audio = model.generate("T2M", prompt="Compose a bright jazz swing instrumental with walking bass, brushed drums, and a lively horn melody.")
torchaudio.save("output_music.wav", audio, model.sample_rate)

# Video-to-Audio / Video-to-Music
audio = model.generate("V2A", video_path="example/example.mp4")
torchaudio.save("output_v2a.wav", audio, model.sample_rate)

audio = model.generate("V2M", video_path="example/example.mp4")
torchaudio.save("output_v2m.wav", audio, model.sample_rate)

Text-to-Speech

audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
torchaudio.save("tts_output.wav", audio, model.sample_rate)

# With voice cloning
audio = model.generate(
    "TTS",
    prompt="Hello, welcome to Audio-Omni.",
    voice_prompt_path="ref_voice.wav",
    voice_ref_text="This is the reference transcript.",
)

Audio Editing

# Add a sound
audio = model.edit("Add", "example/edit/add/add.mp3", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)

# Remove a sound
audio = model.edit("Remove", "example/edit/remove/remove.mp3", desc="female singing")
torchaudio.save("output_remove.wav", audio, model.sample_rate)

# Extract a sound
audio = model.edit("Extract", "example/edit/extract/extract.mp3", desc="wood thrush calling")
torchaudio.save("output_extract.wav", audio, model.sample_rate)

# Style transfer
audio = model.edit("Style Transfer", "example/edit/transfer/example.mp3",
                   source_category="playing electric guitar", target_category="playing saxophone")
torchaudio.save("output_transfer.wav", audio, model.sample_rate)
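The editing calls above take different keyword arguments per operation: Add/Remove/Extract describe the target sound with `desc`, while Style Transfer names source and target categories. A small sketch that makes this split explicit (a hypothetical helper — the key names mirror the calls above but are not a packaged API):

```python
def build_edit_args(operation, source_path, desc=None,
                    source_category=None, target_category=None):
    """Assemble keyword arguments for an edit call, mirroring the examples above."""
    if operation in ("Add", "Remove", "Extract"):
        if desc is None:
            raise ValueError(f"{operation} requires a `desc` describing the sound")
        return {"operation": operation, "audio_path": source_path, "desc": desc}
    if operation == "Style Transfer":
        if source_category is None or target_category is None:
            raise ValueError("Style Transfer requires source and target categories")
        return {"operation": operation, "audio_path": source_path,
                "source_category": source_category,
                "target_category": target_category}
    raise ValueError(f"Unknown edit operation: {operation}")
```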

πŸ“ Project Structure

Audio-Omni/
├── audio_omni/                 # Main package
│   ├── api.py                  # High-level Python API (AudioOmni class)
│   ├── prompts.py              # Prompt templates for all tasks
│   ├── models/                 # Model implementations
│   ├── interface/              # Gradio UI
│   ├── inference/              # Generation & sampling
│   └── data/                   # Data utilities
├── model/                      # Model config & checkpoint
├── output/                     # Generated outputs
├── docs/                       # Documentation
└── README.md

πŸ“ Citation

If you find our work useful, please cite:

@article{tian2026audio,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and others},
  journal={arXiv preprint arXiv:2604.10708},
  year={2026}
}

📭 Contact

If you have any comments or questions, feel free to contact:


📄 License

The code repository is released under the CC BY-NC 4.0 License.

Note: Model weights are for research use only. Commercial use requires authorization from the authors.


πŸ™ Acknowledgments

We thank AudioX, VidMuse, MMAudio, F5-TTS, and stable-audio-tools for their valuable contributions.


⭐ Star us on GitHub if you like our project!
