Official repository for "Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing" (SIGGRAPH 2026).
An overview of the Audio-Omni framework and its capabilities.
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored.
We introduce Audio-Omni, the first end-to-end framework to unify understanding, generation, and editing across general sound, music, and speech domains. Our architecture synergizes a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.
To overcome the critical data scarcity in audio editing, we construct a large-scale, high-quality dataset for audio editing tasks. Audio-Omni demonstrates remarkable emergent abilities inherited from the MLLM, enabling sophisticated audio manipulation through natural language instructions.
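At a glance, the design keeps the reasoning backbone frozen while only the synthesis head is trained. The sketch below is purely illustrative (the class and helper names are ours, not Audio-Omni's code) and just shows the frozen-MLLM / trainable-DiT parameter split described above:

```python
# Illustrative only: the frozen-backbone / trainable-head split described
# above. All names here are hypothetical, not Audio-Omni's actual API.

class Component:
    """A bundle of named parameters that is either frozen or trainable."""
    def __init__(self, name, params, trainable):
        self.name = name
        self.params = params      # e.g. {"attn": ..., "mlp": ...}
        self.trainable = trainable

def trainable_parameters(components):
    """Only trainable components contribute parameters to the optimizer."""
    return {
        f"{c.name}.{k}": v
        for c in components if c.trainable
        for k, v in c.params.items()
    }

# Frozen MLLM (reasoning) + trainable Diffusion Transformer (synthesis)
mllm = Component("qwen2.5-omni", {"embed": 0.1, "lm_head": 0.2}, trainable=False)
dit = Component("dit", {"attn": 0.3, "mlp": 0.4}, trainable=True)

opt_params = trainable_parameters([mllm, dit])
# Only the DiT's parameters would be updated during training.
```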
The Audio-Omni Framework.
- Python 3.11+
- CUDA-capable GPU
- FFmpeg and libsndfile
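Before installing, you can sanity-check that the required command-line tools are on your PATH. This small helper is ours, not part of the repository:

```python
# Quick pre-install check: report which required command-line tools are
# missing from PATH. Illustrative helper, not part of Audio-Omni.
import shutil

REQUIRED_TOOLS = ["ffmpeg", "git", "conda"]

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```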
git clone https://github.com/Audio-Omni/Audio-Omni.git
cd Audio-Omni
conda create -n audio-omni python=3.11 -y
conda activate audio-omni
pip install -e .
conda install -c conda-forge ffmpeg libsndfile

Additional packages (install separately if needed):
pip install flash-attn --no-build-isolation # optional, faster attention
pip install "qwen-omni-utils[decord]"  # Qwen2.5-Omni utilities

huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/

Or via Python:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="HKUSTAudio/Audio-Omni", local_dir="model/")

After downloading:
model/
├── Audio-Omni.json              # Model configuration
├── model.ckpt                   # Model checkpoint
└── synchformer_state_dict.pth   # Synchformer checkpoint (for V2A/V2M)
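Before launching, you can verify that the download produced the expected layout. The file names below come from the README; the helper itself is illustrative, not part of the repository:

```python
# Verify the downloaded checkpoint directory matches the layout above.
# Illustrative helper; only the file names come from the README.
from pathlib import Path

EXPECTED_FILES = [
    "Audio-Omni.json",             # model configuration
    "model.ckpt",                  # model checkpoint
    "synchformer_state_dict.pth",  # Synchformer checkpoint (for V2A/V2M)
]

def check_model_dir(model_dir):
    """Return the expected files that are missing from `model_dir`."""
    root = Path(model_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]

if __name__ == "__main__":
    missing = check_model_dir("model/")
    if missing:
        print("Missing files:", missing)
```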
bash infer_demo.sh
# or directly:
CUDA_VISIBLE_DEVICES=0 python3 run_gradio.py \
--model-config model/Audio-Omni.json \
--ckpt-path model/model.ckpt \
  --server-port 7777

The demo will be available at http://localhost:7777.
Audio-Omni supports understanding, generation, and editing in a single model:
| Task | Type | Text Prompt | Audio Input | Video Input | Voice Prompt |
|---|---|---|---|---|---|
| Understanding | Understanding | Question about the audio/video | Optional | Optional | ❌ |
| Text-to-Audio (T2A) | Generation | "A clock ticking." | ❌ | ❌ | ❌ |
| Text-to-Music (T2M) | Generation | "Compose a bright jazz swing instrumental..." | ❌ | ❌ | ❌ |
| Video-to-Audio (V2A) | Generation | ❌ | ❌ | example.mp4 | ❌ |
| Video-to-Music (V2M) | Generation | ❌ | ❌ | example.mp4 | ❌ |
| Text-to-Speech (TTS) | Generation | "Hello, welcome to Audio-Omni." | ❌ | ❌ | Optional |
| Voice Conversion (VC) | Generation | Transcript of target speech | ❌ | ❌ | ref_voice.wav |
| Add | Editing | ❌ | source.wav | ❌ | ❌ |
| Remove | Editing | ❌ | source.wav | ❌ | ❌ |
| Extract | Editing | ❌ | source.wav | ❌ | ❌ |
| Style Transfer | Editing | ❌ | source.wav | ❌ | ❌ |
from audio_omni import AudioOmni
import torchaudio
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")

Freely combine text, audio, and video inputs; omit any that are not needed:
# Audio understanding
response = model.understand("Describe the sounds in this audio.", audio="example/example.wav")
# Video understanding
response = model.understand("What is happening in this video?", video="example/example.mp4")
# Audio + Video
response = model.understand("Does the audio match the video?", audio="example/example.wav", video="example/example.mp4")
# Text-only
response = model.understand("What instruments are commonly used in jazz music?")

# Text-to-Audio
audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)
# Text-to-Music
audio = model.generate("T2M", prompt="Compose a bright jazz swing instrumental with walking bass, brushed drums, and a lively horn melody.")
torchaudio.save("output_music.wav", audio, model.sample_rate)
# Video-to-Audio / Video-to-Music
audio = model.generate("V2A", video_path="example/example.mp4")
audio = model.generate("V2M", video_path="example/example.mp4")

# Text-to-Speech
audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
torchaudio.save("tts_output.wav", audio, model.sample_rate)
# With voice cloning
audio = model.generate(
"TTS",
prompt="Hello, welcome to Audio-Omni.",
voice_prompt_path="ref_voice.wav",
voice_ref_text="This is the reference transcript.",
)

# Add a sound
audio = model.edit("Add", "example/edit/add/add.mp3", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)
# Remove a sound
audio = model.edit("Remove", "example/edit/remove/remove.mp3", desc="female singing")
torchaudio.save("output_remove.wav", audio, model.sample_rate)
# Extract a sound
audio = model.edit("Extract", "example/edit/extract/extract.mp3", desc="wood thrush calling")
torchaudio.save("output_extract.wav", audio, model.sample_rate)
# Style transfer
audio = model.edit("Style Transfer", "example/edit/transfer/example.mp3",
source_category="playing electric guitar", target_category="playing saxophone")
torchaudio.save("output_transfer.wav", audio, model.sample_rate)

Audio-Omni/
├── audio_omni/          # Main package
│   ├── api.py           # High-level Python API (AudioOmni class)
│   ├── prompts.py       # Prompt templates for all tasks
│   ├── models/          # Model implementations
│   ├── interface/       # Gradio UI
│   ├── inference/       # Generation & sampling
│   └── data/            # Data utilities
├── model/               # Model config & checkpoint
├── output/              # Generated outputs
├── docs/                # Documentation
└── README.md
If you find our work useful, please cite:
@article{tian2026audio,
title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and others},
journal={arXiv preprint arXiv:2604.10708},
year={2026}
}

If you have any comments or questions, feel free to contact:
- Zeyue Tian: [email protected]
The code repository is released under the CC BY-NC 4.0 License.
Note: Model weights are for research use only. Commercial use requires authorization from the authors.
We thank AudioX, VidMuse, MMAudio, F5-TTS, and stable-audio-tools for their valuable contributions.
⭐ Star us on GitHub if you like our project!

