DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Abstract
DreamID-Omni is a unified framework for controllable human-centric audio-video generation that uses a symmetric conditional diffusion transformer with dual-level disentanglement and multi-task progressive training to achieve state-of-the-art performance.
Recent advances in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level ensures rigid attention-space binding, and Structured Captions at the semantic level establish explicit attribute-subject mappings. Finally, we devise a Multi-Task Progressive Training scheme that leverages weakly constrained generative priors to regularize strongly constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance in video quality, audio quality, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
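The abstract does not include implementation details, but the signal-level binding idea can be illustrated. Below is a minimal, hypothetical sketch of how Synchronized RoPE might bind each speaker's identity tokens and voice-timbre tokens by assigning them identical rotary position blocks; the function names and the index layout are our assumptions for illustration, not the authors' released code.

```python
import torch

# Hypothetical sketch of signal-level Synchronized RoPE (one plausible reading
# of the paper's idea): each speaker's identity tokens and voice-timbre tokens
# share the SAME rotary position block, so attention treats them as one rigidly
# bound unit, while different speakers occupy disjoint blocks.

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard 1-D RoPE to features x of shape (seq, dim) at integer positions (seq,)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    ang = positions.float().unsqueeze(-1) * inv_freq   # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # rotate each 2-D feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def synchronized_positions(num_speakers: int, id_len: int, timbre_len: int, start: int = 0):
    """Assign each speaker one position block, reused by BOTH its identity and
    timbre tokens (hence "synchronized"), with disjoint blocks across speakers."""
    block = max(id_len, timbre_len)
    id_pos, timbre_pos = [], []
    for s in range(num_speakers):
        base_idx = start + s * block
        id_pos.append(torch.arange(base_idx, base_idx + id_len))
        timbre_pos.append(torch.arange(base_idx, base_idx + timbre_len))
    return torch.cat(id_pos), torch.cat(timbre_pos)

# Example: two speakers, 4 identity tokens and 4 timbre tokens each.
id_pos, tm_pos = synchronized_positions(num_speakers=2, id_len=4, timbre_len=4)
print(id_pos.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(tm_pos.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7] -- identical, hence "synchronized"
id_keys = rope_rotate(torch.randn(8, 64), id_pos)
tm_keys = rope_rotate(torch.randn(8, 64), tm_pos)
```

In this reading, a video token sees a speaker's identity token and timbre token at the same relative position, which is one way to interpret "rigid attention-space binding" at the signal level; the semantic-level Structured Captions would complement this by stating explicitly which attributes belong to which subject in the prompt.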
Community
We introduce DreamID-Omni, a unified framework for controllable human-centric audio-video generation.
Project page: https://guoxu1233.github.io/DreamID-Omni/
Code: https://github.com/Guoxu1233/DreamID-Omni
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning (2026)
- Apollo: Unified Multi-Task Audio-Video Joint Generation (2026)
- ALIVE: Animate Your World with Lifelike Audio-Video Generation (2026)
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion (2026)
- LTX-2: Efficient Joint Audio-Visual Foundation Model (2026)
- MOVA: Towards Scalable and Synchronized Video-Audio Generation (2026)
- Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars (2026)
Will you open-source the model?