NAVA: Joint Audio-Video Generation from a Single Prompt
NAVA (Native Audio-Visual Alignment for Generation) is an open-source model from Baidu's ERNIE Team that generates synchronized 720p video and stereo audio from a single text prompt. The paper was released on arXiv on May 28, 2026, and the model is licensed under Apache 2.0. It is the first model in this class to deliver joint audio and video output from one prompt without handling audio as a separate post-generation step.
The model has 6.3 billion parameters and completes inference in roughly one minute on an 8-GPU setup.
NAVA demo. Synchronized video and stereo audio from a single text prompt
What NAVA Does
You give NAVA a text prompt. It returns video and audio together, synchronized at the generation level. The audio is not added after the video is rendered. Both streams emerge from the same model pass.
The output supports dual-channel stereo, multi-speaker timbre control, and language-guided camera direction. Output resolution is 720p. The model handles a wide range of prompt types: nature scenes, music performances, dialogue clips, and abstract motion.
The Align-then-Fuse Architecture
NAVA is built on the Wan 2.2 backbone and uses an architecture the authors call Align-then-Fuse MMDiT (Multi-Modal Diffusion Transformer). Audio and video are processed in separate streams first, then merged at the diffusion transformer level before generation completes.
This is the opposite of the approach taken by models that generate video first and then synthesize matching audio in a post-processing pass. Because both modalities share the same latent representation during generation, events in the audio are causally linked to events in the video frame by frame, not aligned retrospectively.
That structural difference explains NAVA's Verse-Bench Sync-C and Sync-D scores, which measure audio and video synchronization and distance. The paper reports new state-of-the-art (SOTA) results on both metrics, as well as on video quality and audio word error rate, using 2 to 5 times fewer parameters than open-source baselines of comparable quality.
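To make the align-then-fuse idea concrete, here is a minimal NumPy sketch of one common way to fuse two modality streams: each stream gets its own projection into a shared width, the tokens are concatenated, and a single self-attention pass runs over the combined sequence. This is an illustration under assumptions, not NAVA's actual code. The function name `joint_attention`, the token counts, the widths, and the random weights are all made up for the example, and the real model's layers are not described in this post.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(video_tokens, audio_tokens, w_video, w_audio):
    """Hypothetical fuse step: separate projections, then one shared attention pass."""
    # Separate streams: each modality has its own projection into a shared width.
    v = video_tokens @ w_video            # (n_video, d)
    a = audio_tokens @ w_audio            # (n_audio, d)

    # Fuse: concatenate along the token axis and attend over everything at once.
    x = np.concatenate([v, a], axis=0)    # (n_video + n_audio, d)
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = softmax(scores, axis=-1)    # each row sums to 1
    fused = weights @ x

    # Split the fused sequence back into per-modality outputs.
    n_video = v.shape[0]
    return fused[:n_video], fused[n_video:], weights


rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(6, 16))   # 6 video tokens of width 16 (made-up sizes)
audio_tokens = rng.normal(size=(4, 8))    # 4 audio tokens of width 8 (made-up sizes)
w_video = rng.normal(size=(16, 12))       # random stand-ins for learned weights
w_audio = rng.normal(size=(8, 12))

video_out, audio_out, weights = joint_attention(
    video_tokens, audio_tokens, w_video, w_audio
)
print(video_out.shape, audio_out.shape, weights.shape)  # (6, 12) (4, 12) (10, 10)
```

Because every audio token attends to every video token (and vice versa) in the same pass, both outputs are computed from shared context rather than matched up afterwards, which is the property this section attributes to joint generation.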
Output Examples
NAVA generation example. Audio and video from one prompt
NAVA generation example. Synchronized stereo output
NAVA generation example. 720p video with native stereo audio
Verse-Bench Results
NAVA sets a new state of the art on the Verse-Bench evaluation suite across four metrics: Sync-C (audio and video synchronization), Sync-D (audio and video distance), video quality, and audio word error rate. The ERNIE Team reports these results using 2 to 5 times fewer parameters than the open-source baselines they compare against.
The efficiency figure matters for production use. A 6.3B-parameter model that runs on 8 GPUs in roughly a minute is meaningfully closer to practical deployment than models that need far larger compute budgets to reach comparable synchronization scores.
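As a back-of-the-envelope check on those figures, the short Python sketch below turns them into GPU-seconds per clip and an approximate weight footprint. The 60-second wall-clock time stands in for "roughly one minute", and the 2 bytes per parameter assumes 16-bit weights. Neither number is stated exactly in this post, and the footprint ignores activations and any auxiliary encoders or decoders.

```python
PARAMS = 6.3e9              # parameter count reported for NAVA
NUM_GPUS = 8                # GPU count from the reported setup
WALL_CLOCK_SECONDS = 60.0   # assumption: "roughly one minute" taken as 60 s
BYTES_PER_PARAM = 2         # assumption: 16-bit (fp16/bf16) weights


def gpu_seconds(num_gpus: int, wall_clock_seconds: float) -> float:
    """Total GPU time consumed by one generation."""
    return num_gpus * wall_clock_seconds


def weight_footprint_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def clips_per_hour(wall_clock_seconds: float) -> float:
    """Clips one node could produce per hour if generations run back to back."""
    return 3600.0 / wall_clock_seconds


print(gpu_seconds(NUM_GPUS, WALL_CLOCK_SECONDS))       # 480.0
print(weight_footprint_gb(PARAMS, BYTES_PER_PARAM))    # 12.6
print(clips_per_hour(WALL_CLOCK_SECONDS))              # 60.0
```

Under these assumptions one clip costs about 480 GPU-seconds, and the weights alone would take about 12.6 GB if held in 16-bit precision.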
License and Access
NAVA is released under Apache 2.0, which permits commercial use. Weights are available on HuggingFace under ernie-research/NAVA. The paper is on arXiv at 2605.30073.
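For readers who want to pull the weights, a sketch like the following could work. It assumes the `huggingface_hub` package is installed and that the repository can be fetched with its `snapshot_download` function. The repository id and arXiv number come from this post, while the helper names and the URL patterns (the sites' usual `huggingface.co/<repo>` and `arxiv.org/abs/<id>` forms) are assumptions for illustration. How the downloaded checkpoint is loaded and run is not covered here.

```python
REPO_ID = "ernie-research/NAVA"   # HuggingFace repository named in this post
ARXIV_ID = "2605.30073"           # arXiv identifier named in this post


def model_page_url(repo_id: str = REPO_ID) -> str:
    """Assumed model page URL, following HuggingFace's usual pattern."""
    return f"https://huggingface.co/{repo_id}"


def paper_url(arxiv_id: str = ARXIV_ID) -> str:
    """Assumed abstract URL, following arXiv's usual pattern."""
    return f"https://arxiv.org/abs/{arxiv_id}"


def fetch_weights(repo_id: str = REPO_ID) -> str:
    """Download the repository snapshot and return its local path.

    The import is lazy so the URL helpers work without huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


print(model_page_url())
print(paper_url())
```

Calling `fetch_weights()` downloads every file in the repository into the local HuggingFace cache, so it is worth checking the repository size before running it.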
The Wan 2.2 backbone that NAVA extends is already a well-documented architecture for character animation and replacement. NAVA adds native audio output on top of that foundation, opening production workflows where synchronized audio and video need to be generated together rather than assembled in post.
Sources
arXiv: NAVA: Native Audio-Visual Alignment for Generation
HuggingFace: ernie-research/NAVA