NAVA: Joint Audio-Video Generation from a Single Prompt

May 30, 2026

NAVA (Native Audio-Visual Alignment for Generation) is an open-source model from Baidu's ERNIE Team that generates synchronized 720p video and stereo audio from a single text prompt. Released on arXiv on May 28, 2026 under the Apache 2.0 license, it is the first model in this class to deliver joint audio and video output from one prompt without handling audio as a separate post-generation step.

The model has 6.3 billion parameters and completes inference in roughly one minute on an 8-GPU setup.

NAVA demo. Synchronized video and stereo audio from a single text prompt

What NAVA Does

You give NAVA a text prompt. It returns video and audio together, synchronized at the generation level. The audio is not added after the video is rendered. Both streams emerge from the same model pass.

The output supports dual-channel stereo, multi-speaker timbre control, and language-guided camera direction. Resolution targets 720p. The model handles a wide range of prompt types: nature scenes, music performances, dialogue clips, and abstract motion.

The Align-then-Fuse Architecture

NAVA is built on the Wan 2.2 backbone and uses an architecture the authors call Align-then-Fuse MMDiT (Multi-Modal Diffusion Transformer). Audio and video are processed in separate streams first, then merged at the diffusion transformer level before generation completes.

This is the opposite of the approach taken by models that generate video first and then synthesize matching audio in a post-processing pass. Because the two modalities are merged inside the diffusion transformer and generated together, events in the audio are tied to events in the video during generation rather than aligned retrospectively.

That structural difference helps explain NAVA's Verse-Bench Sync-C and Sync-D scores, the benchmark's audio-video synchronization and distance metrics. The paper reports new state-of-the-art results on both metrics, as well as on video quality and audio word error rate, with 2 to 5 times fewer parameters than open-source baselines at comparable quality.
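The separate-then-merged flow described in this section can be illustrated with a toy sketch. This is not NAVA's implementation: the token counts, widths, random weights, and the function names `align` and `fuse` are all hypothetical, and a real MMDiT block would add learned query/key/value projections, multiple heads, and timestep conditioning. The sketch only shows the general pattern of projecting each modality on its own and then running one joint attention pass over both.

```python
import math
import random

def random_matrix(rows, cols, rng):
    """Random projection weights, scaled so outputs keep roughly unit variance."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def align(tokens, weight):
    """'Align' step of this toy: project one modality's tokens to the shared width."""
    return [[sum(tok[i] * weight[i][j] for i in range(len(tok)))
             for j in range(len(weight[0]))]
            for tok in tokens]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(video, audio):
    """'Fuse' step of this toy: one joint self-attention over both token sets."""
    joint = video + audio  # concatenate along the sequence axis
    width = len(joint[0])
    attn = [softmax([sum(q[k] * key[k] for k in range(width)) / math.sqrt(width)
                     for key in joint])
            for q in joint]
    mixed = [[sum(w * tok[k] for w, tok in zip(row, joint)) for k in range(width)]
             for row in attn]
    return mixed[:len(video)], mixed[len(video):], attn

rng = random.Random(0)
# Hypothetical latents: 6 video tokens of width 16, 4 audio tokens of width 8.
video_tokens = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(6)]
audio_tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
SHARED_WIDTH = 12

video_aligned = align(video_tokens, random_matrix(16, SHARED_WIDTH, rng))
audio_aligned = align(audio_tokens, random_matrix(8, SHARED_WIDTH, rng))
video_out, audio_out, attn = fuse(video_aligned, audio_aligned)

# Share of each video token's attention that lands on audio tokens.
video_to_audio = [sum(row[len(video_aligned):]) for row in attn[:len(video_aligned)]]
print([round(x, 3) for x in video_to_audio])
```

Every video token places some attention on audio tokens in the same pass, which is the property a video-first, audio-later pipeline lacks.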

Output Examples

NAVA generation example. Audio and video from one prompt

NAVA generation example. Synchronized stereo output

NAVA generation example. 720p video with native stereo audio

Verse-Bench Results

NAVA sets a new state of the art on the Verse-Bench evaluation suite across four metrics: Sync-C (audio and video synchronization), Sync-D (audio and video distance), video quality, and audio word error rate. The ERNIE Team reports these results using 2 to 5 times fewer parameters than the open-source baselines they compare against.

The efficiency figure matters for production use. A 6.3B-parameter model that runs on 8 GPUs in roughly a minute is meaningfully closer to practical deployment than models requiring far larger compute budgets to reach comparable synchronization scores.
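To put the "2 to 5 times fewer parameters" figure in concrete terms, the arithmetic below works out the baseline sizes it implies and a rough weight footprint. The 6.3B count and the 2 to 5 range come from the article; the 2 bytes per parameter is an assumption (16-bit weights), and the footprint covers weights only, not activations or any auxiliary encoders.

```python
NAVA_PARAMS = 6_300_000_000   # 6.3B, as stated in the article
RATIO_RANGE = (2, 5)          # "2 to 5 times fewer parameters"
BYTES_PER_PARAM = 2           # assumption: weights stored in a 16-bit format

# Baseline sizes implied by the reported ratio.
baseline_low = NAVA_PARAMS * RATIO_RANGE[0]
baseline_high = NAVA_PARAMS * RATIO_RANGE[1]

# Approximate memory for the weights alone, in decimal gigabytes.
weights_gb = NAVA_PARAMS * BYTES_PER_PARAM / 1e9

print(f"Implied baseline size: {baseline_low / 1e9:.1f}B to {baseline_high / 1e9:.1f}B parameters")
print(f"NAVA weights at 16-bit: about {weights_gb:.1f} GB")
```

Under these assumptions the compared baselines fall at roughly 12.6B to 31.5B parameters, and NAVA's weights occupy about 12.6 GB.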

License and Access

NAVA is released under Apache 2.0, which permits commercial use. Weights are available on Hugging Face under ernie-research/NAVA. The paper is on arXiv at 2605.30073.
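For readers who want to fetch the weights programmatically, a minimal sketch using the `huggingface_hub` library's `snapshot_download` function follows. It assumes the weights are hosted as a standard model repository under the name given above; the helper name `download_nava_weights` and the local directory are hypothetical, and the article does not describe the repository layout or any inference script.

```python
REPO_ID = "ernie-research/NAVA"  # repository name as given in the article

def download_nava_weights(local_dir="./nava-weights"):
    """Download every file in the repository and return the local path.

    Hypothetical helper: assumes a standard Hugging Face model repository.
    """
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

# Example call (requires `pip install huggingface_hub` and network access):
# path = download_nava_weights()
```

Check the license file in the downloaded snapshot before commercial use, since the terms that apply are the ones shipped with the weights.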

The Wan 2.2 backbone that NAVA extends is already a well-documented architecture for character animation and replacement. NAVA adds native audio output on top of that foundation, opening production workflows where synchronized audio and video need to be generated together rather than assembled in post.

For text-to-video and image-to-video generation in AI FILMS Studio, explore the video workspace to try the latest models.

Sources

arXiv: NAVA: Native Audio-Visual Alignment for Generation (2605.30073)

Hugging Face: ernie-research/NAVA
