JavisDiT++: Open Source Model That Syncs Video and Audio

February 28, 2026

JavisDiT++ is an open-source joint audio-video generation model that creates synchronized video and audio from a single text prompt. Accepted at ICLR 2026, it is one of the first open-source systems to approach the synchronization quality of commercial models such as Google Veo 3. Both the code and the model weights are publicly available and cleared for commercial use.

What JavisDiT++ Does

Most open-source video generation tools produce silent clips. JavisDiT++ changes that: given a text description, the model outputs a video with fully synchronized audio. Sound effects, ambient noise, and scene-specific audio are generated alongside the visual frames rather than added in a separate pass.

The model builds on Wan2.1-1.3B, a text-to-video base model, and expands it to 2.1 billion parameters. It supports resolutions from 240P to 480P and generates clips between 2 and 5 seconds long. Despite the parameter increase, inference cost matches that of the base model, a result of the architectural choices described below.

Three Technical Innovations

The JavisDiT++ team introduces three components that work together to produce temporally aligned audio and video output.

Modality-Specific Mixture of Experts

The first is Modality-Specific Mixture of Experts (MS-MoE). This architecture expands the model from 1.3 billion to 2.1 billion parameters without adding inference overhead. Specialized expert networks refine audio and video independently, while shared self-attention layers enable dense cross-modal interaction. The result is stronger single-modality output quality and tighter coordination between the two streams.
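The shared-attention-plus-modality-experts idea can be sketched in a few lines of NumPy. This is a toy single-head block, not the paper's implementation; all names, shapes, and weight initializations here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

# One set of shared self-attention weights (single head, for illustration)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
# Modality-specific expert FFNs: each modality gets its own weights
experts = {"audio": rng.standard_normal((D, D)) * 0.1,
           "video": rng.standard_normal((D, D)) * 0.1}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def msmoe_block(audio, video):
    """Shared attention over the joint token sequence, then per-modality experts."""
    x = np.concatenate([audio, video], axis=0)      # joint sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    x = x + softmax(q @ k.T / np.sqrt(D)) @ v       # dense cross-modal attention
    na = len(audio)
    # Each token is routed to exactly one expert by modality, so per-token
    # FLOPs match a single dense FFN even though total parameters grew.
    out_audio = x[:na] + np.tanh(x[:na] @ experts["audio"])
    out_video = x[na:] + np.tanh(x[na:] @ experts["video"])
    return out_audio, out_video

audio = rng.standard_normal((4, D))   # 4 audio tokens
video = rng.standard_normal((6, D))   # 6 video tokens
out_a, out_v = msmoe_block(audio, video)
print(out_a.shape, out_v.shape)       # (4, 8) (6, 8)
```

The routing here is deterministic by modality, which is what lets the parameter count grow without the inference cost growing with it.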

Temporal Aligned Rotary Position Encoding

Standard Rotary Position Encoding assigns positions to tokens independently, which can cause drift between audio and video timing. Temporal Aligned RoPE (TA-RoPE) instead assigns a shared time index to both audio and video tokens at each frame, with the spatial dimensions offset to prevent token overlap. This produces precise frame-level synchronization between what is seen and what is heard.
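A minimal sketch of this position scheme, under stated assumptions (the offset value, token counts, and function names are hypothetical, and the rotation shown is plain RoPE driven by the shared time index rather than the paper's exact formulation):

```python
import numpy as np

def ta_rope_positions(n_frames, v_tokens_per_frame, a_tokens_per_frame,
                      spatial_offset=1000):
    """Audio and video tokens belonging to the same frame share one time
    index; an offset on the spatial axis keeps their full (time, spatial)
    ids from colliding."""
    video_pos, audio_pos = [], []
    for t in range(n_frames):
        video_pos += [(t, s) for s in range(v_tokens_per_frame)]
        audio_pos += [(t, s + spatial_offset) for s in range(a_tokens_per_frame)]
    return np.array(video_pos), np.array(audio_pos)

def rope_rotate(x, time_idx, base=10000.0):
    """Standard RoPE rotation using the shared time index, so tokens from
    the same frame receive the same temporal phase in attention."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(time_idx, freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

v_pos, a_pos = ta_rope_positions(n_frames=3, v_tokens_per_frame=4,
                                 a_tokens_per_frame=2)
q_video = rope_rotate(np.ones((len(v_pos), 8)), v_pos[:, 0])
print(v_pos[0][0] == a_pos[0][0])  # True: same frame shares a time index
```

Because both streams tick on the same temporal axis, attention scores between an audio token and a video token from the same frame behave as if the tokens were co-located in time, which is the alignment the section describes.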

Audio-Video Direct Preference Optimization

The third contribution is Audio-Video Direct Preference Optimization (AV-DPO), described by the authors as the first preference-alignment technique applied to joint audio-video generation. Using multiple reward models, the system ranks candidate outputs along three dimensions: visual quality, cross-modal consistency, and audio-video synchrony. This optimization stage closes the gap between raw model output and human preference without requiring additional labeled data.
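The mechanics can be sketched with the standard DPO objective plus a reward-aggregation step. The loss below is the textbook DPO formulation; the pairing function, reward stand-ins, and all numeric values are illustrative assumptions, not the paper's reward models:

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * implicit reward margin).
    Pushes the policy to prefer the chosen ("w") sample over the rejected
    ("l") one, relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pick_preference_pair(candidates, reward_models):
    """Sum hypothetical per-dimension rewards (visual quality, cross-modal
    consistency, AV synchrony) and return (chosen, rejected)."""
    scored = sorted(candidates, key=lambda c: sum(r(c) for r in reward_models))
    return scored[-1], scored[0]

# Toy usage: candidates are plain ids, rewards are stand-in scoring functions.
rewards = [lambda c: c * 0.5, lambda c: c * 0.3, lambda c: c * 0.2]
chosen, rejected = pick_preference_pair([1, 2, 3], rewards)
loss = dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_w=-1.5, ref_l=-1.5)
print(chosen, rejected, round(loss, 3))  # 3 1 0.644
```

Because the preference pairs are ranked by reward models over the model's own samples, no extra human-labeled data is needed, which matches the claim above.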

Output Samples

The following clips were generated by JavisDiT++ from text prompts alone. Each includes audio rendered in sync with the video.

Several pigeons are gathered on a rocky shore near the water. One pigeon splashes energetically, sending up ripples, while the others stand watching. The sound of splashing water, gurgling, and flapping wings fills the air.

A brown bear is walking towards the camera, growling in a natural setting with greenery in the background.

A young girl plays the piano.

A sports car races around a track bordered by grass and fences. The engine roars through the air.

A black-and-white image shows suited musicians on stage playing saxophones, trumpets, and a tuba, music filling the air before a curtain backdrop.

At night, a narrow alley is lined with traffic cones and a rope barrier, its wet ground reflecting street lamps. Worn walls show faded posters, and a bright light glows at the far end. The sound of rain falls over the empty scene.

A woman in a red dress walks barefoot across a sandy desert at sunset, her hair flowing as wind blows and faint footsteps sound in the sand.

Training Scale and Efficiency

JavisDiT++ achieves its results with approximately 1 million publicly available training samples, a modest dataset compared with the proprietary systems it competes against. Training proceeds in three stages: audio pretraining, audio-video supervised fine-tuning, and preference optimization via AV-DPO.
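The three-stage progression can be expressed as a small staging pipeline. Everything here is a hypothetical sketch: the stage names, data labels, and the starting-checkpoint string are placeholders, not values from the release:

```python
# Hypothetical staging config mirroring the three phases described above.
STAGES = [
    {"name": "audio_pretraining", "data": "audio-text pairs"},
    {"name": "av_sft",            "data": "paired audio-video-text"},
    {"name": "av_dpo",            "data": "ranked preference pairs"},
]

def run_pipeline(stages, train_fn):
    """Run stages in order, threading the checkpoint from one to the next."""
    ckpt = "wan2.1-1.3b-base"   # assumed starting checkpoint label
    for stage in stages:
        ckpt = train_fn(ckpt, stage)
    return ckpt

# Stub trainer that just records the lineage of checkpoints.
final = run_pipeline(STAGES, lambda ckpt, s: f'{ckpt}->{s["name"]}')
print(final)  # wan2.1-1.3b-base->audio_pretraining->av_sft->av_dpo
```

The point of the structure is that each stage consumes the previous stage's checkpoint, so the preference stage fine-tunes an already audio-video-capable model rather than training from scratch.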

The authors attribute much of the quality to the MS-MoE and AV-DPO components rather than raw data volume. Ablation studies in the paper confirm that removing either component causes measurable drops across quality and synchronization metrics.

Open Source and Commercial Use

JavisDiT++ is fully open source. The code is available on GitHub under the Apache-2.0 license, which permits commercial use with attribution. The model weights are hosted on Hugging Face under the MIT license, which likewise allows commercial modification, redistribution, and production deployment.

Neither license restricts commercial use beyond its attribution requirements. This places JavisDiT++ among the small group of audio-video generation models that researchers and studios can build on without licensing barriers. The paper is available at arXiv 2602.19163.

What This Changes for Filmmakers

Audio-video generation from a single prompt changes the prototyping workflow for short-form content. Instead of generating a silent video clip, sourcing reference audio, and syncing them in post, a filmmaker can produce a draft of a scene with audio in a single step. At 240P to 480P, the output is best suited for storyboarding and concept validation rather than final delivery, but the workflow compression is significant.

For complete short films, trailers, and social content, the models available through AI FILMS Studio offer a fuller pipeline at production quality. JavisDiT++ is a research baseline showing how much is possible with open training data and a focused architecture. For filmmakers already working with audio-synced video, see how HunyuanVideo-Foley approaches the video-to-audio direction, turning silent clips into synced Foley.

Sources

arXiv | GitHub (JavisVerse/JavisDiT) | Hugging Face (JavisVerse/JavisDiT-v1.0-jav) | ICLR 2026