Foley-Omni Generates Speech, Sound Effects and Music from Video in One Pass

June 9, 2026

Researchers at NJU-Speech published Foley-Omni on June 2, 2026. The 5.5 billion parameter model takes video footage and a text prompt, then returns a complete synchronized soundtrack covering dialogue, sound effects, and music in a single inference pass. The code and weights are available under the MIT license, permitting commercial use.

Foley-Omni project overview. Courtesy of NJU-Speech.

The Audio Post Production Problem

A short film typically requires three separate audio workflows: dialogue replacement sessions, a foley recording pass for sound effects, and a dedicated music scoring session. Each runs independently, and the outputs are layered in a DAW and aligned to picture afterward.

Foley-Omni replaces all three steps with a single model call per 10 seconds of video. The model reads the video frames alongside a structured text prompt, then returns one stereo audio file where all three modalities are already synchronized to the picture.

How the Model Works

The architecture uses a diffusion transformer (DiT) backbone derived from Wan2.2-TI2V-5B, with audio encoding components from MMAudio. Both are redistributed as part of Foley-Omni, and the combined model is trained on the joint task of generating all three audio modalities together rather than independently.

Foley-Omni featured demo: speech, sound effects, and music generated in a single pass. Courtesy of NJU-Speech.

Audio output is at 16 kHz. That sample rate works for reference tracks, scratch audio, and guide mixes during editorial. Final delivery for broadcast or theatrical requires upsampling through a dedicated audio post step.
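To make the upsampling step concrete, here is a minimal sketch of the arithmetic involved. The 16 kHz source rate comes from the text above; the 48 kHz and 44.1 kHz targets are assumptions about common delivery rates, and the function names are hypothetical rather than part of the Foley-Omni code.

```python
from math import gcd

SOURCE_RATE = 16_000  # output sample rate stated in the article


def resample_ratio(target_rate: int, source_rate: int = SOURCE_RATE) -> tuple[int, int]:
    """Return (up, down) such that source_rate * up / down == target_rate."""
    g = gcd(target_rate, source_rate)
    return target_rate // g, source_rate // g


def resampled_length(num_samples: int, target_rate: int,
                     source_rate: int = SOURCE_RATE) -> int:
    """Number of samples after resampling, rounded up."""
    up, down = resample_ratio(target_rate, source_rate)
    return -(-num_samples * up // down)  # ceiling division


clip_samples = 10 * SOURCE_RATE  # one 10 second clip at 16 kHz
print(resample_ratio(48_000))                  # integer 3x upsample
print(resample_ratio(44_100))                  # rational 441/160 resample
print(resampled_length(clip_samples, 48_000))  # samples in the 48 kHz version
```

The (up, down) pair is the form a polyphase resampler typically expects. Note that resampling only changes the sample rate: a 16 kHz signal contains no content above 8 kHz, and upsampling does not add any.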

The Structured Text Interface

Foley-Omni routes three audio modalities through a shared textual interface using field tags. [WORDS] handles spoken dialogue, [AUDIO] handles sound events and foley, and [MUSIC] handles musical accompaniment. All three tags map to a shared latent space inside the DiT backbone.

A filmmaker writes a single prompt covering the entire soundscape of a scene. A description of a character speaking, a door closing, and a piano cue in the background routes each element to the correct audio modality, and the model generates all three synchronized to the picture in one pass. No previous open source model achieves this under a commercially usable license.

The model also runs in single task mode. A scene requiring only sound effects uses the [AUDIO] tag alone, with the other fields left empty.
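As an illustration of the tagged prompt, the sketch below assembles the three fields into one string. The tag names [WORDS], [AUDIO], and [MUSIC] come from the text above; the one-field-per-line layout and the helper name build_prompt are assumptions, so check the repository's inference script for the format the model actually expects.

```python
def build_prompt(words: str = "", audio: str = "", music: str = "") -> str:
    """Join the three fields into a single tagged prompt string.

    The layout (one tag per line, empty fields left blank) is an
    assumption for illustration, not the documented Foley-Omni format.
    """
    fields = [("WORDS", words), ("AUDIO", audio), ("MUSIC", music)]
    return "\n".join(f"[{tag}] {text.strip()}".rstrip() for tag, text in fields)


# Full soundtrack: dialogue, a sound event, and a music cue in one prompt.
full = build_prompt(
    words="A woman says: we should leave now.",
    audio="a wooden door closes",
    music="soft solo piano in the background",
)

# Single task mode: only the [AUDIO] field is filled in.
sfx_only = build_prompt(audio="a wooden door closes")

print(full)
print(sfx_only)
```

Keeping all three tags present even when a field is empty mirrors the description of single task mode, where the unused fields are left empty rather than dropped.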

Sound Design Across Scene Types

The following pairs show Foley-Omni generating sound design across different scene types, from ambient environments to precise synchronous effects.

Sound effects and ambient audio generation. Courtesy of NJU-Speech.

Synchronous sound design for scene footage. Courtesy of NJU-Speech.

Music Generation

Foley-Omni reads both the visual content and the [MUSIC] field to time compositional elements to cuts and motion within the clip.

Music generation synchronized to video footage. Courtesy of NJU-Speech.

For comparison, Magenta RealTime 2, released the same week, takes a different approach. It runs inside a DAW as an Audio Unit plugin, generating live music in response to MIDI and text prompts with under 200 milliseconds of latency. Foley-Omni generates music locked to existing footage rather than generating it interactively. Both carry commercially usable licenses.

The V2ST Benchmark

The paper introduces V2ST-Bench, a new evaluation dataset for the video to soundtrack task. No standardized benchmark existed before this release for measuring how well a model generates all three audio modalities simultaneously from video.

The benchmark results show that joint training across speech, sound effects, and music improves performance on each individual subtask compared to models trained on a single modality. The shared latent representation allows information from one audio type to improve generation quality in the others.

Video to audio synthesis from the V2ST benchmark evaluation set. Courtesy of NJU-Speech.

Second video to audio benchmark example showing synchronized generation. Courtesy of NJU-Speech.

Getting Started

The model requires Python 3.10, CUDA 12.4, PyTorch 2.6.0, and FlashAttention 2.7.4. It runs on a GPU with enough VRAM to hold a 5.5B parameter model, consistent with other models in that size range.
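For a rough sense of "enough VRAM", the sketch below estimates the memory taken by the weights alone. The 5.5B parameter count comes from the article; the precision the released checkpoint runs at is not stated here, so the dtypes listed are assumptions, and the helper name is hypothetical.

```python
PARAMS = 5.5e9  # parameter count stated in the article

# Bytes per parameter for common precisions (assumed, not confirmed
# for the released checkpoint).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.1f} GiB")
```

This is a lower bound. Activations, the video and text encoders, and any decoder used at inference time add to the total, so the practical requirement is higher than the figure for weights alone.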

Weights are available on HuggingFace at CocoBro/Foley-Omni. The GitHub repository at NJU-Speech/Foley-Omni includes an inference script and full environment setup. The model processes clips up to 10 seconds per pass. Longer footage requires splitting into segments and processing each independently.
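Splitting longer footage can be planned ahead of time. The following is a minimal sketch that computes segment boundaries under the 10 second limit described above; plan_segments is a hypothetical helper, not part of the repository, and the actual cutting of video would be done with a separate tool.

```python
import math

MAX_CLIP_SECONDS = 10.0  # per-pass limit stated in the article


def plan_segments(duration: float,
                  max_len: float = MAX_CLIP_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole clip."""
    if duration <= 0:
        return []
    count = math.ceil(duration / max_len)
    return [(i * max_len, min((i + 1) * max_len, duration)) for i in range(count)]


# A 25 second scene needs three passes; the last one is shorter.
for start, end in plan_segments(25.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Because each segment is processed independently, nothing in this sketch carries a music cue or ambience across a boundary. Placing the cut points on picture edits, rather than on fixed 10 second marks, is one way to make seams less noticeable.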

For projects combining sound generation with video work in a single session, the AI FILMS Studio sound workspace provides sound design tools alongside the video workspace. For voice cloning specifically (generating new speech in a target speaker's voice rather than deriving audio from video), dots.tts from Hilab achieves 54 ms streaming latency, with emotion inferred directly from the target text.


Sources

arXiv: Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
GitHub: NJU-Speech/Foley-Omni
Hugging Face: CocoBro/Foley-Omni