Foley-Omni Generates Speech, Sound Effects and Music from Video in One Pass

June 9, 2026

Updated: July 22, 2026

Researchers at NJU-Speech published Foley-Omni on June 2, 2026. The 5.5 billion parameter model takes video footage and a text prompt, then returns a complete synchronized soundtrack covering dialogue, sound effects, and music in a single inference pass. The code and weights are available under the MIT license, permitting commercial use.

The Audio Post Production Problem

A short film typically requires three separate audio workflows: dialogue replacement sessions, a foley recording pass for sound effects, and a dedicated music scoring session. Each runs independently, and the outputs are layered in a DAW and aligned to picture afterward.

Foley-Omni replaces all three steps with a single model call per 10 seconds of video. The model reads the video frames alongside a structured text prompt, then returns one stereo audio file where all three modalities are already synchronized to the picture.

How the Model Works

The architecture uses a diffusion transformer (DiT) backbone derived from Wan2.2-TI2V-5B, with audio encoding components from MMAudio. Both components are redistributed as part of Foley-Omni, and the combined model is trained on the joint task of generating all three audio modalities together rather than independently.

Foley-Omni featured demo: speech, sound effects, and music generated in a single pass. Courtesy of NJU-Speech.

Audio output is at 16 kHz. That sample rate works for reference tracks, scratch audio, and guide mixes during editorial. Final delivery for broadcast or theatrical requires upsampling through a dedicated audio post step.

The Structured Text Interface

Foley-Omni routes three audio modalities through a shared textual interface using field tags. [WORDS] handles spoken dialogue, [AUDIO] handles sound events and foley, and [MUSIC] handles musical accompaniment. All three tags map to a shared latent space inside the DiT backbone.

A filmmaker writes a single prompt covering the entire soundscape of a scene. In a prompt that describes a character speaking, a door closing, and a piano cue in the background, the field tags route each element to the correct audio modality, and the model generates all three synchronized to the picture in one pass. No previous open source model achieves this under a commercially usable license.

The model also runs in single task mode. A scene requiring only sound effects uses the [AUDIO] tag alone, with the other fields left empty.
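To make the tagged interface concrete, here is a minimal sketch of how such a prompt could be assembled. The three tag names come from the description above. The exact serialization (one field per line, empty fields kept as bare tags) and the helper name build_prompt are assumptions for illustration, so the format expected by the repository's inference script takes precedence.

```python
# Field tags named in the article. How the real inference script expects
# them to be laid out in the prompt string is an assumption in this sketch.
FIELD_TAGS = ("WORDS", "AUDIO", "MUSIC")


def build_prompt(words: str = "", audio: str = "", music: str = "") -> str:
    """Join the three fields into one tagged prompt string (hypothetical helper).

    Empty fields are kept as bare tags so that single task prompts
    have the same shape as full soundtrack prompts.
    """
    values = {"WORDS": words, "AUDIO": audio, "MUSIC": music}
    parts = []
    for tag in FIELD_TAGS:
        text = values[tag].strip()
        # rstrip() drops the trailing space when a field is empty.
        parts.append(f"[{tag}] {text}".rstrip())
    return "\n".join(parts)


# Full soundtrack: dialogue, a sound event, and a music cue in one prompt.
full_scene = build_prompt(
    words="A woman says: I told you not to come back.",
    audio="A heavy wooden door closes.",
    music="Sparse, slow piano in the background.",
)

# Single task mode: only the [AUDIO] field is filled in.
effects_only = build_prompt(audio="Rain on a tin roof, distant thunder.")

print(full_scene)
print(effects_only)
```

Keeping all three tags in every prompt is a design choice in this sketch, not a documented requirement.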

Sound Design Across Scene Types

The following pairs show Foley-Omni generating sound design across different scene types, from ambient environments to precise synchronous effects.

Sound effects and ambient audio generation. Courtesy of NJU-Speech.

Synchronous sound design for scene footage. Courtesy of NJU-Speech.

Music Generation

Foley-Omni reads both the visual content and the [MUSIC] field to time compositional elements to cuts and motion within the clip.

Music generation synchronized to video footage. Courtesy of NJU-Speech.

For comparison, Magenta RealTime 2, released the same week, takes a different approach. It runs inside a DAW as an Audio Unit plugin, generating live music in response to MIDI and text prompts with under 200 milliseconds of latency. Foley-Omni generates music locked to existing footage rather than generating it interactively. Both carry commercially usable licenses.

The V2ST Benchmark

The paper introduces V2ST-Bench, a new evaluation dataset for the video to soundtrack task. No standardized benchmark existed before this release for measuring how well a model generates all three audio modalities simultaneously from video.

The benchmark results show that joint training across speech, sound effects, and music improves performance on each individual subtask compared to models trained on a single modality. The shared latent representation allows information from one audio type to improve generation quality in the others.

Video to audio synthesis from the V2ST benchmark evaluation set. Courtesy of NJU-Speech.

Second video to audio benchmark example showing synchronized generation. Courtesy of NJU-Speech.

What Joint Training Achieves

The V2ST-Bench results demonstrate the practical value of training all three audio modalities simultaneously. On the individual subtasks measured by the benchmark, joint training improves each score compared to models trained on a single modality. Speech generation improves because the model learns acoustic environment representations from sound effects training data. Music generation improves because temporal alignment patterns from speech training carry over to score generation.

The mechanism is the shared latent space inside the DiT backbone. All three audio types occupy the same representational space during generation. When Foley-Omni generates a scene where dialogue and environmental audio coexist, it draws on representations learned from both modalities at once rather than generating them as separate parallel tracks. That shared processing is why the joint model produces better individual subtask results than siloed models.

The improvement is not uniform across all subtasks. Speech generation, which requires the closest alignment to specific phoneme sequences and prosodic patterns, shows a more modest relative gain than music generation, where the joint training signal from sound effects and speech provides broader acoustic context. The benchmark is designed to surface these variations rather than aggregate them into a single score. That granularity makes V2ST-Bench more useful as an evaluation standard than a composite metric, showing where joint training delivered a gain and where the modality boundaries remained a limiting factor during training.

Processing Longer Footage

The 10 second processing window is a hardware constraint rather than a creative limit. For a short dialogue exchange or a sound effect drop, 10 seconds is longer than most individual shots. For scenes that run longer, the workflow is to identify natural edit points in the footage, process each segment independently, and assemble the audio tracks in the editing session.
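The splitting step can be sketched as a small planning function. This is an illustration under stated assumptions, not part of the Foley-Omni tooling: the function name plan_segments is hypothetical, edit points are supplied by hand in seconds, and any span still longer than 10 seconds after cutting at edit points is simply divided into equal parts.

```python
import math

MAX_SEGMENT_SECONDS = 10.0  # per-pass clip limit stated in the article


def plan_segments(duration, edit_points, max_len=MAX_SEGMENT_SECONDS):
    """Return (start, end) spans in seconds, each no longer than max_len.

    Cuts are placed at the given edit points first. Any span that is
    still too long is divided into equal parts.
    """
    inner = (float(p) for p in edit_points if 0 < p < duration)
    cuts = sorted({0.0, float(duration), *inner})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        span = end - start
        pieces = max(1, math.ceil(span / max_len))
        step = span / pieces
        for i in range(pieces):
            seg_start = start + i * step
            # Reuse the exact end value on the last piece to avoid drift.
            seg_end = end if i == pieces - 1 else start + (i + 1) * step
            segments.append((seg_start, seg_end))
    return segments


# A 24 second scene with edit points at 7.5 s and 19 s.
for start, end in plan_segments(24, [7.5, 19]):
    print(f"{start:6.2f} s to {end:6.2f} s")
```

Each resulting span would then be exported as its own clip and passed to the model separately.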

The challenge is maintaining continuity across segment boundaries. The model generates each 10 second clip as an independent audio environment. When two segments are joined, the resulting audio tracks may show perceptible discontinuities in ambience level or reverb character at the cut point. A short crossfade between segments, typically 0.5 to 1 second, bridges these transitions in the assembled audio without introducing gaps or overlaps.
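As a rough illustration of what that crossfade does, the sketch below joins two mono segments, held as plain lists of float samples, with a linear fade. It assumes the two segments were exported with an overlap equal to the fade length, because a crossfade mixes the tail of one segment with the head of the next and so shortens the joined audio by that amount. The function name and the list representation are illustrative. In practice this step happens on the DAW timeline.

```python
def crossfade(a, b, sample_rate=16_000, fade_seconds=0.5):
    """Join two mono sample lists with a linear crossfade.

    The last fade_seconds of `a` are mixed with the first fade_seconds
    of `b`, so the result is shorter than len(a) + len(b) by the fade length.
    """
    n = int(round(sample_rate * fade_seconds))
    if n <= 0:
        return list(a) + list(b)
    if n > len(a) or n > len(b):
        raise ValueError("fade is longer than one of the segments")
    head = list(a[:len(a) - n])   # part of a that plays unmixed
    tail = list(b[n:])            # part of b that plays unmixed
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)     # weight of b rises from near 0 to near 1
        mixed.append((1.0 - w) * a[len(a) - n + i] + w * b[i])
    return head + mixed + tail


# Two one second segments at 16 kHz joined with a 0.5 second fade.
first = [0.2] * 16_000
second = [0.6] * 16_000
joined = crossfade(first, second)
print(len(joined))
```

A linear curve is used here for simplicity. An equal-power curve is a common alternative when the two sides are uncorrelated, as independently generated ambiences tend to be.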

Scenes involving extended action sequences, sustained ambient environments, or continuous dialogue runs are the primary cases for segment processing. A 30 second sequence processed in three 10 second segments produces three audio files, each covering a distinct portion of the scene. Aligning and crossfading these in a DAW takes significantly less time than a traditional recording session for the same material, because the model generates all three audio types in a single inference pass per segment.

Getting Started

The model requires Python 3.10, CUDA 12.4, PyTorch 2.6.0, and FlashAttention 2.7.4. It runs on a GPU with enough VRAM to hold a 5.5B parameter model, consistent with other models in that size range.
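For a sense of scale, the memory taken by the weights alone can be estimated from the parameter count. The calculation below is back-of-envelope arithmetic, not a published requirement: the precision of the released checkpoint is not stated here, and activations plus the video and audio encoding components add to the total.

```python
PARAMS = 5.5e9  # parameter count stated in the article

# Bytes per parameter for common numeric formats. Which format the released
# checkpoint uses is an assumption left open here.
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gib(params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return params * bytes_per_param / 1024**3


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype:>18}: {weight_memory_gib(PARAMS, size):5.1f} GiB")
```

At 16-bit precision the weights come to a little over 10 GiB, so real usage will sit above that figure once inference buffers are included.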

Weights are available on HuggingFace at CocoBro/Foley-Omni. The GitHub repository at NJU-Speech/Foley-Omni includes an inference script and full environment setup. The model processes clips up to 10 seconds per pass. Longer footage requires splitting into segments and processing each independently.

For teams processing multiple scenes in sequence, the inference script accepts batch inputs, running multiple clips through the model without reloading weights between segments. The first inference pass on a cold start incurs a model loading overhead. Subsequent passes run at full generation speed, making batch processing significantly more efficient per segment than running the script independently for each clip.

Productions targeting 48 kHz stereo for broadcast delivery should plan an upsampling step after Foley-Omni generation, as the model's 16 kHz output requires conversion before final mixing in Pro Tools or similar professional DAWs.
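The conversion itself is an integer ratio, since 48 kHz is exactly three times 16 kHz. The sketch below shows that ratio using linear interpolation on a list of float samples. Linear interpolation is a deliberately simple stand-in for the band-limited resampler a DAW or dedicated tool would apply, and the function name is illustrative. Note that upsampling changes the sample rate only: 16 kHz audio carries no content above 8 kHz, and the converted file will not either.

```python
def upsample_linear(samples, factor=3):
    """Upsample a mono signal by an integer factor with linear interpolation.

    factor=3 turns 16 kHz audio into 48 kHz audio.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        for k in range(factor):
            # k = 0 reproduces the input sample, later k values fill the gap.
            out.append(current + (following - current) * k / factor)
    # Hold the last input sample so the output has len(samples) * factor values.
    out.extend([samples[-1]] * factor)
    return out


one_second_16k = [0.0] * 16_000
one_second_48k = upsample_linear(one_second_16k)
print(len(one_second_48k))
```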

For projects combining sound generation with video work in a single session, the AI FILMS Studio sound workspace provides sound design tools alongside the video workspace. For voice cloning specifically, meaning new speech generated in a target speaker's voice rather than audio generated from video, dots.tts from Hilab achieves 54 ms streaming latency, with emotion inferred directly from the target text.

Sources

arXiv: Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
GitHub: NJU-Speech/Foley-Omni
Hugging Face: CocoBro/Foley-Omni
