
MOVA: Open Source Video-Audio Generation Available for Commercial Use

February 9, 2026

OpenMOSS has released MOVA (MOSS Video and Audio), an open source foundation model that generates synchronized video and audio simultaneously. Unlike proprietary systems such as Veo 3 and Sora 2, MOVA is available under the Apache 2.0 license, permitting commercial use without restrictions.

The model addresses a fundamental limitation in existing video generation systems. Most platforms generate video first, then add audio as a separate step, creating synchronization issues. MOVA produces both modalities in a single inference pass, eliminating cascaded pipeline errors.

Technical Architecture and Performance

MOVA employs a 32 billion parameter architecture with 18 billion parameters active during inference. The model uses a Mixture of Experts (MoE) design with an asymmetric dual-tower structure and bidirectional cross-attention mechanisms.

On Verse Bench evaluations, the 720p version achieved state-of-the-art results in multilingual lip-synchronization and speech recognition accuracy, outperforming existing open source alternatives across the tested metrics.

Open Source Release and Commercial Licensing

The Apache 2.0 license grants users the right to use, modify, and distribute the model for commercial projects. OpenMOSS has released the complete package, including model weights, inference code, training pipelines, and LoRA fine-tuning scripts.

Two versions are available via Hugging Face. MOVA 360p provides lower resolution output suitable for rapid prototyping and testing, while MOVA 720p delivers higher resolution results for production use. Both versions support text-to-video-audio and image-to-video-audio generation.

Installation Guide for Local Deployment

Installation requires a conda environment and the pip package manager. The process takes approximately 10 minutes on standard hardware.

First, create and activate a conda environment with Python 3.13:

conda create -n mova python=3.13 -y
conda activate mova

Clone the repository and install dependencies:

git clone https://github.com/OpenMOSS/MOVA.git
cd MOVA
pip install -e .

For users planning to train custom models or fine-tune with LoRA, install the additional training dependencies:

pip install -e ".[train]"

Download the model weights from Hugging Face:

pip install -U huggingface_hub
hf download OpenMOSS-Team/MOVA-360p --local-dir /path/to/MOVA-360p

Replace the path with your preferred installation directory. The 360p model requires approximately 15GB of disk space. The 720p version requires 28GB.
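
After the download completes, a quick sanity check can confirm the snapshot size and contents, assuming the same /path/to/MOVA-360p directory used above:

# Confirm the snapshot size roughly matches the expected footprint (~15GB for 360p)
du -sh /path/to/MOVA-360p

# List the weight files to verify the download completed
ls -lh /path/to/MOVA-360p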

Hardware Requirements and Performance

MOVA supports multiple deployment configurations based on available GPU memory. For 8 second 360p video generation, the system offers two offloading modes.

Component-wise offload requires 48GB of VRAM and runs at 37.5 seconds per step on an RTX 4090. Layerwise offload reduces the memory requirement to 12GB of VRAM with a modest performance cost, at 42.3 seconds per step.
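
Before choosing an offload mode, it is worth confirming how much memory the target GPU actually exposes. A quick check with nvidia-smi:

# Report each GPU's name plus total and currently used VRAM
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv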

LoRA fine-tuning operates in three modes. Low-resource mode runs on a single GPU with 18GB of VRAM. Accelerate mode optimizes single GPU training. Accelerate with FSDP distributes training across 8 GPUs, requiring approximately 50GB of VRAM per GPU.

Usage for AI Filmmakers

Filmmakers can generate video with synchronized audio from text prompts or reference images. The single-person speech generation mode accepts a prompt describing the subject and action alongside a reference image.

Example command for generating a formal speech scene:

torchrun --nproc_per_node=1 scripts/inference_single.py \
  --ckpt_path /path/to/MOVA-360p/ \
  --height 352 --width 640 \
  --prompt "A woman speaks formally in a conference room" \
  --ref_path ./reference-image.jpg \
  --output_path ./output.mp4

The system generates video at 352x640 resolution with synchronized speech matching the prompt description. Generation time varies based on hardware configuration and selected quality settings.
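
Because inference runs from a standard command line script, several prompts can be rendered back to back with an ordinary shell loop. A sketch reusing the command above (the prompts and output names are placeholders; in practice each prompt would use its own matching reference image):

# Render a small batch of prompts sequentially on a single GPU
prompts=(
  "A woman speaks formally in a conference room"
  "A man reads a news bulletin at a desk"
)
for i in "${!prompts[@]}"; do
  torchrun --nproc_per_node=1 scripts/inference_single.py \
    --ckpt_path /path/to/MOVA-360p/ \
    --height 352 --width 640 \
    --prompt "${prompts[$i]}" \
    --ref_path ./reference-image.jpg \
    --output_path "./output_${i}.mp4"
done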

For filmmakers working with AI FILMS Studio, MOVA offers a complementary tool for previsualization and concept testing. The open source nature permits integration into existing workflows without licensing concerns.

Environment Aware Audio Generation

MOVA generates sound effects that match visual context. The model analyzes scene composition, lighting, and movement to produce appropriate environmental audio. A scene depicting rain generates corresponding water impact sounds. Urban environments trigger city ambiance.

This capability reduces post-production audio work. Filmmakers can evaluate scene effectiveness with synchronized sound during pre-production rather than waiting for final audio mixing.

Training Custom Models with LoRA

OpenMOSS includes complete training scripts for LoRA fine-tuning. Filmmakers can train the model on specific visual styles, character types, or audio characteristics. Configuration files in the configs/training/ directory control the training parameters.
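
The documented training command may differ by release, so the following is only a hedged sketch: the script path, config file name, and flags are illustrative assumptions standing in for whatever the configs/training/ directory actually references.

# Hypothetical LoRA fine-tuning launch; the script, config, and flags below
# are assumptions, not the repository's documented interface
accelerate launch scripts/train_lora.py \
  --config configs/training/lora_low_resource.yaml \
  --data_dir ./my_video_audio_pairs \
  --output_dir ./lora_checkpoints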

Low resource training mode enables fine tuning on a single consumer GPU. This democratizes access to custom model training previously limited to organizations with extensive compute infrastructure.

The training process accepts video-audio pairs as input data. Filmmakers can use their own footage to teach the model specific aesthetic preferences or narrative styles. Training duration ranges from several hours to days depending on dataset size and hardware configuration.

Integration with Production Workflows

MOVA outputs standard MP4 files compatible with all major editing software. The generated content integrates directly into Adobe Premiere, DaVinci Resolve, Final Cut Pro, and other professional tools.
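
Before importing a generated clip into an editor, ffprobe (part of the ffmpeg suite, assumed to be installed separately) can confirm that the MP4 contains the expected video and audio streams:

# Print codec, resolution, and audio sample rate for each stream in the clip
ffprobe -v error -show_entries stream=codec_name,width,height,sample_rate \
  -of default=noprint_wrappers=1 ./output.mp4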

For teams using SAG-AFTRA compliant workflows, the open source nature of MOVA provides transparency regarding training data and model behavior. This addresses concerns about rights and compensation that affect proprietary systems.

The Apache 2.0 license permits modifications to the codebase. Studios can adapt the inference pipeline to specific production requirements or integrate MOVA into larger automated systems.

Performance Benchmarks and Quality

Testing on the 720p model demonstrates competitive performance with commercial alternatives. Lip-synchronization accuracy exceeds 95% on English-language prompts. Multilingual support includes Mandarin, Spanish, French, and German, with varying accuracy rates.

Audio fidelity reaches a 44.1kHz sampling rate with stereo output. The model maintains temporal consistency across generated frames, reducing the flicker and artifacts common in diffusion-based video systems.

Generation speed scales with available compute. An 8 GPU setup produces 8 second clips in approximately 3 minutes. Single GPU configurations with layerwise offload require 15-20 minutes for equivalent output.

Limitations and Considerations

MOVA exhibits typical diffusion model limitations. Complex scenes with multiple moving subjects sometimes produce inconsistent motion. Fine detail in background elements may lack sharpness at 360p resolution.

The 720p model improves visual quality but increases compute requirements proportionally. Teams should evaluate whether local deployment justifies the infrastructure cost versus using cloud based alternatives.

Long-form content generation requires stitching multiple clips together. The model currently supports sequences of at most 8 seconds; extending to longer durations without introducing discontinuities remains an active development area.
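
Until longer native sequences arrive, a common workaround is to join several 8 second clips with ffmpeg's concat demuxer, assuming every clip shares the same resolution, frame rate, and audio format:

# clips.txt lists the segments in playback order, one per line, e.g.:
#   file 'output_0.mp4'
#   file 'output_1.mp4'
ffmpeg -f concat -safe 0 -i clips.txt -c copy stitched.mp4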


Resources

GitHub Repository: OpenMOSS/MOVA
Hugging Face Models: OpenMOSS-Team/MOVA-360p and MOVA-720p