
SANA-WM: NVIDIA's Open Source World Model for Minute-Scale Video

May 14, 2026

NVIDIA NVLabs released SANA-WM, a 2.6 billion parameter world model that generates 60-second 720p video with six degrees of freedom camera control on a single GPU. The model delivers 36 times higher throughput than prior open source baselines while matching the visual quality of closed source alternatives. It is released under Apache 2.0 with weights and code available publicly.

SANA-WM world generation reel

What SANA-WM Does

SANA-WM generates 720p (1280×720) video clips up to 60 seconds long in a single pass. The model takes a starting image and a 6-DoF camera trajectory as input, then synthesizes a spatially consistent world along that path.
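
In code, that interface amounts to a start frame plus an array of per-frame camera poses. The sketch below shows the idea; the SanaWMPipeline class, its argument names, and the pose format are placeholders for illustration, not the repository's confirmed API.

# Minimal sketch of single-pass generation. Class, method, and argument names
# are assumptions for illustration; check the NVlabs/Sana repository for the
# actual entry points.
import numpy as np
from PIL import Image
from sana_wm import SanaWMPipeline  # hypothetical wrapper, not a confirmed API

pipe = SanaWMPipeline.from_pretrained("nvidia/SANA-WM").to("cuda")

start_frame = Image.open("plate_reference.png").convert("RGB")

# A 6-DoF trajectory: one (x, y, z, roll, pitch, yaw) pose per output frame.
# Here: a slow forward dolly over 60 s at 24 fps (frame rate and units assumed).
num_frames = 60 * 24
trajectory = np.zeros((num_frames, 6), dtype=np.float32)
trajectory[:, 2] = np.linspace(0.0, 5.0, num_frames)  # move 5 m along +z

video = pipe(
    image=start_frame,
    camera_trajectory=trajectory,
    num_frames=num_frames,
    height=720,
    width=1280,
)
video.save("world_60s.mp4")  # hypothetical convenience saver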

A distilled variant runs on a single RTX 5090 with NVFP4 quantization. It produces a 60-second clip in 34 seconds, roughly 1.8 times faster than real time. Standard precision inference on H100 class hardware completes the same clip in about two to three minutes.

NVIDIA benchmarks show 36 times higher throughput than the nearest open source baseline. According to the published paper, visual quality scores are comparable to LingBot-World and the closed source HY-WorldPlay.

Minute Long Worlds

60 second world generation, example 1

60 second world generation, example 2

Twenty Second Worlds

20 second world generation, example 1

20 second world generation, example 2

Same First Frame, Different Paths

Starting from the same initial image, SANA-WM generates distinct world explorations based solely on the camera trajectory given as input. The three clips below share an identical first frame but follow different 6-DoF paths; a short scripting sketch follows the clips.

Path A

Path B

Path C
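
Scripting this is just a matter of swapping the trajectory array while keeping the start frame fixed. The sketch below reuses the hypothetical pipeline and start frame from the earlier example; the trajectory helpers are invented here purely for illustration.

# Reuse one start frame with several camera paths (pipe and start_frame come
# from the earlier sketch; make_dolly/make_orbit are illustrative helpers).
import numpy as np

def make_dolly(num_frames, distance):
    """Straight dolly along +z (negative distance pulls back)."""
    traj = np.zeros((num_frames, 6), dtype=np.float32)
    traj[:, 2] = np.linspace(0.0, distance, num_frames)
    return traj

def make_orbit(num_frames, radius):
    """Quarter orbit around the subject, yawing to keep it framed."""
    theta = np.linspace(0.0, np.pi / 2, num_frames)
    traj = np.zeros((num_frames, 6), dtype=np.float32)
    traj[:, 0] = radius * np.sin(theta)          # x
    traj[:, 2] = radius * (1.0 - np.cos(theta))  # z
    traj[:, 5] = -theta                          # yaw (radians, assumed)
    return traj

num_frames = 20 * 24  # 20 s at 24 fps (frame rate assumed)
paths = {
    "path_a": make_dolly(num_frames, 4.0),
    "path_b": make_orbit(num_frames, 3.0),
    "path_c": make_dolly(num_frames, -4.0),
}

for name, traj in paths.items():
    video = pipe(image=start_frame, camera_trajectory=traj,
                 num_frames=num_frames, height=720, width=1280)
    video.save(f"{name}.mp4")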

Refiner Effect

SANA-WM uses a two stage generation pipeline. The base model generates an initial draft of the full sequence. A separate long video refiner then makes a second pass over the draft, correcting inconsistencies that accumulate over long durations. The pair below shows the same clip before and after the refiner stage.
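
In pseudocode terms the pipeline looks like the sketch below. The module and method names are placeholders, not the repository's actual classes.

# Conceptual sketch of the two stage pipeline described above.
import torch

@torch.no_grad()
def generate_long_video(base_model, refiner, start_latent, trajectory):
    # Stage 1: the base model drafts the full latent sequence in one pass,
    # conditioned on the start frame latent and the 6-DoF trajectory.
    draft_latents = base_model.sample(start_latent, trajectory)

    # Stage 2: the long video refiner re-processes the draft, correcting
    # drift and inconsistencies over the minute-long horizon.
    refined_latents = refiner.refine(draft_latents, trajectory)
    return refined_latents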

Base model output

After long video refiner

Architecture: Hybrid Linear Diffusion Transformer

SANA-WM extends the SANA model family (covered in the NVIDIA SANA video model article) with a hybrid attention design built for long video. Standard transformer attention scales quadratically with sequence length. At 720p and 60 seconds, that cost is prohibitive. SANA-WM addresses this with Gated DeltaNet (GDN), a linear recurrent attention mechanism that processes frames sequentially with constant memory cost per step.
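
The core of that mechanism is a fixed-size recurrent state updated once per token. The loop below is a simplified reference implementation of a gated delta rule update, written for clarity rather than speed; production kernels are chunked and parallelized.

# Simplified reference recurrence for a gated delta rule (the idea behind
# Gated DeltaNet). The state is a fixed-size (d_v x d_k) matrix, so memory
# per step is constant regardless of sequence length.
import torch

def gated_delta_recurrence(q, k, v, alpha, beta):
    """
    q, k:  (T, d_k) queries / keys (keys assumed L2-normalized)
    v:     (T, d_v) values
    alpha: (T,)     per-step decay gate in (0, 1)
    beta:  (T,)     per-step write strength in (0, 1)
    returns outputs of shape (T, d_v)
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)  # recurrent state, constant size
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[t], v[t], q[t]
        pred = S @ k_t  # what the current state retrieves for key k_t
        # Decay old state, erase the stale association for k_t, write the new one.
        S = alpha[t] * (S - beta[t] * torch.outer(pred, k_t)) \
            + beta[t] * torch.outer(v_t, k_t)
        outputs.append(S @ q_t)
    return torch.stack(outputs)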

Each transformer block combines GDN for frame wise local processing with standard softmax attention for global context. The hybrid pairing keeps attention cost manageable over minute long sequences while preserving long range temporal modeling.

A dual branch camera control module runs in parallel with the main generation trunk. One branch encodes the 6-DoF trajectory. The other handles content synthesis. They merge at each transformer block, giving the model continuous access to camera position data throughout the full generation pass.
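
Put together, a block roughly resembles the sketch below. The exact wiring (where the normalizations sit, how the camera features are merged, the stand-in for the GDN layer) is an assumption for illustration; only the three ingredients, a GDN-style local path, softmax attention for global context, and per-block camera conditioning, come from the description above.

# Schematic hybrid block: linear-attention path, softmax-attention path, and
# camera-branch features merged at every block. Wiring choices are assumptions.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.linear_attn = nn.Linear(dim, dim)  # stand-in for a GDN layer
        self.norm2 = nn.LayerNorm(dim)
        self.softmax_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.camera_proj = nn.Linear(dim, dim)  # maps camera-branch features

    def forward(self, x, camera_feats):
        # Merge camera-branch features so every block sees the trajectory.
        x = x + self.camera_proj(camera_feats)
        # Linear (recurrent) attention path: cheap, local, constant memory.
        x = x + self.linear_attn(self.norm1(x))
        # Softmax attention path: full global context across the sequence.
        h = self.norm2(x)
        attn_out, _ = self.softmax_attn(h, h, h)
        x = x + attn_out
        return x + self.mlp(self.norm3(x))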

The encoder uses DC-AE (Deep Compression AutoEncoder), which downsamples each frame by a factor of 32 in both spatial dimensions when mapping it into latent space. Standard VAEs used in most video diffusion models downsample by 8 times. The larger compression factor reduces the number of tokens the transformer processes, cutting both memory use and inference time.
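
A back-of-envelope count shows why the compression factor matters. Assuming one token per latent pixel per frame and no temporal compression (both simplifications for illustration), a 60-second 720p clip at 24 fps works out as follows.

# Rough token counts for a 60 s, 24 fps, 1280x720 clip under the stated
# assumptions (one token per latent pixel per frame, no temporal compression).
frames = 60 * 24  # 1440 frames

def tokens(spatial_downsample):
    h = 720 // spatial_downsample
    w = 1280 // spatial_downsample
    return frames * h * w

print(tokens(8))   # 8x VAE:    1440 * 90 * 160 = 20,736,000 tokens
print(tokens(32))  # 32x DC-AE: 1440 * 22 * 40  =  1,267,200 tokens (~16x fewer)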

Model         | Parameters | Max Duration | Camera Control        | License
SANA-WM       | 2.6B       | 60 seconds   | 6-DoF full trajectory | Apache 2.0
LingBot-World | N/A        | N/A          | Camera control        | Apache 2.0
HY-WorldPlay  | N/A        | N/A          | Camera control        | Closed
MosaicMem     | N/A        | 120 seconds  | Camera tokens         | Research

License and Commercial Use

SANA-WM is released under Apache 2.0, which permits commercial use, modification, and redistribution without royalty obligations. Weights are hosted on HuggingFace and the full training and inference code is on GitHub under the same license.

NVIDIA trained the model on 213,000 public video clips annotated with metric-scale 6-DoF camera poses. Training ran on 64 H100 GPUs over 15 days. The paper credits the annotated dataset as the primary factor behind the model's stronger camera trajectory accuracy versus other open source world models.

How to Run SANA-WM Locally

Hardware requirements

Component | Minimum                        | Recommended
GPU       | CUDA GPU, 8GB VRAM (quantized) | RTX 4090 (24GB) or RTX 5090
CUDA      | 12.x                           | 12.4+
RAM       | 32GB                           | 64GB
OS        | Linux or Windows               | Linux

The minimum configuration requires either the NVFP4 or 8-bit quantized checkpoint. Full precision inference requires A100 or H100 class hardware.

Installation

git clone https://github.com/NVlabs/Sana.git
cd Sana && ./environment_setup.sh sana

The setup script creates a conda environment, installs PyTorch with CUDA support, and downloads the model weights automatically. No manual dependency management is needed.

For a reference on setting up a local GPU video generation environment, the LTX-2 RTX GPU setup tutorial covers the CUDA and driver configuration that applies to any video diffusion model.

Quantization options

Mode           | Target GPU  | Inference time for 60s clip
Full precision | H100 / A100 | 2 to 3 minutes
8-bit          | RTX 4090    | ~5 minutes (estimated)
NVFP4          | RTX 5090    | 34 seconds

The model also integrates with the diffusers library, and ComfyUI nodes are available for node based workflows without custom Python scripts.
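
If a diffusers checkpoint is published for the model, loading would look roughly like the sketch below. The repository id, pipeline behavior, and call arguments are assumptions, so check the model card for the actual usage.

# Sketch of diffusers-style loading. Repo id and call arguments are assumptions.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "nvidia/SANA-WM",            # hypothetical repo id
    torch_dtype=torch.bfloat16,  # pick a quantized variant for smaller GPUs
)
pipe.to("cuda")

result = pipe(
    image=start_frame,             # PIL start frame, as in the earlier sketch
    camera_trajectory=trajectory,  # per-frame 6-DoF poses
    num_frames=60 * 24,
)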

What to Build With It

SANA-WM targets world simulation tasks where a fixed camera path needs to traverse a consistent synthetic environment. The most direct filmmaking application is background plate generation: provide a reference image and a camera move, and the model returns a spatially consistent 60-second clip for compositing behind live action or animation.

For shorter text-to-video generation, the AI FILMS Studio video workspace covers that workflow with the latest commercial models. For minute-long, camera controlled shots where spatial consistency across the full duration matters, SANA-WM is currently the only open source option at 720p and 60 seconds.

The open source world model space now includes several alternatives. LingBot-World targets shorter durations with its own camera control system.

Sources