
SANA-WM: NVIDIA's Open Source World Model for Minute-Scale Video

May 14, 2026

NVIDIA NVLabs released SANA-WM, a 2.6 billion parameter world model that generates 60-second 720p video with six degrees of freedom camera control on a single GPU. The model delivers 36 times higher throughput than prior open source baselines while matching the visual quality of closed source alternatives. It is released under Apache 2.0 with weights and code available publicly.

SANA-WM world generation reel

What SANA-WM Does

SANA-WM generates 720p (1280×720) video clips up to 60 seconds long in a single pass. The model takes a starting image and a 6-DoF camera trajectory as input, then synthesizes a spatially consistent world along that path.
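
In code, that interface amounts to a start frame plus an array of per-frame camera poses. The sketch below shows the idea; the SanaWMPipeline class, its argument names, and the pose format are placeholders for illustration, not the repository's confirmed API.

# Minimal sketch of single-pass generation. Class, method, and argument names
# are assumptions for illustration; check the NVlabs/Sana repository for the
# actual entry points.
import numpy as np
from PIL import Image
from sana_wm import SanaWMPipeline  # hypothetical wrapper, not a confirmed API

pipe = SanaWMPipeline.from_pretrained("nvidia/SANA-WM").to("cuda")

start_frame = Image.open("plate_reference.png").convert("RGB")

# A 6-DoF trajectory: one (x, y, z, roll, pitch, yaw) pose per output frame.
# Here: a slow forward dolly over 60 s at 24 fps (frame rate and units assumed).
num_frames = 60 * 24
trajectory = np.zeros((num_frames, 6), dtype=np.float32)
trajectory[:, 2] = np.linspace(0.0, 5.0, num_frames)  # move 5 m along +z

video = pipe(
    image=start_frame,
    camera_trajectory=trajectory,
    num_frames=num_frames,
    height=720,
    width=1280,
)
video.save("world_60s.mp4")  # hypothetical convenience saver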

A distilled variant runs on a single RTX 5090 with NVFP4 quantization. It produces a 60-second clip in 34 seconds, roughly 1.8 times faster than real time. Standard precision inference on H100 class hardware completes the same clip in about two to three minutes.

NVIDIA benchmarks show 36 times higher throughput than the nearest open source baseline. According to the published paper, visual quality scores are comparable to LingBot-World and the closed source HY-WorldPlay.

Minute Long Worlds

60 second world generation, example 1

60 second world generation, example 2

Twenty Second Worlds

20 second world generation, example 1

20 second world generation, example 2

Same First Frame, Different Paths

Starting from the same initial image, SANA-WM generates distinct world explorations based solely on the camera trajectory given as input. The three clips below share an identical first frame but follow different 6-DoF paths; a short scripting sketch follows the clips.

Path A

Path B

Path C
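
Scripting this is just a matter of swapping the trajectory array while keeping the start frame fixed. The sketch below reuses the hypothetical pipeline and start frame from the earlier example; the trajectory helpers are invented here purely for illustration.

# Reuse one start frame with several camera paths (pipe and start_frame come
# from the earlier sketch; make_dolly/make_orbit are illustrative helpers).
import numpy as np

def make_dolly(num_frames, distance):
    """Straight dolly along +z (negative distance pulls back)."""
    traj = np.zeros((num_frames, 6), dtype=np.float32)
    traj[:, 2] = np.linspace(0.0, distance, num_frames)
    return traj

def make_orbit(num_frames, radius):
    """Quarter orbit around the subject, yawing to keep it framed."""
    theta = np.linspace(0.0, np.pi / 2, num_frames)
    traj = np.zeros((num_frames, 6), dtype=np.float32)
    traj[:, 0] = radius * np.sin(theta)          # x
    traj[:, 2] = radius * (1.0 - np.cos(theta))  # z
    traj[:, 5] = -theta                          # yaw (radians, assumed)
    return traj

num_frames = 20 * 24  # 20 s at 24 fps (frame rate assumed)
paths = {
    "path_a": make_dolly(num_frames, 4.0),
    "path_b": make_orbit(num_frames, 3.0),
    "path_c": make_dolly(num_frames, -4.0),
}

for name, traj in paths.items():
    video = pipe(image=start_frame, camera_trajectory=traj,
                 num_frames=num_frames, height=720, width=1280)
    video.save(f"{name}.mp4")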

Refiner Effect

SANA-WM uses a two stage generation pipeline. The base model generates an initial draft of the full sequence. A separate long video refiner then makes a second pass over the draft, correcting inconsistencies that accumulate over long durations. The pair below shows the same clip before and after the refiner stage.
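
In pseudocode terms the pipeline looks like the sketch below. The module and method names are placeholders, not the repository's actual classes.

# Conceptual sketch of the two stage pipeline described above.
import torch

@torch.no_grad()
def generate_long_video(base_model, refiner, start_latent, trajectory):
    # Stage 1: the base model drafts the full latent sequence in one pass,
    # conditioned on the start frame latent and the 6-DoF trajectory.
    draft_latents = base_model.sample(start_latent, trajectory)

    # Stage 2: the long video refiner re-processes the draft, correcting
    # drift and inconsistencies over the minute-long horizon.
    refined_latents = refiner.refine(draft_latents, trajectory)
    return refined_latents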

Base model output

After long video refiner

Architecture: Hybrid Linear Diffusion Transformer

SANA-WM extends the SANA model family (covered in the NVIDIA SANA video model article) with a hybrid attention design built for long video. Standard transformer attention scales quadratically with sequence length. At 720p and 60 seconds, that cost is prohibitive. SANA-WM addresses this with Gated DeltaNet (GDN), a linear recurrent attention mechanism that processes frames sequentially with constant memory cost per step.
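
The core of that mechanism is a fixed-size recurrent state updated once per token. The loop below is a simplified reference implementation of a gated delta rule update, written for clarity rather than speed; production kernels are chunked and parallelized.

# Simplified reference recurrence for a gated delta rule (the idea behind
# Gated DeltaNet). The state is a fixed-size (d_v x d_k) matrix, so memory
# per step is constant regardless of sequence length.
import torch

def gated_delta_recurrence(q, k, v, alpha, beta):
    """
    q, k:  (T, d_k) queries / keys (keys assumed L2-normalized)
    v:     (T, d_v) values
    alpha: (T,)     per-step decay gate in (0, 1)
    beta:  (T,)     per-step write strength in (0, 1)
    returns outputs of shape (T, d_v)
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)  # recurrent state, constant size
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[t], v[t], q[t]
        pred = S @ k_t  # what the current state retrieves for key k_t
        # Decay old state, erase the stale association for k_t, write the new one.
        S = alpha[t] * (S - beta[t] * torch.outer(pred, k_t)) \
            + beta[t] * torch.outer(v_t, k_t)
        outputs.append(S @ q_t)
    return torch.stack(outputs)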

Each transformer block combines GDN for frame wise local processing with standard softmax attention for global context. The hybrid pairing keeps attention cost manageable over minute long sequences while preserving long range temporal modeling.

A dual branch camera control module runs in parallel with the main generation trunk. One branch encodes the 6-DoF trajectory. The other handles content synthesis. They merge at each transformer block, giving the model continuous access to camera position data throughout the full generation pass.
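
Put together, a block roughly resembles the sketch below. The exact wiring (where the normalizations sit, how the camera features are merged, the stand-in for the GDN layer) is an assumption for illustration; only the three ingredients, a GDN-style local path, softmax attention for global context, and per-block camera conditioning, come from the description above.

# Schematic hybrid block: linear-attention path, softmax-attention path, and
# camera-branch features merged at every block. Wiring choices are assumptions.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.linear_attn = nn.Linear(dim, dim)  # stand-in for a GDN layer
        self.norm2 = nn.LayerNorm(dim)
        self.softmax_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.camera_proj = nn.Linear(dim, dim)  # maps camera-branch features

    def forward(self, x, camera_feats):
        # Merge camera-branch features so every block sees the trajectory.
        x = x + self.camera_proj(camera_feats)
        # Linear (recurrent) attention path: cheap, local, constant memory.
        x = x + self.linear_attn(self.norm1(x))
        # Softmax attention path: full global context across the sequence.
        h = self.norm2(x)
        attn_out, _ = self.softmax_attn(h, h, h)
        x = x + attn_out
        return x + self.mlp(self.norm3(x))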

The encoder uses DC-AE (Deep Compression AutoEncoder), which downsamples each frame by a factor of 32 in both spatial dimensions when mapping it into latent space. Standard VAEs used in most video diffusion models downsample by 8 times. The larger compression factor reduces the number of tokens the transformer processes, cutting both memory use and inference time.
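
A back-of-envelope count shows why the compression factor matters. Assuming one token per latent pixel per frame and no temporal compression (both simplifications for illustration), a 60-second 720p clip at 24 fps works out as follows.

# Rough token counts for a 60 s, 24 fps, 1280x720 clip under the stated
# assumptions (one token per latent pixel per frame, no temporal compression).
frames = 60 * 24  # 1440 frames

def tokens(spatial_downsample):
    h = 720 // spatial_downsample
    w = 1280 // spatial_downsample
    return frames * h * w

print(tokens(8))   # 8x VAE:    1440 * 90 * 160 = 20,736,000 tokens
print(tokens(32))  # 32x DC-AE: 1440 * 22 * 40  =  1,267,200 tokens (~16x fewer)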

Model         | Parameters | Max Duration | Camera Control        | License
SANA-WM       | 2.6B       | 60 seconds   | 6-DoF full trajectory | Apache 2.0
LingBot-World | N/A        | N/A          | Camera control        | Apache 2.0
HY-WorldPlay  | N/A        | N/A          | Camera control        | Closed
MosaicMem     | N/A        | 120 seconds  | Camera tokens         | Research

License and Commercial Use

SANA-WM is released under Apache 2.0, which permits commercial use, modification, and redistribution without royalty obligations. Weights are hosted on HuggingFace and the full training and inference code is on GitHub under the same license.

NVIDIA trained the model on 213,000 public video clips annotated with metric-scale 6-DoF camera poses. Training ran on 64 H100 GPUs over 15 days. The paper credits the annotated dataset as the primary factor behind the model's stronger camera trajectory accuracy versus other open source world models.

How to Run SANA-WM Locally

Hardware requirements

Component | Minimum                        | Recommended
GPU       | CUDA GPU, 8GB VRAM (quantized) | RTX 4090 (24GB) or RTX 5090
CUDA      | 12.x                           | 12.4+
RAM       | 32GB                           | 64GB
OS        | Linux or Windows               | Linux

The minimum configuration requires either the NVFP4 or 8-bit quantized checkpoint. Full precision inference requires A100 or H100 class hardware.

Installation

git clone https://github.com/NVlabs/Sana.git
cd Sana && ./environment_setup.sh sana

The setup script creates a conda environment, installs PyTorch with CUDA support, and downloads the model weights automatically. No manual dependency management is needed.

For a reference on setting up a local GPU video generation environment, the LTX-2 RTX GPU setup tutorial covers the CUDA and driver configuration that applies to any video diffusion model.

Quantization options

Mode           | Target GPU  | Inference time for 60s clip
Full precision | H100 / A100 | 2 to 3 minutes
8-bit          | RTX 4090    | ~5 minutes (estimated)
NVFP4          | RTX 5090    | 34 seconds

The model also integrates with the diffusers library, and ComfyUI nodes are available for node based workflows without custom Python scripts.
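
If a diffusers checkpoint is published for the model, loading would look roughly like the sketch below. The repository id, pipeline behavior, and call arguments are assumptions, so check the model card for the actual usage.

# Sketch of diffusers-style loading. Repo id and call arguments are assumptions.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "nvidia/SANA-WM",            # hypothetical repo id
    torch_dtype=torch.bfloat16,  # pick a quantized variant for smaller GPUs
)
pipe.to("cuda")

result = pipe(
    image=start_frame,             # PIL start frame, as in the earlier sketch
    camera_trajectory=trajectory,  # per-frame 6-DoF poses
    num_frames=60 * 24,
)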

What to Build With It

SANA-WM targets world simulation tasks where a fixed camera path needs to traverse a consistent synthetic environment. The most direct filmmaking application is background plate generation: provide a reference image and a camera move, and the model returns a spatially consistent 60-second clip for compositing behind live action or animation.

For shorter text-to-video generation, the AI FILMS Studio video workspace covers that workflow with the latest commercial models. For minute-long, camera controlled shots where spatial consistency across the full duration matters, SANA-WM is currently the only open source option at 720p and 60 seconds.

The open source world model space now includes several alternatives. LingBot-World targets shorter durations with its own camera control system.

Sources