SANA-WM: NVIDIA's Open Source World Model for Minute-Scale Video
NVIDIA NVLabs released SANA-WM, a 2.6-billion-parameter world model that generates 60-second 720p video with six-degrees-of-freedom camera control on a single GPU. The model delivers 36 times the throughput of prior open source baselines while matching the visual quality of closed source alternatives. It is released under Apache 2.0, with weights and code publicly available.
SANA-WM world generation reel
What SANA-WM Does
SANA-WM generates 720p (1280×720) video clips up to 60 seconds long in a single pass. The model takes a starting image and a 6-DoF camera trajectory as input, then synthesizes a spatially consistent world along that path.
A distilled variant runs on a single RTX 5090 with NVFP4 quantization, producing a 60-second clip in 34 seconds, roughly 1.8 times real time. Standard-precision inference on H100-class hardware completes the same clip in roughly two to three minutes.
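To make the conditioning input concrete, here is a minimal sketch of what a 6-DoF trajectory could look like as data. The pose layout (`[x, y, z, yaw, pitch, roll]` per frame) and the 24 fps frame rate are assumptions for illustration, not the repo's documented format:

```python
import numpy as np

fps, duration_s = 24, 60          # assumed output frame rate
n_frames = fps * duration_s       # 1440 poses, one per output frame

# Assumed pose layout: [x, y, z, yaw, pitch, roll], metric units and radians.
t = np.linspace(0.0, 1.0, n_frames)
trajectory = np.stack([
    5.0 * t,                      # dolly forward 5 m over the clip
    np.zeros(n_frames),           # no lateral motion
    np.zeros(n_frames),           # constant height
    0.5 * np.sin(2 * np.pi * t),  # gentle yaw sweep
    np.zeros(n_frames),           # level pitch
    np.zeros(n_frames),           # no roll
], axis=1)

print(trajectory.shape)  # (1440, 6)
```

The starting image plus a per-frame pose array of this kind is the complete conditioning input; everything else in the clip is synthesized.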
NVIDIA's benchmarks show 36 times higher throughput than the nearest open source baseline, with visual quality scores comparable to LingBot-World and HY-WorldPlay according to the published paper.
Minute-Long Worlds
60 second world generation, example 1
60 second world generation, example 2
Twenty-Second Worlds
20 second world generation, example 1
20 second world generation, example 2
Same First Frame, Different Paths
Starting from the same initial image, SANA-WM generates separate world explorations based solely on the camera trajectory given as input. The three clips below share an identical first frame but follow distinct 6-DoF paths.
Path A
Path B
Path C
Refiner Effect
SANA-WM uses a two-stage generation pipeline. The base model produces an initial draft of the full sequence; a separate long-video refiner then makes a second pass over it, correcting inconsistencies that accumulate over long durations. The pair below shows the same clip before and after the refiner stage.
Base model output
After long video refiner
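The two-stage pipeline can be sketched as plain control flow. The windowed refinement scheme below is one plausible reading of "makes a second pass"; the function names and window sizes are placeholders, not the repo's API:

```python
def generate_world(base_model, refiner, image, trajectory,
                   window=120, overlap=24):
    """Draft the full clip in one pass, then refine it in
    overlapping windows so accumulated drift can be corrected."""
    draft = base_model(image, trajectory)   # stage 1: full-sequence draft
    refined = list(draft)
    step = window - overlap
    for start in range(0, len(refined), step):
        chunk = refined[start:start + window]
        fixed = refiner(chunk)              # stage 2: local second pass
        refined[start:start + len(fixed)] = fixed
    return refined

# Toy stand-ins to show the control flow end to end.
frames = list(range(300))
out = generate_world(lambda img, traj: frames,  # "base model"
                     lambda chunk: chunk,       # identity "refiner"
                     image=None, trajectory=None)
print(len(out))  # 300
```

The overlap between windows is what lets the refiner smooth seams between independently corrected chunks.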
Architecture: Hybrid Linear Diffusion Transformer
SANA-WM extends the SANA model family (covered in the NVIDIA SANA video model article) with a hybrid attention design built for long video. Standard transformer attention scales quadratically with sequence length. At 720p and 60 seconds, that cost is prohibitive. SANA-WM addresses this with Gated DeltaNet (GDN), a linear recurrent attention mechanism that processes frames sequentially with constant memory cost per step.
Each transformer block pairs GDN for frame-wise local processing with standard softmax attention for global context. The hybrid design keeps compute manageable at long sequence lengths while preserving long-range temporal modeling.
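A minimal sketch of why linear recurrent attention has constant memory per step: the state is a fixed-size matrix updated token by token. This is a simplified scalar-gated delta rule for illustration, not NVIDIA's exact Gated DeltaNet formulation:

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha=0.9, beta=0.5):
    """One recurrent step: decay the state, apply a delta-rule
    correction toward the new key/value pair, then read with the query.

    S is a (d_v, d_k) state matrix whose size never grows with
    sequence length, unlike a softmax attention KV cache.
    """
    pred = S @ k                                  # state's current guess for this key
    S = alpha * S + beta * np.outer(v - pred, k)  # decay gate + corrective write
    return S, S @ q                               # updated state, output token

d_k = d_v = 8
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
outputs = []
for _ in range(16):                               # process 16 tokens sequentially
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    S, o = gated_delta_step(S, q, k, v)
    outputs.append(o)

print(S.shape, len(outputs))  # state stays (8, 8) no matter how many tokens
```

Softmax attention over the same 16 tokens would cache all 16 keys and values; here the per-step cost and memory are fixed, which is what makes 60-second sequences tractable.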
A dual branch camera control module runs in parallel with the main generation trunk. One branch encodes the 6-DoF trajectory. The other handles content synthesis. They merge at each transformer block, giving the model continuous access to camera position data throughout the full generation pass.
The encoder uses DC-AE (Deep Compression AutoEncoder), which compresses video into a latent space 32 times smaller than the input. Standard VAEs used in most video diffusion models compress at 8 times. The larger compression factor reduces the number of tokens the transformer processes, cutting both memory use and inference time.
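Assuming the 32x factor applies per spatial dimension (the paragraph above does not pin down the exact latent layout), the per-frame token savings work out roughly as follows:

```python
import math

def latent_tokens(width, height, factor):
    # Per-frame token count after factor-times spatial downsampling;
    # sizes are rounded up, as encoders typically pad to a multiple of factor.
    return math.ceil(width / factor) * math.ceil(height / factor)

vae = latent_tokens(1280, 720, 8)     # 160 * 90 = 14400 tokens/frame
dcae = latent_tokens(1280, 720, 32)   # 40 * 23  = 920 tokens/frame
print(vae, dcae, round(vae / dcae, 1))  # ~16x fewer tokens per frame
```

Since self-attention cost grows with the square of token count, a ~16x reduction in tokens per frame compounds into a much larger saving over a full clip.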
| Model | Parameters | Max Duration | Camera Control | License |
|---|---|---|---|---|
| SANA-WM | 2.6B | 60 seconds | 6-DoF full trajectory | Apache 2.0 |
| LingBot-World | N/A | N/A | Camera control | Apache 2.0 |
| HY-WorldPlay | N/A | N/A | Camera control | Closed |
| MosaicMem | N/A | 120 seconds | Camera tokens | Research |
License and Commercial Use
SANA-WM is released under Apache 2.0, which permits commercial use, modification, and redistribution without royalty obligations. Weights are hosted on HuggingFace and the full training and inference code is on GitHub under the same license.
NVIDIA trained the model on 213,000 public video clips annotated with metric-scale 6-DoF camera poses. Training ran on 64 H100 GPUs over 15 days. The paper credits the annotated dataset as the primary factor behind the model's stronger camera trajectory accuracy versus other open source world models.
How to Run SANA-WM Locally
Hardware requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | CUDA GPU, 8GB VRAM (quantized) | RTX 4090 (24GB) or RTX 5090 |
| CUDA | 12.x | 12.4+ |
| RAM | 32GB | 64GB |
| OS | Linux or Windows | Linux |
The minimum configuration requires either the NVFP4 or 8-bit quantized checkpoint. Full precision inference requires A100 or H100 class hardware.
Installation
git clone https://github.com/NVlabs/Sana.git
cd Sana && ./environment_setup.sh sana
The setup script creates a conda environment, installs PyTorch with CUDA support, and downloads the model weights automatically. No manual dependency management is needed.
For a reference on setting up a local GPU video generation environment, the LTX-2 RTX GPU setup tutorial covers the CUDA and driver configuration that applies to any video diffusion model.
Quantization options
| Mode | Target GPU | Inference time for 60s clip |
|---|---|---|
| Full precision | H100 / A100 | 2 to 3 minutes |
| 8-bit | RTX 4090 | ~5 minutes (estimated) |
| NVFP4 | RTX 5090 | 34 seconds |
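The real-time factors implied by this table (60 seconds of output divided by wall-clock inference time; full precision uses the 150-second midpoint of the stated range):

```python
clip_s = 60
timings = {
    "full precision (H100)": 150,   # midpoint of "2 to 3 minutes"
    "8-bit (RTX 4090, est.)": 300,
    "NVFP4 (RTX 5090)": 34,
}
for mode, secs in timings.items():
    print(f"{mode}: {clip_s / secs:.2f}x real time")
# NVFP4 comes out to about 1.8x real time; the other two modes run
# slower than real time (0.40x and 0.20x respectively).
```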
The model also integrates with the diffusers library and has ComfyUI nodes available for node based workflow use without custom Python scripts.
What to Build With It
SANA-WM targets world simulation tasks where a fixed camera path needs to traverse a consistent synthetic environment. The most direct filmmaking application is background plate generation: provide a reference image and a camera move, and the model returns a spatially consistent 60-second clip for compositing behind live action or animation.
For shorter text-to-video generation, the AI FILMS Studio video workspace covers that workflow with the latest commercial models. For minute-long, camera controlled shots where spatial consistency across the full duration matters, SANA-WM is currently the only open source option at 720p and 60 seconds.
The open source world model space now includes several alternatives. LingBot-World targets shorter durations with its own camera control system.
Sources
- arXiv: SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
- GitHub: NVlabs/Sana
- Project page: nvlabs.github.io/Sana/WM/