SANA Video | small, fast text to video with linear attention

SANA Video | what matters
SANA Video is a small diffusion model for fast, long form video generation. The project page describes a target of up to 720 by 1280 resolution, a maximum duration of around one minute, and results that track prompts closely while keeping latency low enough to feel practical.

The design centers on a Linear Diffusion Transformer, which replaces standard quadratic attention with a linear variant to cut compute on large token counts. On top of that, the authors introduce a constant memory state derived from the cumulative properties of linear attention. Instead of caching a growing set of keys and values as the sequence gets longer, the model maintains a fixed size state that preserves global context at steady memory cost. That is the trick that lets it extend to longer clips without running out of VRAM or exploding inference time. The authors also report a training recipe that emphasizes data filtering and staged resolution increases to contain cost while preserving motion quality and aesthetics.
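To see why linear attention admits a constant memory state, here is a minimal sketch of the generic linear attention recurrence. This is the textbook formulation, not SANA Video's exact method: the feature map, shapes, and function names below are assumptions for illustration. The key point is that the state is two fixed size accumulators, so memory does not grow with sequence length the way a key value cache does.

```python
# Sketch of the constant memory state behind linear attention.
# Generic linear attention recurrence, NOT SANA Video's exact formulation;
# the feature map phi and all names here are illustrative assumptions.
import numpy as np

def phi(x):
    # A common positive feature map for linear attention (ELU + 1).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_stream(queries, keys, values):
    """Process tokens one at a time with a fixed size state.

    S accumulates outer products phi(k) v^T and z accumulates phi(k),
    so memory stays O(d * d_v) no matter how long the sequence gets,
    instead of caching every past key and value.
    """
    d = queries.shape[-1]
    d_v = values.shape[-1]
    S = np.zeros((d, d_v))   # running sum of outer products
    z = np.zeros(d)          # running normalizer
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        S += np.outer(fk, v)              # constant size update per token
        z += fk
        fq = phi(q)
        outputs.append(fq @ S / (fq @ z + 1e-6))
    return np.stack(outputs)
```

Because each step only updates `S` and `z` in place, generating a one minute clip costs the same attention memory as a five second one; that is the cumulative property the paper builds on.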
For filmmaking the appeal is straightforward. Many text to video systems can produce beautiful five second shots, but long sequences usually bog down on memory or drift off prompt. SANA Video aims to hold both length and coherence while running on accessible hardware. The page highlights deployment on an RTX 5090 using NVFP4 precision with a reported speedup over standard settings, and the paper compares favorably with other small models on latency. There are also image to video examples and references to multi scene transitions for longer narratives.

If those claims hold when the code lands, SANA Video could serve as a fast previz engine for story beats, blocking, and tone tests where you need more than a single burst of motion. Until then, treat the page and preprint as a technical roadmap and a set of reported benchmarks rather than a drop in tool.
Availability and licensing
As of now, the public page lists Code (coming soon) and states that the code and model will be publicly released. No repository, license file, or model weights have been posted yet. That means there is no open source release at this moment, and no published license permitting commercial use. You can read the paper and review the demos, but you cannot adopt the system until NVIDIA publishes code and weights together with explicit terms. When they do, verify the license on the official repository and model card before integrating SANA Video into a production pipeline.
Sources
Project page: https://nvlabs.github.io/Sana/Video/
Paper (PDF): https://arxiv.org/pdf/2509.24695