SANA Video | small, fast text to video with linear attention

SANA Video is a video generation model from NVLabs targeting clips up to 720×1280 resolution and approximately one minute in duration. The design centers on a Linear Diffusion Transformer that replaces standard quadratic attention with a linear variant, which keeps memory cost from growing out of control at the sequence lengths required for long video.
The project comes from the same NVLabs group that produced the SANA image generation models. Where those models applied linear attention to high resolution image generation, SANA Video extends the same core architectural idea to the temporal dimension.
Why Standard Attention Fails at Long Video
Transformer attention computes relationships between every token pair in a sequence. The memory and compute cost scales quadratically with sequence length. Double the sequence and the attention cost quadruples. For images, that cost is manageable because a 1024×1024 image produces a fixed number of spatial tokens. For video, adding temporal frames multiplies the token count by the number of frames, pushing the quadratic cost into territory that ordinary GPUs cannot handle.
A 5 second 24 fps video at 720p resolution contains 120 frames, each contributing thousands of spatial tokens. Materialized in full, the pairwise attention matrix for that sequence does not fit in the VRAM of most consumer cards, which is one reason most open source video models cap at 4 to 10 seconds and lower resolutions.
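To put rough numbers on that claim, here is a minimal sketch of the token arithmetic. The 16 pixel stride per token side and the absence of temporal compression are assumptions chosen for illustration, not SANA Video's published configuration, and the helper names are hypothetical.

```python
# Illustrative layout constants (assumptions, not SANA Video's real settings).
SPATIAL_STRIDE = 16  # assumed pixels per token side, e.g. 8x VAE times 2x2 patch
BYTES_FP16 = 2


def token_count(seconds: float, fps: int, height: int, width: int,
                stride: int = SPATIAL_STRIDE) -> int:
    """Tokens in a clip, assuming one token grid per frame (no temporal compression)."""
    frames = round(seconds * fps)
    return frames * (height // stride) * (width // stride)


def attention_matrix_bytes(tokens: int, bytes_per_entry: int = BYTES_FP16) -> int:
    """Size of one fully materialized tokens x tokens score matrix for a single head."""
    return tokens * tokens * bytes_per_entry


if __name__ == "__main__":
    for seconds in (5, 10, 60):
        n = token_count(seconds, fps=24, height=720, width=1280)
        gib = attention_matrix_bytes(n) / 2**30
        print(f"{seconds:>3} s: {n:>10,} tokens, {gib:>12,.1f} GiB per head")
```

Under these assumptions the 5 second clip comes to 432,000 tokens, and doubling the duration quadruples the matrix size. Fused attention kernels avoid storing the full matrix, but the compute still grows quadratically.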
Linear attention addresses this by replacing the full pairwise computation with one whose cost grows linearly with sequence length. Precision is traded for scalability: you lose some ability to model fine grained global relationships, but memory no longer scales quadratically.
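One common way to get linear scaling, shown here as a sketch and not as SANA Video's exact formulation, is to replace the softmax with a feature map φ and reorder the matrix products. The ReLU feature map, the small epsilon, and all function names below are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def relu_features(m):
    """Assumed feature map: elementwise ReLU plus a small epsilon so that
    the normalizer stays strictly positive."""
    return [[max(x, 0.0) + 1e-6 for x in row] for row in m]


def quadratic_form(q, k, v):
    """(phi(Q) phi(K)^T) V with row normalization. Builds an n x n matrix."""
    fq, fk = relu_features(q), relu_features(k)
    scores = matmul(fq, transpose(fk))             # n x n
    out = matmul(scores, v)                        # n x d_v
    return [[x / sum(s) for x in o] for o, s in zip(out, scores)]


def linear_form(q, k, v):
    """phi(Q) (phi(K)^T V). Only d x d_v and d-sized summaries are built."""
    fq, fk = relu_features(q), relu_features(k)
    kv = matmul(transpose(fk), v)                  # d x d_v summary
    k_sum = [sum(col) for col in zip(*fk)]         # d-sized normalizer summary
    out = matmul(fq, kv)                           # n x d_v
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in o] for o, z in zip(out, norms)]


if __name__ == "__main__":
    q = [[0.2, -0.4], [1.0, 0.3], [-0.7, 0.9], [0.5, 0.5]]
    k = [[0.1, 0.8], [-0.3, 0.6], [0.9, -0.2], [0.4, 0.4]]
    v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
    a, b = quadratic_form(q, k, v), linear_form(q, k, v)
    diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(f"max difference between the two orderings: {diff:.2e}")
```

Both functions return the same values up to floating point error, but only the first builds an n × n matrix. The result is not identical to softmax attention, and that gap is the precision trade described above.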
Block Linear Attention
SANA Video implements block linear attention, which divides the sequence into blocks instead of applying a single linear approximation uniformly across the whole sequence. This preserves local precision within each block while keeping the overall cost linear in sequence length.
Sequences are divided into segments, and full attention runs within each segment. Across segments, linear attention handles the long range dependencies. The block size is a hyperparameter that controls the tradeoff between local precision and overall memory budget.
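The block size tradeoff can be illustrated with a rough cost model for the scheme as described here. The multiply-accumulate formulas are textbook estimates for a single head, and the token count, block sizes, and head dimension are placeholder values, not figures from the paper.

```python
def hybrid_attention_cost(tokens: int, block: int, head_dim: int) -> dict:
    """Rough multiply-accumulate counts for one attention head.

    local:  full attention inside each block of `block` tokens
    linear: linear attention across the whole sequence
    full:   quadratic attention over the whole sequence, as a baseline
    """
    if tokens % block != 0:
        raise ValueError("tokens must be a multiple of block in this sketch")
    blocks = tokens // block
    local = blocks * block * block * head_dim
    linear = tokens * head_dim * head_dim
    full = tokens * tokens * head_dim
    return {
        "local": local,
        "linear": linear,
        "full": full,
        "ratio_vs_full": (local + linear) / full,
    }


if __name__ == "__main__":
    TOKENS = 432_000   # placeholder sequence length
    HEAD_DIM = 64      # placeholder head dimension
    for block in (3_600, 14_400, 432_000):
        cost = hybrid_attention_cost(TOKENS, block, HEAD_DIM)
        print(f"block={block:>7,}: {cost['ratio_vs_full']:.4f} of full attention cost")
```

Smaller blocks cut the cost sharply, while a block as long as the whole sequence brings back the full quadratic cost.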
This hybrid approach is meaningfully different from pure linear attention. Models that apply linear approximations globally can lose fine grained texture detail and short range motion coherence. The block structure keeps motion within short windows accurate while making the full minute length sequence tractable.
The paper reports favorable latency comparisons against other small models at similar resolutions. Specific benchmark numbers are in the preprint and depend on the hardware configuration used.
The Constant Memory State
On top of the block linear architecture, the authors introduce a constant memory state. Standard attention caches key and value tensors for every position in the sequence, and those caches grow as generation progresses. At long sequence lengths, the cache can take more memory than the model weights themselves.
The constant memory state replaces the growing cache with a fixed representation that summarizes global context at steady memory cost. Instead of tracking every past frame explicitly, the model maintains a running summary that captures accumulated context.
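A running summary of this kind can be sketched for a single linear attention head as follows. This is a generic recurrent formulation of causal linear attention under an assumed ReLU feature map, with hypothetical class and method names; the state used in the paper may differ.

```python
class LinearAttentionState:
    """Fixed-size running summary for causal linear attention (single head)."""

    def __init__(self, key_dim: int, value_dim: int):
        # Running sum of phi(k)^T v over all tokens seen so far.
        self.kv = [[0.0] * value_dim for _ in range(key_dim)]
        # Running sum of phi(k), used as the normalizer.
        self.k_sum = [0.0] * key_dim

    @staticmethod
    def phi(x):
        """Assumed feature map: ReLU plus a small epsilon."""
        return [max(t, 0.0) + 1e-6 for t in x]

    def update(self, k, v):
        """Fold one new key/value pair into the summary. Nothing is appended."""
        fk = self.phi(k)
        for i, ki in enumerate(fk):
            self.k_sum[i] += ki
            for j, vj in enumerate(v):
                self.kv[i][j] += ki * vj

    def read(self, q):
        """Attend from query q to everything folded in so far."""
        fq = self.phi(q)
        norm = sum(a * b for a, b in zip(fq, self.k_sum))
        value_dim = len(self.kv[0])
        return [sum(fq[i] * self.kv[i][j] for i in range(len(fq))) / norm
                for j in range(value_dim)]

    def num_floats(self) -> int:
        """Number of stored values, independent of how many tokens were seen."""
        return len(self.kv) * len(self.kv[0]) + len(self.k_sum)


if __name__ == "__main__":
    state = LinearAttentionState(key_dim=2, value_dim=2)
    for step in range(1, 10_001):
        state.update([0.1 * (step % 7), 0.3], [1.0, float(step % 3)])
        if step in (10, 10_000):
            print(f"after {step:>6} tokens: {state.num_floats()} stored floats")
```

The stored size is the same after ten tokens as after ten thousand, which is the property that keeps memory flat as the clip gets longer.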
The practical effect is that generating a 60 second sequence does not require proportionally more VRAM than generating a 5-second sequence. Peak memory use for the state component is the same regardless of output duration.
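The difference in growth can be estimated with simple arithmetic. The layer count, head count, head dimension, and tokens per second below are placeholder values, not SANA Video's configuration, and the function names are hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values stored for every cached token, in every layer and head."""
    return 2 * tokens * layers * heads * head_dim * bytes_per_value


def linear_state_bytes(layers: int, heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """One head_dim x head_dim summary plus one head_dim normalizer per head."""
    return layers * heads * (head_dim * head_dim + head_dim) * bytes_per_value


if __name__ == "__main__":
    LAYERS, HEADS, HEAD_DIM = 20, 16, 64   # placeholder model shape
    TOKENS_PER_SECOND = 24 * 3_600         # placeholder: 24 fps, 3,600 tokens per frame
    state_mib = linear_state_bytes(LAYERS, HEADS, HEAD_DIM) / 2**20
    for seconds in (5, 60):
        tokens = seconds * TOKENS_PER_SECOND
        cache_gib = kv_cache_bytes(tokens, LAYERS, HEADS, HEAD_DIM) / 2**30
        print(f"{seconds:>2} s: KV cache {cache_gib:8.1f} GiB, "
              f"fixed state {state_mib:.1f} MiB")
```

The cache term scales with clip duration while the fixed state does not depend on it at all.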
This is the architectural basis for the claimed minute length generation capability. Other video diffusion models that generate long sequences typically do so by generating short clips and concatenating them, which introduces consistency challenges at boundaries between clips.
Resolution and Duration Targets
The project page describes a target of up to 720×1280 resolution and approximately one minute duration. The primary benchmark uses an RTX 5090 with NVFP4 precision, which is a low-precision format available on NVIDIA's Blackwell architecture GPUs.
NVFP4 is a four-bit floating point format introduced with Blackwell. It takes roughly half the memory of FP8 and a quarter of FP16, allowing larger or longer models to run on a given GPU. The RTX 5090, released in early 2025, is part of the first consumer GPU generation with tensor cores that support NVFP4.
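The footprint ratios are easy to check with a small calculation. The 2 billion parameter count is a placeholder and not a figure from this article. The optional scale overhead assumes one 8-bit scale factor per block of 16 values, which is how NVIDIA describes NVFP4; treat that detail as an assumption here.

```python
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_gib(params: int, fmt: str, scale_bits: int = 0, block: int = 16) -> float:
    """Weight storage in GiB.

    scale_bits / block models an optional per-block scale factor
    (assumed: 8 bits per 16 values for NVFP4).
    """
    bits = BITS_PER_WEIGHT[fmt] + scale_bits / block
    return params * bits / 8 / 2**30


if __name__ == "__main__":
    PARAMS = 2_000_000_000  # placeholder model size
    for fmt in ("fp16", "fp8", "fp4"):
        print(f"{fmt:>5}: {weight_gib(PARAMS, fmt):.2f} GiB")
    with_scales = weight_gib(PARAMS, "fp4", scale_bits=8)
    print(f"fp4 with per-block scales: {with_scales:.2f} GiB")
```

With scale factors included the four-bit format costs about 4.5 bits per weight, so the savings over FP8 are slightly less than a clean half.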
On an RTX 5090 with NVFP4, the paper reports generation latency that it positions favorably against other small models. The specific figures are in the preprint. No Ada or Ampere benchmark appears alongside the initial release.
Hardware and Precision Requirements
The NVFP4 requirement means the published benchmark only applies to Blackwell GPUs. RTX 5090 owners see the best case performance. Owners of Ada GPUs (RTX 40 series) can still run FP8 precision, which that architecture supports in hardware. Ampere GPUs (RTX 30 series) have no FP8 tensor cores and fall back to FP16 or BF16. Neither generation supports NVFP4.
An Ada GPU running SANA Video in FP8 should work but will not match the RTX 5090 benchmark. The gap depends on the specific model and hardware configuration. The paper does not publish Ada benchmarks, so the practical inference time on an RTX 4090 is not documented at this stage.
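A pipeline that has to run on mixed hardware could pick its precision from the CUDA compute capability. The helper below is a hypothetical sketch, not part of any SANA release. It assumes FP8 tensor core support starts with Ada (compute capability 8.9) and FP4 support starts with Blackwell (10.0 and above, with the RTX 5090 reporting 12.0).

```python
def pick_precision(major: int, minor: int) -> str:
    """Lowest-precision format assumed to have tensor core support
    for a given CUDA compute capability."""
    cc = (major, minor)
    if cc >= (10, 0):   # Blackwell, including RTX 50 series at 12.0
        return "nvfp4"
    if cc >= (8, 9):    # Ada (8.9) and Hopper (9.0)
        return "fp8"
    if cc >= (7, 0):    # Volta, Turing, Ampere: FP16 tensor cores
        return "fp16"
    return "fp32"       # older GPUs without tensor cores


if __name__ == "__main__":
    examples = {"RTX 3090": (8, 6), "RTX 4090": (8, 9), "RTX 5090": (12, 0)}
    for name, cc in examples.items():
        print(f"{name}: {pick_precision(*cc)}")
```

With PyTorch installed, the tuple returned by torch.cuda.get_device_capability() could be passed straight into this function.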
The constant memory state and the block linear attention reduce VRAM requirements relative to a quadratic-attention model of the same parameter count. Even on older hardware, SANA Video should be more memory-efficient than a standard transformer video model targeting the same resolution.
Production Applications
The primary production use case for SANA Video, if the claims hold at release, is previsualization and story development at minute length. Current open source models require stitching multiple short clips to produce long sequences, and every clip boundary is a potential consistency failure point.
A single pass system that maintains coherence across 60 seconds, even at small model quality levels, changes that workflow. Editors and directors can evaluate a full scene beat in context rather than reviewing isolated 5-second segments and imagining how they connect.
At 720p, the resolution is adequate for internal review and pitching purposes but below final delivery specs for most exhibition formats. The intended use is iteration and direction, not finishing. Generating a minute long sequence that communicates a scene's timing, spatial relationships, and visual tone is genuinely useful at that stage even if the final frame quality needs polish.
The image-to-video capability documented in the paper extends the use case to shot extension and transition development. Starting from a reference frame, the model can generate the subsequent sequence while maintaining consistency with the starting image.
Relationship to the SANA Model Family
SANA Video follows two prior NVLabs releases. SANA 1.0 applied linear attention to image generation and produced high resolution images at competitive quality on smaller GPUs than standard diffusion models required. SANA 1.5 extended the parameter count and training to improve quality at the same resolution targets.
The transfer of the linear attention architecture from images to video is the core contribution of SANA Video. The block linear attention and constant memory state are adaptations that address the specific challenges of temporal sequences rather than spatial patches.
SANA-WM, also from NVLabs, targets world modeling at minute length with 6 DoF camera control and is available under Apache 2.0 with weights released. SANA-WM uses a different architecture focused on 3D consistent world representation rather than optimized inference speed. The two projects cover different points on the capability spectrum from the same research group.
Availability and What to Watch For
As of the September 2025 publication, no code or weights have been released. The project page states code is coming soon. No license has been posted, and there is no confirmed timeline for open source release.
When NVIDIA does release SANA Video, the license terms on the repository and model card are the authoritative source. NVLabs releases have varied between research only noncommercial licenses and more permissive terms. The license for the image SANA models was research only at initial release and later updated, so the video version may follow a similar pattern.
Watch the NVLabs GitHub organization and the HuggingFace nvlabs organization for the release. The benchmark hardware requirement (RTX 5090 + NVFP4) suggests the released model will be optimized for Blackwell, but FP8 inference on Ada and FP16 inference on both Ada and Ampere should be possible with somewhat longer generation times.
Until weights are available and independently evaluated, treat the benchmark figures as targets rather than verified results. Published latency numbers from authors reflect controlled conditions and optimal configurations that may not match typical user setups.
Generation Quality Characteristics
The paper includes sample outputs generated at 720×1280 that demonstrate the model's ability to maintain visual coherence across fast motion and scene transitions. Based on the published examples, the model produces motion that matches text prompts with reasonable subject stability across the sequence length.
Linear attention models typically trade some fine detail modeling capacity for the efficiency gain. In image generation, this manifests as slightly softer texture rendering compared to full attention models at equivalent resolution. In video, the same tradeoff affects high frequency motion detail, such as the fine movement of hair or fabric in wind driven sequences.
The block linear structure is intended to address this by preserving full attention precision within local windows. For scene-level coherence, such as a character walking through a room, the block structure keeps short range motion accurate. For very fine texture detail across a long sequence, some loss relative to a full attention baseline is expected.
At 720p, the resolution is modest enough that the detail loss is less perceptible than it would be at 1080p or higher. For previsualization and pitching purposes at 720p viewing sizes, the quality level should be adequate for communicating composition, timing, and motion intent.
How SANA Video Fits Open Source Workflows
As of September 2025, the primary open source options for long video generation require either chaining multiple short clips or running on hardware budgets that exclude most individual creators. Models like CogVideoX and Wan 2.1 produce good quality but require significant VRAM and still cap practical clip length well below a minute.
A model that can generate a 60 second sequence in a single pass on consumer hardware using Blackwell FP4 would represent a meaningful change in what is practical for individual and small team production workflows. The architectural claims are plausible given the linear attention approach, but they depend on what the released weights actually deliver under typical generation conditions.
When the code releases, the most useful community benchmarks will address what it actually produces at 720p for a 60 second sequence, what GPU is realistically needed, and how quality compares to chaining ten 6 second Wan 2.1 clips for the same subject. Those comparisons are what will determine where SANA Video fits in practice.
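A community benchmark of that kind needs little more than a consistent timing harness. The sketch below is one way to structure it. The generate callable is a hypothetical stand-in for whatever inference entry point the release provides, and fake_generate only sleeps.

```python
import statistics
import time


def benchmark(generate, clip_seconds: float, runs: int = 3, warmup: int = 1) -> dict:
    """Time a zero-argument callable that renders one clip.

    GPU work is asynchronous, so a real `generate` should block until
    the output is ready before returning.
    """
    for _ in range(warmup):
        generate()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return {
        "median_s": median,
        "min_s": min(timings),
        "max_s": max(timings),
        # Seconds of compute needed per second of generated video.
        "compute_s_per_video_s": median / clip_seconds,
    }


def fake_generate():
    """Stand-in workload so the harness can run without a model."""
    time.sleep(0.01)


if __name__ == "__main__":
    result = benchmark(fake_generate, clip_seconds=60.0)
    for key, value in result.items():
        print(f"{key}: {value:.4f}")
```

Reporting the GPU model, precision, resolution, and clip length next to these numbers is what makes results comparable across setups.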
The Blackwell first benchmark is an early indicator that NVLabs is optimizing for NVIDIA's newest hardware generation. The model may ship with performance profiles that work across older architectures as secondary targets, similar to how other NVLabs research releases have structured their inference code to run broadly but benchmark on optimal hardware.
For production teams deciding whether to wait for SANA Video or build pipelines on currently available models, the closest equivalent right now is SANA-WM, which covers world modeling at minute length under Apache 2.0 with released weights. It handles a different use case than SANA Video, but it shows what the NVLabs approach can produce today. If your use case does not depend on the specific efficiency tradeoffs of the linear attention approach, SANA-WM is the accessible option.
Keep an eye on the NVLabs and HuggingFace nvlabs pages for the release. When it drops, the first meaningful signal will be user reports from RTX 4000 and 3000 series owners who have tested actual generation quality and latency on hardware that predates the Blackwell benchmark setup.
Sources
Project page: nvlabs.github.io/Sana/Video
arXiv: 2509.24695
