SANA-Streaming: Real-Time Video Editing at 24 FPS on a Single Consumer GPU

NVIDIA released a project page and arXiv paper on May 28, 2026 for SANA-Streaming, a streaming video editing system that applies stylistic and appearance edits to video at 24 end-to-end frames per second on a single RTX 5090 at 1280×704 resolution. The paper describes a Hybrid Diffusion Transformer architecture optimized for consumer GPU hardware.
Figure: SANA-Streaming teaser showing real-time video editing at 24 FPS on a single RTX 5090.
What It Does
SANA-Streaming takes a video stream as input and applies text-guided appearance or style edits to each frame continuously, producing edited output without stopping for batch processing. The system is designed for applications where latency matters: live broadcasting, interactive production, and on-set monitoring workflows where a director wants to preview how a scene reads with different visual treatments before committing to them in post.
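The difference between streaming and batch processing can be sketched in a few lines of Python. This is a minimal illustration, not SANA-Streaming's pipeline: `camera`, `invert`, `stream_edit`, and `batch_edit` are hypothetical names, and a frame is reduced to a short list of pixel values.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[int]  # hypothetical stand-in for a decoded video frame


def stream_edit(frames: Iterable[Frame],
                edit_frame: Callable[[Frame], Frame]) -> Iterator[Frame]:
    """Yield each edited frame as soon as it is ready (streaming)."""
    for frame in frames:
        yield edit_frame(frame)


def batch_edit(frames: Iterable[Frame],
               edit_frame: Callable[[Frame], Frame]) -> List[Frame]:
    """Collect the whole clip first, then edit every frame (batch)."""
    clip = list(frames)
    return [edit_frame(frame) for frame in clip]


def invert(frame: Frame) -> Frame:
    """Toy edit: invert 8-bit pixel values."""
    return [255 - pixel for pixel in frame]


def camera(n_frames: int, log: List[str]) -> Iterator[Frame]:
    """Toy frame source that records when each frame is captured."""
    for i in range(n_frames):
        log.append(f"captured {i}")
        yield [i, i + 1, i + 2]


# Streaming: capture and display interleave, so latency is one frame.
stream_log: List[str] = []
for i, _ in enumerate(stream_edit(camera(3, stream_log), invert)):
    stream_log.append(f"displayed {i}")

# Batch: nothing is displayed until every frame has been captured.
batch_log: List[str] = []
for i, _ in enumerate(batch_edit(camera(3, batch_log), invert)):
    batch_log.append(f"displayed {i}")

print(stream_log)
print(batch_log)
```

In the streaming log each "captured" entry is followed immediately by its "displayed" entry, which is the property a live preview needs. In the batch log all captures come first.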
The DiT core of the system runs at 58 FPS on the same hardware. The gap between the core speed and the 24 FPS end-to-end figure reflects the overhead added by preprocessing, the encoding and decoding pipeline, and the system codesign components.
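As a rough reading of those two figures, the per-frame time budget can be worked out directly. The sketch below assumes the stages run strictly one after another with no overlap, which the article does not confirm, so the overhead number is an estimate rather than a measured value.

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


END_TO_END_FPS = 24.0  # reported end-to-end rate
DIT_CORE_FPS = 58.0    # reported DiT core rate
WIDTH, HEIGHT = 1280, 704

total_ms = frame_time_ms(END_TO_END_FPS)
core_ms = frame_time_ms(DIT_CORE_FPS)
# Assumption: sequential stages, so everything outside the core is overhead.
overhead_ms = total_ms - core_ms
core_share = core_ms / total_ms
pixels_per_second = WIDTH * HEIGHT * int(END_TO_END_FPS)

print(f"end-to-end budget: {total_ms:.1f} ms per frame")  # 41.7 ms
print(f"DiT core:          {core_ms:.1f} ms per frame")   # 17.2 ms
print(f"other stages:      {overhead_ms:.1f} ms per frame")  # 24.4 ms
print(f"core share:        {core_share:.0%}")             # 41%
print(f"pixel throughput:  {pixels_per_second:,} px/s")   # 21,626,880 px/s
```

Under that assumption the transformer accounts for well under half of each frame's budget, which is consistent with the article's emphasis on the surrounding pipeline.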
Three Technical Components
The paper describes three core contributions.
Hybrid Diffusion Transformer. Standard linear attention is efficient but misses fine local detail. SANA-Streaming introduces softmax attention in a subset of the transformer blocks while keeping linear attention in the rest. This gives the model stronger local spatial modeling without abandoning the efficiency of the linear architecture used in earlier SANA releases.
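The efficiency trade-off behind a hybrid stack can be illustrated with a simple operation count. This is a sketch under stated assumptions: the token count, head dimension, block count, and the "every fourth block" placement are hypothetical values chosen for illustration, and the article does not say which blocks use softmax attention. The count covers only the two matrix products in each attention variant and ignores normalization, multiple heads, and feed-forward layers.

```python
from typing import List


def softmax_attention_flops(n_tokens: int, dim: int) -> int:
    """Q K^T and weights @ V: two N x N x d products, 2 FLOPs per multiply-add."""
    return 4 * n_tokens * n_tokens * dim


def linear_attention_flops(n_tokens: int, dim: int) -> int:
    """K^T V and Q @ (K^T V): two N x d x d products, 2 FLOPs per multiply-add."""
    return 4 * n_tokens * dim * dim


def hybrid_schedule(n_blocks: int, softmax_every: int) -> List[str]:
    """Hypothetical placement: every `softmax_every`-th block uses softmax."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "linear"
            for i in range(n_blocks)]


def stack_flops(schedule: List[str], n_tokens: int, dim: int) -> int:
    """Total attention FLOPs for a stack of blocks."""
    cost = {"softmax": softmax_attention_flops(n_tokens, dim),
            "linear": linear_attention_flops(n_tokens, dim)}
    return sum(cost[kind] for kind in schedule)


# Hypothetical sizes, not taken from the paper.
N_TOKENS, DIM, N_BLOCKS, SOFTMAX_EVERY = 4096, 64, 20, 4

schedule = hybrid_schedule(N_BLOCKS, SOFTMAX_EVERY)
hybrid = stack_flops(schedule, N_TOKENS, DIM)
all_linear = stack_flops(["linear"] * N_BLOCKS, N_TOKENS, DIM)
all_softmax = stack_flops(["softmax"] * N_BLOCKS, N_TOKENS, DIM)

print(schedule.count("softmax"), "softmax blocks of", N_BLOCKS)
print(f"all linear : {all_linear:.3e} FLOPs")
print(f"hybrid     : {hybrid:.3e} FLOPs")
print(f"all softmax: {all_softmax:.3e} FLOPs")
```

Softmax attention scales with the square of the token count while linear attention scales with the token count times the square of the head dimension, so a per-block ratio of `n_tokens / dim` separates them. Limiting softmax to a few blocks keeps the total well below an all-softmax stack.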
Cycle-Reverse Regularization. Training a video editing model typically requires large collections of paired videos: an original and an edited version of the same content. SANA-Streaming introduces Cycle-Reverse Regularization, a flow matching strategy that enforces semantic consistency without needing those paired training datasets. The model learns to preserve the underlying structure of the input while applying the requested edits.
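The article does not give the loss itself, so the following is only a generic cycle-consistency toy and should not be read as the paper's formulation. It shows the general idea of checking an edit without a paired target: apply an edit, apply its reverse, and penalize the difference from the original. All function names and the one-dimensional "signal" are hypothetical.

```python
from typing import Callable, List

Signal = List[float]  # hypothetical stand-in for a video clip


def mse(a: Signal, b: Signal) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cycle_loss(x: Signal,
               forward_edit: Callable[[Signal], Signal],
               reverse_edit: Callable[[Signal], Signal]) -> float:
    """Penalty for failing to recover x after editing and then reversing."""
    return mse(x, reverse_edit(forward_edit(x)))


def brighten(x: Signal) -> Signal:
    """Toy edit that changes appearance but keeps structure."""
    return [v + 0.2 for v in x]


def darken(x: Signal) -> Signal:
    """Toy reverse of `brighten`."""
    return [v - 0.2 for v in x]


def flatten(x: Signal) -> Signal:
    """Toy edit that destroys structure by replacing values with the mean."""
    mean = sum(x) / len(x)
    return [mean for _ in x]


source = [0.1, 0.5, 0.3, 0.9]
structure_kept = cycle_loss(source, brighten, darken)
structure_lost = cycle_loss(source, flatten, darken)

print(f"structure-preserving edit: {structure_kept:.6f}")
print(f"structure-destroying edit: {structure_lost:.6f}")
```

The structure-preserving edit round-trips to almost exactly the original, while the flattening edit cannot be undone and is penalized. No edited reference clip is needed in either case, only the source.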
System Codesign. The third contribution targets hardware specifically. The system uses fused GDN kernels and mixed-precision quantization designed for NVIDIA's Blackwell GPU architecture, which the RTX 5090 uses. This is the layer that turns a powerful research model into something that runs at usable speeds on a single GPU rather than a cluster.
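As a minimal sketch of what mixed-precision quantization means in general, the example below quantizes some weight tensors to 8-bit integers and leaves others in floating point. The article does not state which numeric formats SANA-Streaming uses, so int8, the symmetric per-tensor scheme, and the layer names are all assumptions made for illustration. Fused kernels are not modeled here.

```python
from typing import Dict, List, Set, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def apply_mixed_precision(layers: Dict[str, List[float]],
                          keep_full_precision: Set[str]) -> Dict[str, tuple]:
    """Quantize every layer except those listed as precision-sensitive."""
    plan = {}
    for name, weights in layers.items():
        if name in keep_full_precision:
            plan[name] = ("float", weights)
        else:
            plan[name] = ("int8", quantize_int8(weights))
    return plan


# Hypothetical layer names and weights.
layers = {
    "attention.proj": [0.02, -0.51, 0.33, 1.27, -1.0],
    "output.norm": [1.001, 0.998, 1.003],
}
plan = apply_mixed_precision(layers, keep_full_precision={"output.norm"})

kind, (q, scale) = plan["attention.proj"]
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(layers["attention.proj"], restored))

print(kind, q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.5f}")
print(plan["output.norm"][0])
```

The round-trip error of the quantized layer stays within half a quantization step, while the layer kept in floating point is untouched. Choosing which layers tolerate lower precision is the "mixed" part of the scheme.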
Performance
The paper's headline benchmark is 24 end-to-end FPS at 1280×704 resolution on a single RTX 5090. The DiT core alone reaches 58 FPS at the same resolution on the same hardware. The paper also reports improved temporal coherence over existing methods; the side-by-side comparisons are on the project page, and the abstract does not present them as numerical tables.
The model was trained using 32 NVIDIA H100 GPUs. The paper is authored by Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu, Junsong Chen, Tian Ye, Haozhe Liu, Enze Xie, and Song Han.
License and Availability
The project page and arXiv paper are live. Code release is listed as "coming soon" in the NVLabs/Sana GitHub repository, where SANA-Streaming appears on the project's planned releases list. When released, the code is expected to fall under the Apache 2.0 license that governs the parent NVLabs/Sana repository, which permits commercial use.
SANA-Streaming extends a line of NVIDIA open-source video models that includes SANA-Video, a high-resolution video generation model, and SANA-WM, a world model for minute-scale 720p generation. The streaming model addresses a different problem: real-time editing of existing video rather than generation from scratch.

