SANA-Streaming: Real-Time Video Editing at 24 FPS on a Single Consumer GPU

June 1, 2026

NVIDIA published a project page and arXiv paper for SANA-Streaming on May 28, 2026. SANA-Streaming is a streaming video editing system that applies stylistic and appearance edits at 24 frames per second end to end on a single RTX 5090 at 1280×704 resolution. The paper describes a Hybrid Diffusion Transformer architecture optimized for consumer GPU hardware.

SANA-Streaming teaser: real-time video editing at 24 FPS on a single RTX 5090

What It Does

SANA-Streaming takes a video stream as input and applies text-guided appearance or style edits to each frame continuously, producing edited output without pausing for batch processing. The system targets applications where latency matters: live broadcasting, interactive production, and on-set monitoring, where a director wants to preview how a scene reads under different visual treatments before committing to them in post.

The DiT core of the system runs at 58 FPS on the same hardware. The gap between that core speed and the 24 FPS end-to-end figure reflects overhead outside the DiT: preprocessing, the encoding and decoding pipeline, and the system codesign components.
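
The two throughput figures can be turned into a rough per-frame time budget. The sketch below uses only the 58 FPS and 24 FPS numbers from the paper. It assumes the stages run one after another, so that per-frame times simply add; if the real pipeline overlaps stages, the split would look different.

```python
# Rough per-frame time budget implied by the two reported throughput figures.
DIT_FPS = 58.0         # DiT core throughput reported in the paper
END_TO_END_FPS = 24.0  # full-pipeline throughput reported in the paper


def ms_per_frame(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


dit_ms = ms_per_frame(DIT_FPS)
total_ms = ms_per_frame(END_TO_END_FPS)

# Assumption (not from the paper): stages are sequential, so the time spent
# outside the DiT is the plain difference between the two figures.
other_ms = total_ms - dit_ms
dit_share = dit_ms / total_ms

print(f"DiT core:        {dit_ms:.1f} ms per frame")
print(f"End to end:      {total_ms:.1f} ms per frame")
print(f"Everything else: {other_ms:.1f} ms per frame")
print(f"DiT share of the budget: {dit_share:.0%}")
```

Under that assumption, the DiT accounts for well under half of each frame's time, which is why the work around the model matters as much as the model itself.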

Three Technical Components

The paper describes three core contributions.

Hybrid Diffusion Transformer. Standard linear attention is efficient but misses fine local detail. SANA-Streaming introduces softmax attention in a subset of the transformer blocks while keeping linear attention in the rest. This gives the model stronger local spatial modeling without abandoning the efficiency of the linear architecture used in earlier SANA releases.
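
The trade-off between the two attention types can be sketched in a few lines of NumPy. This is an illustration of the general idea, not the paper's architecture: the ReLU feature map, the number of blocks, and the choice of which blocks use softmax (`softmax_every`) are all assumptions made for the example, and projections, normalization, and MLP layers are left out.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention. Builds an (N, N) score matrix, so cost is quadratic in N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with a ReLU feature map (assumed). Cost is linear in N."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    kv = k.T @ v                    # (D, D) summary, independent of token count
    normalizer = q @ k.sum(axis=0)  # (N,) per-token normalization
    return (q @ kv) / (normalizer[:, None] + eps)


def hybrid_stack(x, num_blocks=8, softmax_every=4):
    """Toy block stack: softmax attention in every `softmax_every`-th block,
    linear attention elsewhere. The placement rule is hypothetical."""
    for i in range(num_blocks):
        use_softmax = (i + 1) % softmax_every == 0
        attend = softmax_attention if use_softmax else linear_attention
        x = x + attend(x, x, x)     # residual connection only
    return x


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels
out = hybrid_stack(tokens)
print(out.shape)
```

The linear blocks never materialize the token-by-token score matrix, which is where the efficiency comes from; the softmax blocks pay that cost in exchange for sharper, token-specific weighting.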

SANA-Streaming architecture overview showing the Hybrid Diffusion Transformer design. Source: NVIDIA NVLabs

Cycle-Reverse Regularization. Training a video editing model typically requires large collections of paired videos: an original and an edited version of the same content. SANA-Streaming introduces Cycle-Reverse Regularization, a flow matching strategy that enforces semantic consistency without needing those paired training datasets. The model learns to preserve the underlying structure of the input while applying the requested edits.
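
The article does not give the formulation of Cycle-Reverse Regularization, so the sketch below shows only the generic idea of a cycle-consistency check on a flow: integrate a velocity field forward to get an "edited" sample, integrate it in reverse, and penalize the distance from the original. The function names, the Euler integrator, and both toy velocity fields are hypothetical and are not taken from the paper. The point of the example is that such a loss needs only the source frames, not paired edited videos.

```python
import numpy as np


def euler_integrate(velocity, x, t_start, t_end, steps=8):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with Euler steps."""
    dt = (t_end - t_start) / steps  # negative when integrating in reverse
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def cycle_consistency_loss(velocity, source, steps=8):
    """Forward pass (source to edited), reverse pass (edited to recovered),
    then mean squared error between recovered and source."""
    edited = euler_integrate(velocity, source, 0.0, 1.0, steps)
    recovered = euler_integrate(velocity, edited, 1.0, 0.0, steps)
    return float(np.mean((recovered - source) ** 2))


rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 4))  # stand-in for one latent frame


def shift_field(x, t):
    """Toy field: constant velocity, so the reverse pass undoes the forward pass."""
    return np.full_like(x, 0.5)


def pull_field(x, t):
    """Toy field: pulls toward zero; the Euler reverse pass does not fully undo it."""
    return -x


loss_shift = cycle_consistency_loss(shift_field, frame)
loss_pull = cycle_consistency_loss(pull_field, frame)
print(loss_shift, loss_pull)
```

The first field round-trips cleanly and gets a loss near zero, while the second loses information on the way and is penalized. A regularizer of this general shape would push a model toward edits that keep the input recoverable.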

System Codesign. The third contribution targets hardware specifically. The system uses fused GDN kernels and mixed precision quantization designed for NVIDIA's Blackwell GPU architecture (the RTX 50 series, which includes the RTX 5090). This is the layer that turns what would otherwise be a powerful research model into something that runs at usable speeds on a single GPU rather than a cluster.
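
Mixed precision quantization generally means storing different layers at different bit widths depending on how much accuracy each can afford to lose. The sketch below illustrates that idea with a simple symmetric integer quantizer and an error tolerance. It is a stand-in, not the paper's scheme: the integer grid, the candidate bit widths, the tolerance, and the layer names are all assumptions, and the actual number formats SANA-Streaming uses on Blackwell are not stated in this article.

```python
import numpy as np


def quantize_symmetric(w, bits):
    """Quantize to a signed integer grid with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


def relative_error(w, w_hat):
    """Relative L2 error between a tensor and its quantized version."""
    return float(np.linalg.norm(w - w_hat) / (np.linalg.norm(w) + 1e-12))


def choose_precision(layers, tolerance=0.05, candidate_bits=(4, 8)):
    """Pick the narrowest candidate width whose error stays within tolerance.
    Layers that fail every candidate stay at 16 bits (hypothetical fallback)."""
    plan = {}
    for name, w in layers.items():
        plan[name] = 16
        for bits in candidate_bits:
            if relative_error(w, quantize_symmetric(w, bits)) <= tolerance:
                plan[name] = bits
                break
    return plan


rng = np.random.default_rng(0)
outlier = rng.standard_normal((64, 64))
outlier[0, 0] = 200.0  # one large value stretches the shared scale

layers = {
    "ternary": rng.integers(-1, 2, size=(64, 64)).astype(float),
    "gaussian": rng.standard_normal((64, 64)),
    "outlier": outlier,
}
plan = choose_precision(layers)
print(plan)
```

Layers with a narrow value range tolerate very low precision, typical dense weights need a wider format, and layers with outliers are the ones worth keeping at higher precision.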

SANA-Streaming video editing results comparing input and output frames across different style prompts. Source: NVIDIA NVLabs

Performance

The headline benchmark from the paper is 24 end-to-end FPS at 1280×704 resolution on a single RTX 5090. The DiT core alone reaches 58 FPS at the same resolution on the same hardware. The paper also reports improved temporal coherence over existing methods. The abstract does not give those comparisons as numerical tables; side-by-side comparisons are available on the project page.

The model was trained using 32 NVIDIA H100 GPUs. The paper is authored by Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu, Junsong Chen, Tian Ye, Haozhe Liu, Enze Xie, and Song Han.

License and Availability

The project page and arXiv paper are live. Code release is listed as "coming soon" in the NVlabs/Sana GitHub repository, where SANA-Streaming appears on the project's planned releases list. When released, the code is expected to fall under the Apache 2.0 license that governs the parent NVlabs/Sana repository, which permits commercial use.

SANA-Streaming extends a line of NVIDIA open-source video models that includes SANA-Video, a high-resolution video generation model, and SANA-WM, a world model for minute-scale 720p generation. The streaming model addresses a different problem: real-time editing of existing video rather than generation from scratch.

Filmmakers who want to work with AI video generation and editing tools can access them in AI FILMS Studio now.

Sources

arXiv: SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
GitHub: NVlabs/Sana
Project page: nvlabs.github.io/Sana/Streaming/
