PiD: NVIDIA Pixel Diffusion for Fast High Resolution Image Decoding

May 24, 2026

On May 22, 2026, researchers at NVIDIA's Spatial Intelligence Lab released PiD, a pixel diffusion decoder that replaces the VAE decoder in latent diffusion pipelines. PiD reformulates decoding as a conditional diffusion process that denoises directly in pixel space, unifying decoding and upsampling into a single generative module.

The paper (arXiv 2605.23902) comes from Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, and Xuanchi Ren at NVIDIA.

Figure: PiD converting latent representations directly to high-resolution pixels. Source: NVIDIA Research.

What PiD Does

Most text-to-image pipelines generate images in latent space, then decode them to pixels with a variational autoencoder (VAE). At high resolutions, the VAE decode step becomes the primary performance bottleneck, and the encode and decode round trip softens fine detail.

PiD removes the VAE from this step. A sigma-aware adapter injects noise-corrupted versions of the source latent into the diffusion process, conditioning each denoising step on the latent content. The model is distilled with DMD2, so the decode completes in four inference steps.

Speed and Resolution Numbers

On an RTX 5090, PiD converts 512×512 latents to 2048×2048 pixels in under 1 second using 13 GB of peak memory. On an NVIDIA GB200, latency drops to 210 milliseconds.

Against SeedVR2, a cascaded diffusion baseline, PiD runs 5.9× faster: 211.2 ms versus 1,237.5 ms per image at the same 2K output resolution on identical hardware.
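The 5.9× figure follows directly from the two latencies quoted above. A minimal check, using only the numbers in this article (the sequential throughput values are derived here and are not reported by NVIDIA):

```python
# Latencies quoted in the article (milliseconds per image at 2K output).
PID_MS = 211.2
SEEDVR2_MS = 1237.5


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def images_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-image latency."""
    return 1000.0 / latency_ms


ratio = speedup(SEEDVR2_MS, PID_MS)
print(f"speedup: {ratio:.1f}x")                               # speedup: 5.9x
print(f"PiD: {images_per_second(PID_MS):.2f} img/s")          # PiD: 4.73 img/s
print(f"SeedVR2: {images_per_second(SEEDVR2_MS):.2f} img/s")  # SeedVR2: 0.81 img/s
```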

How the Architecture Works

The sigma-aware adapter takes a noise-corrupted version of the source latent as input at each denoising step, giving the model continuous access to the original encoded content throughout the process. This combines latent semantics with pixel-level detail generation in a single model, rather than in separate decode and upsample stages.
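To make the idea concrete, here is a toy sketch of a four-step conditional sampling loop on a 1-D signal. It is not NVIDIA's implementation. The sigma schedule, the nearest-neighbour upsampling, and the `toy_denoiser` blend are all assumptions that stand in for the learned network and adapter, which this article does not describe in detail. The only parts taken from the description above are the structure of the loop: four steps, and a latent re-corrupted at each step's noise level and used as conditioning.

```python
import random

SIGMAS = [1.0, 0.6, 0.3, 0.1]  # hypothetical four-step schedule, not from the paper
SCALE = 4                      # 4x upsampling, as in the 512 -> 2048 example


def corrupt(latent, sigma, rng):
    """Add Gaussian noise of scale sigma to the source latent."""
    return [z + sigma * rng.gauss(0.0, 1.0) for z in latent]


def upsample(values, scale):
    """Nearest-neighbour upsampling of a 1-D signal."""
    return [v for v in values for _ in range(scale)]


def toy_denoiser(x, cond, sigma):
    """Placeholder for the learned network: blend sample and conditioning.

    The higher the noise level, the less the current sample is trusted.
    """
    trust_x = 1.0 / (1.0 + sigma * sigma)
    return [trust_x * xi + (1.0 - trust_x) * ci for xi, ci in zip(x, cond)]


def decode(latent, seed=0):
    rng = random.Random(seed)
    n = len(latent) * SCALE
    # Start from pure noise in "pixel" space at the largest sigma.
    x = [SIGMAS[0] * rng.gauss(0.0, 1.0) for _ in range(n)]
    for i, sigma in enumerate(SIGMAS):
        # Condition this step on a latent corrupted at the same noise level.
        cond = upsample(corrupt(latent, sigma, rng), SCALE)
        x0_hat = toy_denoiser(x, cond, sigma)
        # Re-noise to the next level; the last step returns the clean estimate.
        sigma_next = SIGMAS[i + 1] if i + 1 < len(SIGMAS) else 0.0
        x = [v + sigma_next * rng.gauss(0.0, 1.0) for v in x0_hat]
    return x


pixels = decode([0.2, -0.5, 0.9, 0.0])
print(len(pixels))  # 16 samples: 4 latent values upsampled 4x
```

In a real decoder the blend would be a neural network operating on images. The sketch only shows where the latent enters the loop and why the step count fixes the number of network evaluations.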

PiD is compatible with multiple latent spaces: FLUX.1, FLUX.2, SD3, Z-Image, and RAE. It can be added to existing pipelines without retraining the upstream generation model.
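One way to picture "no retraining upstream" is to treat the decoder as a swappable last stage. The sketch below uses invented stand-in functions (`frozen_generator`, `vae_like_decoder`, `pid_like_decoder`, `run_pipeline`). None of these names come from the PiD codebase, and the real integration may look different.

```python
from typing import Callable, List

Latent = List[float]
Pixels = List[float]
Decoder = Callable[[Latent], Pixels]


def frozen_generator(prompt: str) -> Latent:
    """Stand-in for an upstream text-to-image model; it is never modified."""
    return [float(len(word)) for word in prompt.split()]


def vae_like_decoder(latent: Latent) -> Pixels:
    """Stand-in for a conventional decoder: one output value per latent value."""
    return list(latent)


def pid_like_decoder(latent: Latent, scale: int = 4) -> Pixels:
    """Stand-in for a decoder that also upsamples (nearest-neighbour here)."""
    return [v for v in latent for _ in range(scale)]


def run_pipeline(prompt: str, decoder: Decoder) -> Pixels:
    """Generate a latent, then hand it to whichever decoder was supplied."""
    return decoder(frozen_generator(prompt))


baseline = run_pipeline("a red fox", vae_like_decoder)
swapped = run_pipeline("a red fox", pid_like_decoder)
print(len(baseline), len(swapped))  # 3 12
```

The generator is untouched in both calls; only the function passed as `decoder` changes.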

4K Decoding in Action

Figure: PiD decoding latents to 4K resolution. Source: NVIDIA Research.

PiD supports 4× and 8× upscaling from the source latent resolution. The four-step distilled model handles this at the latencies above, making 4K decoding at this speed practical on consumer hardware for the first time.
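As a quick sanity check on what those factors mean, the snippet below applies them to the 512×512 source size from the speed example. Using that same source size for the 8× case is an assumption, since the article does not state the input resolution behind the 4K results.

```python
SOURCE_PX = 512  # source size from the article's 512x512 -> 2048x2048 example


def upscaled_side(source_px: int, factor: int) -> int:
    """Side length after upscaling each dimension by `factor`."""
    return source_px * factor


for factor in (4, 8):
    side = upscaled_side(SOURCE_PX, factor)
    # Pixel count grows with the square of the per-side factor.
    growth = (side * side) // (SOURCE_PX * SOURCE_PX)
    print(f"{factor}x: {side}x{side} pixels, {growth}x as many pixels as the source")
# 4x: 2048x2048 pixels, 16x as many pixels as the source
# 8x: 4096x4096 pixels, 64x as many pixels as the source
```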

Open Source Access

PiD is open source. Code is on GitHub at nv-tlabs/PiD and model weights are on Hugging Face under nvidia/PiD. Commercial use is permitted.

For filmmakers working with AI-generated images, this approach addresses a long-standing quality ceiling in latent decoding. Other recent work attacks the same limits from different directions. HiDream-O1-Image removes the VAE from the generation architecture entirely. DyPE takes a third approach, enabling 4K output from existing models through training-free position extrapolation. On the training side, Microsoft's Lens demonstrates how dense-caption datasets cut foundation model compute requirements to 19% of the standard baseline.

Sources

arXiv: PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
Project Page: NVIDIA SIL: PiD
GitHub: nv-tlabs/PiD
Hugging Face: nvidia/PiD