L2P: 4K and 8K Image Generation Without a VAE

May 25, 2026

On May 12, 2026, researchers from NJU's PCaLab and Tencent YoutuResearch published L2P, a framework that converts pretrained latent diffusion models into pixel-space generators by removing the VAE entirely. The result is native 4K generation and zero-shot extrapolation to 8K from models that were trained at standard resolutions.

The paper, arXiv 2605.12013, comes from Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, and colleagues at Nanjing University and Tencent.

Removing the VAE Bottleneck

Latent diffusion models compress images into a compact latent representation using a variational autoencoder (VAE), run the diffusion process in that compressed space, then decode back to pixels. The VAE introduces a resolution ceiling because the decode step does not scale cleanly to arbitrary resolutions, and the encode step loses information that cannot be recovered at output.
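To make the compression concrete, here is a minimal sketch of the tensor shapes involved. The 8x spatial downsampling factor and 4 latent channels are assumptions typical of Stable Diffusion style VAEs, not figures from the L2P paper, and `latent_shape` and `compression_ratio` are hypothetical helper names used only for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent a VAE encoder would produce.

    downsample=8 and latent_channels=4 are assumed defaults, typical of
    Stable Diffusion style VAEs; other models use different values.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4, image_channels=3):
    """How many pixel values are squeezed into each latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    # Square test sizes chosen for illustration only.
    for side in (1024, 4096):
        print(side, latent_shape(side, side), compression_ratio(side, side))
```

Under these assumed settings, every latent value stands in for 48 pixel values regardless of resolution, which is the information the decoder has to reconstruct.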

L2P discards the VAE entirely. The model processes image content directly in pixel space, working at full resolution from start to finish. There is no encode and decode round trip.

How L2P Transfers Latent Models to Pixel Space

The transfer approach is designed to reuse as much of the existing pretrained model as possible. L2P freezes the intermediate layers of the source latent diffusion model and trains only shallow transformation layers that bridge latent and pixel space representations.
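The freezing scheme described above can be sketched without any deep learning framework. This is an illustration under stated assumptions, not the authors' implementation: the `Layer` record, the 24-block depth, the per-block parameter count, and the choice to treat the first and last two blocks as the "shallow" layers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    trainable: bool = True


def freeze_intermediate(layers, shallow_depth):
    """Keep the first and last `shallow_depth` layers trainable, freeze the rest."""
    n = len(layers)
    for i, layer in enumerate(layers):
        layer.trainable = i < shallow_depth or i >= n - shallow_depth
    return layers


def trainable_fraction(layers):
    """Fraction of all parameters that would still receive gradient updates."""
    total = sum(layer.num_params for layer in layers)
    trainable = sum(layer.num_params for layer in layers if layer.trainable)
    return trainable / total


if __name__ == "__main__":
    # Hypothetical 24-block transformer with equal-sized blocks.
    blocks = [Layer(f"block_{i}", 100_000_000) for i in range(24)]
    freeze_intermediate(blocks, shallow_depth=2)
    print(f"trainable fraction: {trainable_fraction(blocks):.3f}")
```

In PyTorch, the same idea is usually expressed by setting `requires_grad` to `False` on the parameters of the frozen blocks so the optimizer only updates the remaining layers.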

Large patch tokenization handles the shift from compressed latent tokens to full resolution pixel patches. Because only the transformation layers are trained, the process runs on 8 GPUs with what the authors describe as negligible computational overhead. Critically, L2P trains entirely on synthetic images generated by the source model itself. No real training data is required.
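A rough token count shows why large patches matter once the VAE is gone. The numbers below are assumptions for illustration: an 8x VAE with a 2x2 latent patch for the source model, a square 4096 pixel image, and a pixel patch size of 16 picked so the token counts line up. The post does not state the patch size L2P actually uses.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patch tokens for an image or latent grid."""
    if height % patch_size or width % patch_size:
        raise ValueError("size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def latent_tokens(height, width, vae_downsample=8, latent_patch=2):
    """Token count for a latent model: VAE downsampling, then patchify."""
    if height % vae_downsample or width % vae_downsample:
        raise ValueError("size must be divisible by vae_downsample")
    return num_tokens(height // vae_downsample, width // vae_downsample, latent_patch)


if __name__ == "__main__":
    side = 4096  # assumed square 4K-class image
    print("latent pipeline:   ", latent_tokens(side, side))
    print("pixel, patch of 2: ", num_tokens(side, side, 2))
    print("pixel, patch of 16:", num_tokens(side, side, 16))
```

With these assumed values, a 16 pixel patch gives the pixel-space model the same sequence length as the latent model, while a 2 pixel patch would produce 64 times as many tokens.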

4K Native Generation

Native 4K generation output from L2P. Courtesy NJU PCaLab and Tencent YoutuResearch.

Additional 4K output from L2P. Courtesy NJU PCaLab and Tencent YoutuResearch.

L2P generates 4K images natively, without the tiling or multi-pass compositing approaches often used to push latent models past their training resolution. On GenEval, the model reaches 93% of the score of the source latent diffusion model (LDM), indicating that the pixel-space transfer retains most of the generation capability of the original model.

Performance Numbers

8K zero-shot extrapolation from L2P. Courtesy NJU PCaLab and Tencent YoutuResearch.

Single step inference at 4K runs 97.67% faster than the source latent diffusion model at the same resolution. The speed gain comes from removing the VAE decode stage and the architectural changes that come with processing in pixel space directly.
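A figure like "97.67% faster" can be read in two ways, and the difference is large. The sketch below works out both readings; which one the authors intend is not stated in this post, so neither result should be taken as the paper's reported speedup. The function names are hypothetical.

```python
def speedup_from_latency_reduction(reduction_pct):
    """Speedup factor if `reduction_pct` is the percentage of latency removed."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def speedup_from_throughput_gain(gain_pct):
    """Speedup factor if `gain_pct` is the percentage increase in throughput."""
    return 1.0 + gain_pct / 100.0


if __name__ == "__main__":
    pct = 97.67
    print(f"as latency reduction: {speedup_from_latency_reduction(pct):.1f}x")
    print(f"as throughput gain:   {speedup_from_throughput_gain(pct):.2f}x")
```

Read as a latency reduction, the figure implies roughly a 43x speedup; read as a throughput gain, it implies just under 2x.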

8K generation works through zero-shot extrapolation: the model was not trained at 8K but generalizes to that resolution from its native 4K capability. Results are shown on the NJU PCaLab project page. DPG-Bench scores are on par with the source LDM, indicating faithful prompt adherence at these resolutions.

Access and License

L2P code is available on GitHub at TencentYoutuResearch/T2I-L2P. Weights are hosted on Hugging Face. The repository does not specify an explicit open source license, so commercial use terms should be confirmed directly with the research team before deployment.

For filmmakers using AI-generated images in production, L2P represents a direct path from existing latent diffusion checkpoints to pixel-space generators capable of 4K and 8K output. NVIDIA's PiD decoder solves the same high-resolution problem from the decode end, replacing only the VAE decode step while keeping the latent generation pipeline. Microsoft's Lens approaches the efficiency question from the training side, reaching competitive results at 3.8 billion parameters on 19.3% of standard compute.

For open source pixel native generation at 8 billion parameters under MIT license, HiDream-O1-Image and the DyPE method offer additional approaches to high resolution output without VAE constraints.

To work with the latest image generation models in a browser without local setup, AI FILMS Studio provides cloud access to the image workspace.

Sources

arXiv: L2P: Unlocking Latent Potential for Pixel Generation
Project Page: NJU PCaLab: L2P
GitHub: TencentYoutuResearch/T2I-L2P
