L2P: 4K and 8K Image Generation Without a VAE
On May 12, 2026, researchers from NJU's PCaLab and Tencent YoutuResearch published L2P, a framework that converts pretrained latent diffusion models into pixel space generators by removing the VAE entirely. The result is native 4K generation and 8K zero shot extrapolation from models that were trained at standard resolutions.
The paper, arXiv 2605.12013, comes from Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, and colleagues at Nanjing University and Tencent.
Removing the VAE Bottleneck
Latent diffusion models compress images into a compact latent representation using a Variational Autoencoder, run the diffusion process in that compressed space, then decode back to pixels. The VAE introduces a resolution ceiling because the decode step does not scale cleanly to arbitrary resolutions, and the encode step loses information that cannot be recovered at output.
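To make the encode and decode round trip concrete, the sketch below computes how large the latent grid is for a given image size. The downsampling factor of 8 and the 4 latent channels are assumptions typical of Stable Diffusion style VAEs, not figures from the L2P paper, and "4K" is taken here to mean 4096 x 4096.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent a VAE encoder would produce.

    downsample and latent_channels are assumed values, not taken from L2P.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4,
                      image_channels=3):
    """Number of pixel values represented by each latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


# Assumed 4K square image.
shape_4k = latent_shape(4096, 4096)
ratio_4k = compression_ratio(4096, 4096)
print(shape_4k, ratio_4k)
```

Under these assumptions every latent value stands in for dozens of pixel values, which is the information loss the paragraph above refers to.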
L2P discards the VAE entirely. The model processes image content directly in pixel space, working at full resolution from start to finish. There is no encode and decode round trip.
How L2P Transfers Latent Models to Pixel Space
The transfer approach is designed to reuse as much of the existing pretrained model as possible. L2P freezes the intermediate layers of the source latent diffusion model and trains only shallow transformation layers that bridge latent and pixel space representations.
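The freezing strategy can be sketched as a per-block plan. This is a minimal illustration, not the authors' code: the article does not say how many layers count as "shallow", so the block count and the choice of training the first and last two blocks are hypothetical.

```python
def plan_trainable_blocks(num_blocks, shallow_in=2, shallow_out=2):
    """Return one flag per block: True = train, False = keep frozen.

    Assumption: the "shallow transformation layers" are the first
    `shallow_in` and last `shallow_out` blocks; everything in between
    keeps its pretrained weights.
    """
    if shallow_in + shallow_out > num_blocks:
        raise ValueError("more shallow blocks requested than blocks available")
    flags = []
    for i in range(num_blocks):
        is_shallow = i < shallow_in or i >= num_blocks - shallow_out
        flags.append(is_shallow)
    return flags


# Hypothetical 24-block backbone.
flags = plan_trainable_blocks(num_blocks=24)
trainable = flags.count(True)
frozen = flags.count(False)
print(f"trainable blocks: {trainable}, frozen blocks: {frozen}")
```

In a PyTorch training script, a plan like this would typically be applied by setting `requires_grad = False` on the parameters of every frozen block, so the optimizer only updates the shallow ones.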
Large patch tokenization handles the shift from compressed latent tokens to full resolution pixel patches. Because only the transformation layers are trained, the process runs on 8 GPUs with what the authors describe as negligible computational overhead. Critically, L2P trains entirely on synthetic images generated by the source model itself. No real training data is required.
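A rough way to see why large patches matter is to count tokens. The sketch below assumes a VAE downsampling factor of 8, a latent patch size of 2, and a pixel patch size of 16; these values are illustrative and are not quoted from the paper.

```python
def token_count(height, width, patch):
    """Number of non-overlapping patch tokens for a height x width grid."""
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


# Assumed settings: 4096 x 4096 image, VAE factor 8, latent patch 2.
latent_tokens = token_count(4096 // 8, 4096 // 8, patch=2)
pixel_tokens_small_patch = token_count(4096, 4096, patch=2)
pixel_tokens_large_patch = token_count(4096, 4096, patch=16)

# Tiny 4 x 4 example to show what one patch contains.
tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
tiny_patches = patchify(tiny, patch=2)
print(latent_tokens, pixel_tokens_small_patch, pixel_tokens_large_patch)
```

With these assumed numbers, a pixel patch of 16 yields the same sequence length as the latent model, while keeping the small patch size in pixel space would multiply the token count by 64.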
4K Native Generation
L2P generates 4K images natively, without the tiling or multi pass compositing approaches often used to push latent models past their training resolution. On GenEval, the model scores at 93% of the source LDM's performance, indicating that the pixel space transfer retains most of the generation capability of the original model.
Performance Numbers
Single step inference at 4K runs 97.67% faster than the source latent diffusion model at the same resolution. The speed gain comes from removing the VAE decode stage and the architectural changes that come with processing in pixel space directly.
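A figure like "97.67% faster" can be read in two ways, and the article does not say which one the authors intend. The sketch below converts the percentage into a speedup factor under each reading; both interpretations are assumptions for illustration only.

```python
def speedup_from_latency_reduction(percent):
    """Speedup factor if `percent` is the fraction of latency removed."""
    if not 0 <= percent < 100:
        raise ValueError("latency reduction must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


def speedup_from_throughput_gain(percent):
    """Speedup factor if `percent` is the increase in throughput."""
    return 1.0 + percent / 100.0


reading_a = speedup_from_latency_reduction(97.67)  # latency cut by 97.67%
reading_b = speedup_from_throughput_gain(97.67)    # throughput up by 97.67%
print(f"latency reading: {reading_a:.1f}x, throughput reading: {reading_b:.2f}x")
```

The two readings differ by more than an order of magnitude, so the paper's own tables are the place to check which one applies.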
8K generation works via zero shot extrapolation. The model was not trained at 8K but generalizes to that resolution from the 4K native capability, with results shown on the project page from NJU PCaLab. DPG-Bench scores are on par with the source LDM, indicating faithful prompt adherence at these resolutions.
Access and License
L2P code is available on GitHub at TencentYoutuResearch/T2I-L2P. Weights are hosted on Hugging Face. The repository does not specify an explicit open source license, so commercial use terms should be confirmed directly with the research team before deployment.
For filmmakers using AI generated images in production, L2P represents a direct path from existing latent diffusion checkpoints to 4K and 8K capable pixel space generators. NVIDIA's PiD decoder solves the same high resolution problem from the decode end, replacing only the VAE decode step while keeping the latent generation pipeline. Microsoft's Lens approaches the efficiency question from the training side, reaching competitive results at 3.8 billion parameters on 19.3% of standard compute.
For open source pixel native generation at 8 billion parameters under MIT license, HiDream-O1-Image and the DyPE method offer additional approaches to high resolution output without VAE constraints.
Sources
arXiv: L2P: Unlocking Latent Potential for Pixel Generation
Project Page: NJU PCaLab: L2P
GitHub: TencentYoutuResearch/T2I-L2P
