Paris 2.0: Video Generation Without a GPU Cluster

Paris 2.0 is the first video generation model trained through decentralized computation, bypassing the monolithic GPU clusters on which frontier AI video models are normally built. Released by Bagel Labs and submitted to arXiv on May 25, 2026, the model is open source under the MIT license, with weights available on HuggingFace.
What Decentralized Training Means
Training a video generation model at the frontier typically requires a monolithic GPU cluster: hundreds or thousands of GPUs in close physical proximity, connected by high-bandwidth links, synchronizing gradients at every training step. This infrastructure costs tens of millions of dollars and is accessible only to large technology companies and well-funded research labs.
Paris 2.0 uses a Decentralized Diffusion Model (DDM) architecture, first introduced in the Paris 1.0 paper for image generation (arXiv: 2510.03434, October 2025). The approach trains independent expert diffusion models across distributed nodes without gradient synchronization, parameter sharing, or activation exchange between nodes. A lightweight router then selects which experts to use during the denoising process at inference.
How the Architecture Works
Each expert in the DDM trains independently, as if it were a standalone model. The nodes do not communicate during training. After training, the router coordinates them at inference time.
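The train-in-isolation, route-at-inference pattern described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not taken from the Paris 2.0 paper or code: the scalar "experts", the distance-based router, the `top_k` parameter, and the function names `train_expert`, `router_weights`, and `denoise_step` are all hypothetical stand-ins for what would in practice be full diffusion networks and a learned router.

```python
import math

def train_expert(shard, steps=200, lr=0.1):
    """Fit one scalar 'expert' to its own data shard by gradient descent.

    Nothing leaves this function during training: no gradients,
    parameters, or activations are shared with other experts.
    """
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error between w and the shard.
        grad = sum(2 * (w - x) for x in shard) / len(shard)
        w -= lr * grad
    return w

def router_weights(x, centers, temperature=1.0):
    """Toy router: softmax over negative distance to each shard's center."""
    logits = [-abs(x - c) / temperature for c in centers]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def denoise_step(x, experts, centers, top_k=1):
    """Pick the top_k experts for input x and blend their predictions."""
    weights = router_weights(x, centers)
    chosen = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in chosen)
    return sum(weights[i] / norm * experts[i] for i in chosen)

# Three "nodes", each holding a disjoint shard of the data.
shards = [[-2.1, -1.9, -2.0], [1.9, 2.0, 2.1], [5.9, 6.0, 6.1]]
experts = [train_expert(s) for s in shards]      # trained independently
centers = [sum(s) / len(s) for s in shards]      # what the toy router knows

prediction = denoise_step(2.3, experts, centers, top_k=1)
blended = denoise_step(2.3, experts, centers, top_k=2)
print(prediction, blended)
```

With `top_k=1` the router hands the input to the single nearest expert; with `top_k=2` it blends the two most relevant experts using renormalized router weights. The point of the sketch is only that the experts never interact until the router combines them at inference.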
The temporal coherence challenge, keeping generated video frames consistent over time, had remained unsolved for decentralized training. Paris 1.0 demonstrated the DDM approach for static images. Paris 2.0 extends it to video and shows that temporal consistency can be maintained across frames even when the underlying model trained without any communication between nodes.
Performance
Against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 reduces Fréchet Video Distance (FVD, where lower is better) from 561.04 to 279.01, roughly halving it. The model also improves CLIP text-video similarity and aesthetic score against the monolithic baseline. Results are reported in the arXiv paper (2605.26064).
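As a quick check of the arithmetic behind that comparison, the two reported FVD values can be turned into a ratio and a relative reduction. The variable names below are illustrative only; the two numbers are the ones quoted above.

```python
# FVD values as reported for the matched-compute comparison (lower is better).
fvd_monolithic = 561.04
fvd_paris2 = 279.01

ratio = fvd_monolithic / fvd_paris2
relative_reduction = (fvd_monolithic - fvd_paris2) / fvd_monolithic

print(f"{ratio:.2f}x lower FVD, a {relative_reduction:.1%} reduction")
```

The ratio comes out at about 2.01 and the relative reduction at about 50.3%, which is consistent with describing the result as roughly a halving of FVD.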
The model currently generates low-resolution text-to-video output. The paper frames Paris 2.0 as a research contribution establishing the feasibility of decentralized video training, with higher-resolution output as a direction for future work.
License and Availability
Paris 2.0 is released under the MIT license, confirmed on the HuggingFace model card. Commercial use is permitted. Weights are hosted under the bageldotcom organization on HuggingFace. The full technical specification, architecture details, and training procedure are documented in the arXiv paper.
The Bagel Labs team (Ali Rouzbayani, Bidhan Roy, Marcos Villagra, and Zhiying Jiang) submitted the paper on May 25, 2026, with a revision posted May 28.
Why This Matters
Frontier video models such as Wan, HunyuanVideo, and Seedance are trained by organizations with dedicated GPU infrastructure. The DDM approach opens the possibility of training video generation models across distributed contributors, which could lower the capital cost of developing and fine-tuning models for specialized filmmaking applications.
The model is not a production tool at its current resolution. Its significance lies in demonstrating that the training paradigm works for video. Filmmakers already experimenting with AI diffusion tools at a technical level, as Gareth Edwards described after nine months of personal testing, will find Paris 2.0 a useful reference point for where open-source training research is heading.
Sources
arXiv: Paris 2.0: A Decentralized Diffusion Model for Video Generation
GitHub: bageldotcom
HuggingFace: bageldotcom/paris2
Project Page: bagel.com

