Paris 2.0: Video Generation Without a GPU Cluster

May 29, 2026

Updated: July 1, 2026

Paris 2.0 is the first video generation model trained through decentralized computation, bypassing the monolithic GPU clusters that define how frontier AI video models are built. Released by Bagel Labs and submitted to arXiv on May 25, 2026, the model is open source under the MIT license, with weights available on HuggingFace.

What Decentralized Training Means

Training a video generation model at the frontier typically requires a monolithic GPU cluster: hundreds or thousands of GPUs in close physical proximity, connected by high-bandwidth links, synchronizing gradients at every training step. This infrastructure costs tens of millions of dollars and is accessible only to large technology companies and well-funded research labs.

Paris 2.0 uses a Decentralized Diffusion Model (DDM) architecture, first introduced in the Paris 1.0 paper for image generation (arXiv: 2510.03434, October 2025). The approach trains independent expert diffusion models across distributed nodes without gradient synchronization, parameter sharing, or activation exchange between nodes. A lightweight router then selects which experts to use during the denoising process at inference.

How the Architecture Works

Each expert in the DDM trains independently, as if it were a standalone model. The nodes do not communicate during training. After training, the router coordinates them at inference time.
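A minimal sketch of what "no communication during training" looks like in code, under heavy simplification. Each expert here is a toy linear noise predictor fitted by least squares, not a real diffusion network, and the names `train_expert` and `train_all_experts` are illustrative rather than taken from the Paris codebase. The even split of the data is also an assumption, since the article does not say how data is assigned to experts.

```python
import numpy as np

def train_expert(shard: np.ndarray, seed: int, sigma: float = 0.5) -> np.ndarray:
    """Fit a toy linear noise predictor using only this node's local shard."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(shard.shape)   # noise added to the clean samples
    noisy = shard + sigma * eps              # simplified forward (noising) step
    # Closed-form least squares: find W such that noisy @ W approximates eps.
    weights, *_ = np.linalg.lstsq(noisy, eps, rcond=None)
    return weights

def train_all_experts(data: np.ndarray, num_experts: int) -> list[np.ndarray]:
    shards = np.array_split(data, num_experts)   # disjoint data per node (assumed)
    # Each call reads only its own shard: no gradients, parameters,
    # or activations are passed between experts.
    return [train_expert(shard, seed=i) for i, shard in enumerate(shards)]

rng = np.random.default_rng(0)
data = rng.standard_normal((400, 8))             # stand-in for training features
experts = train_all_experts(data, num_experts=4)
print(len(experts), experts[0].shape)
```

Because the calls share no state, each one could run on a separate machine at a different time, which is the property the DDM approach relies on.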

The temporal coherence challenge, keeping generated video frames consistent over time, had remained unsolved for decentralized training. Paris 1.0 demonstrated the DDM approach for static images. Paris 2.0 extends it to video and shows that temporal consistency can be maintained across frames even when the underlying experts were trained without any communication between nodes.

Performance

Against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 reduces Frechet Video Distance from 561.04 to 279.01, roughly halving it (lower is better). The model also improves CLIP text-video similarity and aesthetic score against the monolithic baseline. Results are reported in the arXiv paper (2605.26064).
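The headline comparison can be checked with a few lines of arithmetic. The sketch below uses only the two FVD values quoted above; the variable names are illustrative.

```python
baseline_fvd = 561.04   # monolithic baseline, as quoted above
paris2_fvd = 279.01     # Paris 2.0, as quoted above

relative_reduction = (baseline_fvd - paris2_fvd) / baseline_fvd
ratio = baseline_fvd / paris2_fvd

print(f"relative reduction: {relative_reduction:.1%}")   # relative reduction: 50.3%
print(f"baseline / Paris 2.0: {ratio:.2f}x")             # baseline / Paris 2.0: 2.01x
```

Since lower FVD is better, "2x" and "50 percent" describe the same result from two directions: the baseline score is about twice the Paris 2.0 score.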

The model currently generates low-resolution text-to-video output. The paper frames Paris 2.0 as a research contribution establishing the feasibility of decentralized video training, with higher resolution output as a direction for future work.

License and Availability

Paris 2.0 is released under the MIT license, confirmed on the HuggingFace model card. Commercial use is permitted. Weights are hosted under the bageldotcom organization on HuggingFace. The full technical specification, architecture details, and training procedure are documented in the arXiv paper.

The Bagel Labs team, Ali Rouzbayani, Bidhan Roy, Marcos Villagra, and Zhiying Jiang, submitted the paper on May 25, 2026, with a revision posted May 28.

Why This Matters

Frontier video models such as Wan, HunyuanVideo, and Seedance are trained by organizations with dedicated GPU infrastructure. The DDM approach opens the possibility of training video generation models across distributed contributors, which could lower the capital cost of developing and fine-tuning models for specialized filmmaking applications.

The model is not a production tool at current resolution. Its significance is in demonstrating that the training paradigm works for video. Filmmakers already experimenting with AI diffusion tools at a technical level, as Gareth Edwards described after nine months of personal testing, will find Paris 2.0 a useful reference point for where open source training research is heading.

The Decentralized Diffusion Model Family

Paris 2.0 extends the DDM architecture that Bagel Labs introduced in Paris 1.0 for static image generation (arXiv: 2510.03434). The October 2025 image paper established the core claim: that independent expert diffusion models could be trained without any communication between nodes and still produce coherent output when coordinated at inference by a lightweight router.

Paris 2.0 applies the same approach to video. The additional challenge for video is temporal coherence: each frame in a sequence must be consistent with the frames before and after it. In standard monolithic training, that consistency is maintained because a single model sees all the training data and learns cross-frame dependencies. In the DDM approach, where expert models train independently, achieving temporal coherence without gradient sharing required specific architectural choices that the arXiv paper documents.

Paris 2.0 roughly halves Frechet Video Distance relative to a monolithic baseline trained under a matched compute budget. That result indicates the DDM approach can outperform the standard training paradigm in certain regimes, not just match it.

How the Router Coordinates Experts

The router is the component that makes the DDM architecture function at inference time. During training, each expert model receives a portion of the training data and trains without any awareness of the other experts. The router is trained separately and learns to assign specific types of inputs to specific experts based on characteristics the models have specialized in.

For video generation, the routing decisions happen at the denoising step level. As the model iterates from noise to coherent video frames, the router determines which expert or combination of experts handles each step. That coordination achieves the temporal consistency that would otherwise require the experts to communicate during training.
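Step-level routing as described above can be sketched in a toy loop. Everything in this block is an assumption made for illustration: the experts are random linear maps standing in for trained networks, the router is a random linear layer followed by a softmax, the update rule is a simplified placeholder rather than a real diffusion sampler, and names such as `route` and `denoise` are hypothetical, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, DIM, STEPS = 4, 16, 10

# Stand-ins for independently trained experts and a separately trained router.
expert_weights = [0.1 * rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router_weights = rng.standard_normal((NUM_EXPERTS, DIM + 1))

def expert_predict(k, x, t):
    """Noise estimate from expert k (toy linear map; t is unused here)."""
    return x @ expert_weights[k]

def route(x, t):
    """Return one probability per expert given the noisy sample and step."""
    features = np.append(x, t / STEPS)
    logits = router_weights @ features
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

def denoise(x, top_k=1):
    chosen = []
    for t in range(STEPS, 0, -1):
        probs = route(x, t)
        top = np.argsort(probs)[-top_k:]       # indices of the top_k experts
        weights = probs[top] / probs[top].sum()  # renormalise over the selection
        eps_hat = sum(w * expert_predict(k, x, t) for w, k in zip(weights, top))
        x = x - eps_hat / STEPS                # simplified update, not a real sampler
        chosen.append(int(top[-1]))            # highest-probability expert this step
    return x, chosen

x0 = rng.standard_normal(DIM)
sample, chosen = denoise(x0, top_k=1)
print(sample.shape, chosen)
```

With `top_k=1`, only one expert runs per step, so the per-step cost is one expert forward pass plus the router. Raising `top_k` blends several experts at proportionally higher cost.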

The lightweight designation for the router is meaningful. A complex routing system would reintroduce computational costs that the distributed training approach is designed to eliminate. Bagel Labs' choice to keep the router lightweight means the inference cost remains comparable to running a single model rather than multiplying with the number of experts.

The Temporal Coherence Problem in Video

Temporal coherence is the central technical challenge that separated video generation from image generation for the DDM approach. In a static image, the model generates a single output that either looks coherent or does not. In a video sequence, every frame must be consistent with its neighbors in a way that creates the perception of motion.

Existing video generation models handle this in several ways: some condition each frame generation on prior frames, some process the full sequence jointly, and some use architectural components designed specifically to maintain consistency over time. All of these approaches assume a unified model that has access to the full training dataset and can learn cross frame relationships from it.

Paris 2.0 demonstrates that temporal coherence can be maintained even when the underlying experts trained independently. The mechanism by which the router achieves this across independently trained models is the paper's core technical contribution and the finding that makes Paris 2.0 significant beyond its benchmark numbers.

What MIT License Enables

The MIT license is among the most permissive of the major open source licenses. It allows commercial use, modification, distribution, and private use, with the main condition being that the copyright and license notice are preserved in copies. For a video generation model, that means any company, independent filmmaker, or research institution can use Paris 2.0 as a starting point for building production tools without licensing fees or usage restrictions.

That contrasts with the licensing approaches of most frontier video models. Proprietary models charge per generation or per API call. Research models often carry noncommercial restrictions. MIT-licensed weights with commercial use permitted represent a specific contribution to the open source filmmaking ecosystem.

The practical implication for independent production is that Paris 2.0's architecture can be fine-tuned on domain-specific datasets, integrated into custom production pipelines, or extended by other researchers, all without negotiating license terms. As the model's resolution improves in future iterations, the MIT license will carry more weight.

The Bagel Labs Team

Bagel Labs published Paris 2.0 through a four-person research team: Ali Rouzbayani, Bidhan Roy, Marcos Villagra, and Zhiying Jiang. The paper was submitted May 25, 2026, with a revision posted three days later. The team's prior work includes the Paris 1.0 image generation paper, which established the DDM architecture.

The size of the team is relevant context. Frontier video generation models are typically products of large research divisions with dozens of contributors. Paris 1.0 and Paris 2.0 together demonstrate that a small team can produce architecturally significant research without the infrastructure of a major AI lab.

That scale also reflects the decentralized training approach the models are built around. Bagel Labs is not just arguing that distributed computation can train frontier models. They are operating as a distributed research team demonstrating the same principle in their organizational structure.

Where This Fits in Open Source Video Generation

The open source video generation landscape in mid-2026 includes models at several capability levels. Wan, HunyuanVideo, and similar models from large Chinese AI labs are capable of generating high resolution video from text prompts at quality levels approaching commercial tools. Paris 2.0 operates at lower resolution but introduces a training paradigm none of those models use.

The distinction matters for where the research is pointing. Higher resolution output in the DDM framework is achievable in principle, and the Paris 2.0 paper frames it as future work. If Bagel Labs or other researchers can scale the approach to the resolutions where Wan and HunyuanVideo operate, decentralized training will have demonstrated that it can match monolithic training at every capability level, not only at research benchmarks.

For the open source filmmaking community, Paris 2.0's significance today is as a proof of concept that changes what is technically possible. Its significance in twelve to eighteen months will depend on whether the architecture scales.

Practical Implications for Independent Production

The immediate practical implication of Paris 2.0 for independent filmmakers is indirect. The model generates low resolution output, which limits its direct use in production contexts where resolution matters. Its contribution is to the research trajectory, showing that the capital requirements for training new video generation models can be reduced through architectural choices rather than requiring more hardware.

If the DDM approach enables training at higher resolutions through distributed contributor networks, the organizations that develop and fine tune video generation models will no longer be exclusively large technology companies. That would change who builds the tools that filmmakers use, and potentially how those tools are designed, since developers working in distributed networks may prioritize different capabilities than developers inside large AI labs.

Paris 2.0 represents that trajectory rather than its destination. Independent filmmakers using AI video tools today are using products built with the monolithic training approach. The open source DDM research is building the alternative path.

The Benchmark Context

Paris 2.0 reports a Frechet Video Distance of 279.01 against 561.04 for the monolithic baseline, a reduction of about 50 percent. FVD is a standard metric for evaluating video generation quality, measuring the statistical distance between generated videos and real videos in a feature space. Lower values indicate that the generated distribution is closer to the real distribution.
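The "statistical distance in a feature space" behind FVD is the Frechet distance between two Gaussians fitted to the real and generated feature sets. The sketch below shows that calculation with NumPy. The random arrays stand in for features that would in practice come from a pretrained video network, and the function name `frechet_distance` is illustrative; this is not the paper's evaluation code.

```python
import numpy as np

def frechet_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_videos, feature_dim).
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))   # placeholder features, not real video data
shifted = real + 1.0                   # same spread, mean moved by 1 in each of 8 dims

print(frechet_distance(real, real))     # close to 0: identical distributions
print(frechet_distance(real, shifted))  # close to 8: squared length of the mean shift
```

The score grows when either the means or the covariances of the two feature sets diverge, which is why a lower FVD indicates generated videos that are statistically closer to real ones.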

The improvement is measured against a matched compute budget baseline, meaning the comparison accounts for the total computation used by both approaches rather than comparing a large distributed system to a small monolithic one. That matching makes the comparison meaningful for evaluating the architectural approach rather than just the compute investment.

CLIP text-video similarity and aesthetic score also improve in the Paris 2.0 paper's comparisons. These metrics measure how well the generated video matches the text prompt and how visually appealing the output is, as judged by a learned aesthetic model. Improvement across multiple metrics is stronger evidence for the architecture than improvement in a single benchmark would be.

The Frechet Video Distance metric and CLIP similarity together address distinct aspects of generation quality. FVD captures whether the videos look like real videos statistically. CLIP similarity captures whether the videos represent what the text prompt asked for. A model can improve one without improving the other, so improvement in both validates the architecture more completely.
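To make the second metric concrete, a text-video similarity score of this kind is typically a cosine similarity between a text embedding and embeddings of the video frames. The sketch below assumes the embeddings are already computed and averages the similarity over frames. Both the averaging scheme and the function name `clip_style_similarity` are assumptions for illustration; the article does not describe the paper's exact protocol.

```python
import numpy as np

def clip_style_similarity(text_embedding: np.ndarray, frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between one text embedding and per-frame embeddings."""
    text = text_embedding / np.linalg.norm(text_embedding)
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return float((frames @ text).mean())

# Hypothetical 2-D embeddings: one frame aligned with the text, one orthogonal to it.
text = np.array([1.0, 0.0])
frames = np.array([[2.0, 0.0], [0.0, 3.0]])
print(clip_style_similarity(text, frames))   # 0.5
```

Nothing in this score depends on whether the frames look realistic, which is why it complements FVD rather than duplicating it.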

The Research Trajectory

Paris 2.0 follows Paris 1.0 by approximately seven months. The 1.0 image paper established the DDM framework. The 2.0 video paper extends it to a harder problem. The naming suggests a continuing research program rather than a single paper's contribution.

Higher resolution video output is explicitly named as a future direction in the paper. If the research timeline continues at a similar pace, a Paris 3.0 or equivalent with production-relevant resolution is a plausible development within the next year. Each iteration represents both a capability improvement and a validation of the broader decentralized training hypothesis.

The DDM approach may also apply to modalities beyond video. If distributed training can handle the temporal consistency demands of video generation, it may also handle other sequence generation tasks. Bagel Labs' research agenda, if it continues, may produce tools for audio, 3D, or multimodal generation that carry the same architectural principles into additional domains relevant to film production.

Sources

arXiv: Paris 2.0: A Decentralized Diffusion Model for Video Generation
GitHub: bageldotcom
HuggingFace: bageldotcom/paris2
Project Page: bagel.com
