NVIDIA Cosmos 3: Open World Foundation Model for Physics-Aware Video Generation

June 4, 2026

NVIDIA released Cosmos 3 on June 1, 2026, publishing model weights and code under the OpenMDW1.1 license, which permits commercial use. Cosmos 3 is a world foundation model designed primarily for physical AI and robotics training, with video generation capabilities that produce physics-accurate output. It comes in two parameter sizes, a 16B Nano variant and a 65B Super variant, plus dedicated image-to-video and text-to-image configurations of the 65B model.

A World Foundation Model for Physical AI

Cosmos 3 is built to understand and generate physical environments. Its stated purpose is generating synthetic training data for robots and autonomous systems, producing video output that accurately represents how objects behave under physical constraints. The model natively processes and generates text, images, video, ambient sound, and actions within a single architecture.

The ambient sound modality distinguishes Cosmos 3 from earlier open world models. NVIDIA trained it on 20 trillion tokens, including approximately 1 billion images, 400 million real and synthetic videos, and ambient audio data alongside text and action sequences. Including audio in the training pipeline means the model develops an implicit understanding of how environments sound, not just how they look.

Architecture and Variants

Figure: Cosmos 3 model architecture, showing the mixture-of-transformers design. Image: NVIDIA Corporation.

Cosmos 3 uses a mixture-of-transformers architecture. NVIDIA released four variants at launch:

  • Cosmos 3 Nano (16B): video generation and action reasoning, optimized for faster inference
  • Cosmos 3 Super (65B): highest physics accuracy across generation tasks
  • Cosmos 3 Super Image2Video (65B): image-to-video generation from a single input frame
  • Cosmos 3 Super Text2Image (65B): text-to-image generation
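
The parameter counts in the list above give a rough sense of the hardware footprint. The sketch below is back-of-the-envelope arithmetic only: it assumes the weights are held densely at a fixed precision (bf16 by default, which this post does not confirm for the released checkpoints) and it ignores activations, caches, and framework overhead. The 80 GB default is the memory of a single H100 SXM GPU, and the function names are illustrative rather than part of any NVIDIA tooling.

```python
import math

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

# Parameter counts in billions, as stated in this post.
VARIANT_PARAMS_B = {
    "Cosmos 3 Nano": 16,
    "Cosmos 3 Super": 65,
}


def weight_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 parameters * bytes each / 1e9 bytes per GB
    simplifies to params_billions * bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billions: float, dtype: str = "bf16",
                         gpu_gb: float = 80.0) -> int:
    """Lower bound on GPUs needed just to hold the weights.

    Assumption: weights can be sharded evenly; real deployments need
    additional headroom for activations and intermediate buffers.
    """
    return math.ceil(weight_gb(params_billions, dtype) / gpu_gb)


if __name__ == "__main__":
    for name, size_b in VARIANT_PARAMS_B.items():
        print(name, weight_gb(size_b), "GB in bf16,",
              "at least", min_gpus_for_weights(size_b), "x 80 GB GPU(s)")
```

Under these assumptions the 16B weights fit on one 80 GB GPU, while the 65B weights alone exceed a single GPU's memory, so the three Super variants would need multi-GPU sharding or a lower-precision format.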

A fifth variant, Cosmos 3 Edge, has been announced for a future release and targets real-time inference on edge hardware. All currently released variants are available on HuggingFace under the OpenMDW1.1 license. For users without local H100-class hardware, cloud access is available through build.nvidia.com.
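
For readers who want the weights locally, here is a minimal sketch of fetching one variant with the `huggingface_hub` library's `snapshot_download` function. The repository IDs are the ones listed in this post's Sources section; the short variant keys, the helper names, and the `checkpoints` directory are hypothetical choices made for illustration. Whether the repositories require accepting license terms or logging in first is not stated here, so treat the download step as untested.

```python
from pathlib import Path

# Repository IDs as listed in this post's Sources section.
# The short keys on the left are illustrative, not official names.
REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
    "image2video": "nvidia/Cosmos3-Super-Image2Video",
    "text2image": "nvidia/Cosmos3-Super-Text2Image",
}


def repo_for(variant: str) -> str:
    """Map a short variant key to its HuggingFace repository ID."""
    key = variant.strip().lower()
    if key not in REPOS:
        raise KeyError(f"unknown variant {variant!r}; expected one of {sorted(REPOS)}")
    return REPOS[key]


def download(variant: str, target_dir: str = "checkpoints") -> Path:
    """Download every file in the variant's repository into target_dir."""
    # Imported lazily so repo_for() works without the dependency installed.
    from huggingface_hub import snapshot_download

    repo_id = repo_for(variant)
    local_dir = Path(target_dir) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return local_dir


if __name__ == "__main__":
    # Only resolves the ID; call download("nano") to fetch the files.
    print(repo_for("nano"))
```

Calling `download("nano")` would place the files under `checkpoints/Cosmos3-Nano`. Loading and running the model is a separate step that depends on the code in the nvidia/Cosmos GitHub repository, which this sketch does not cover.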

Benchmark Results

Cosmos 3 ranks first among open models on published benchmarks in four areas. On Physics-IQ and PAI-Bench, it scores highest for physics-accurate generation. On R-Bench, it leads open models for world generation quality. For robotics applications, it leads on RoboLab and RoboArena for action policy learning. On VANTAGE-Bench and TAR, it leads for vision understanding.

These benchmarks measure the model's ability to generate environments that obey physical laws, which is the design target. The physics accuracy that makes Cosmos 3 useful for robot training is the same property that makes its video output distinct from generation models optimized for aesthetic quality alone. For context on how NVIDIA has approached open world models previously, see the earlier NVIDIA SANA-WM and NVIDIA Lyra 2 releases.

Video Generation for Filmmakers

Cosmos 3 is not positioned as a filmmaking tool. NVIDIA describes it as infrastructure for physical AI. The video generation capabilities, however, produce output with properties that are useful in production contexts, specifically for generating environments that behave consistently under physical constraints rather than drifting between frames.

The Image2Video variant (65B) generates video from a single input frame, a task directly applicable to extending a still image into a moving shot. The ambient audio training means Cosmos 3 generates footage with an implicit sense of environment, which is relevant to scene design and synthetic location work. These applications are a byproduct of the physical AI design goals, not the primary intent, but they are capabilities the model demonstrably has.


Sources

GitHub: nvidia/Cosmos
HuggingFace (Nano): nvidia/Cosmos3-Nano
HuggingFace (Super): nvidia/Cosmos3-Super
HuggingFace (Image2Video): nvidia/Cosmos3-Super-Image2Video
HuggingFace (Text2Image): nvidia/Cosmos3-Super-Text2Image
NVIDIA press release: NVIDIA Launches Cosmos 3
License: OpenMDW1.1