NVIDIA Cosmos 3: Open World Foundation Model for Physics Aware Video Generation

June 4, 2026

Updated: July 18, 2026

NVIDIA released Cosmos 3 on June 1, 2026, publishing model weights and code under the OpenMDW1.1 license, which permits commercial use. Cosmos 3 is a world foundation model designed primarily for physical AI and robotics training, with video generation capabilities that produce physics-accurate output. It comes in two parameter sizes, a 16B Nano variant and a 65B Super variant, and the Super size is also offered in dedicated image-to-video and text-to-image configurations.

What "Physical AI" Means

NVIDIA's term "physical AI" refers to AI systems that operate in the physical world: robots, autonomous vehicles, manufacturing systems, and any agent that must perceive and act in an environment governed by physics. NVIDIA CEO Jensen Huang began using the term prominently in 2024 to describe the next wave of AI deployment, distinguishing it from AI that operates purely in digital environments (language models, code generation, image generation). Physical AI requires perception of the real world, planning within physical constraints, and actuation that produces physical consequences. Training these systems requires enormous quantities of realistic video showing how objects behave under physical constraints, at a scale that only synthetic generation can provide.

Cosmos 3 generates that training data. A robot learning to pick up objects needs to see thousands of examples of objects falling, rolling, deforming, and bouncing under different conditions. Generating that data synthetically is faster, cheaper, and more controllable than capturing it on physical stages. The video generation capability that makes Cosmos 3 useful for robot training is the same capability that makes its output distinct from aesthetics-focused generation models.

The ambient audio modality underscores this design orientation. Physical AI systems need to understand how environments sound, not just how they look. A robot navigating a warehouse hears motors, footsteps, and machinery. Cosmos 3's training on ambient audio means it develops an implicit model of how environments sound alongside how they look.

What Makes Cosmos 3 Different From Earlier Cosmos Versions

NVIDIA released Cosmos 1 in January 2025 at 4B and 12B parameter sizes. Cosmos 2 followed with expanded parameter counts and broader task coverage. Cosmos 3's Nano (16B) and Super (65B) variants are substantially larger than both predecessors, reflecting the scale required to lead open models across six benchmark categories at the same time.

The addition of ambient audio as a native modality is the most architecturally significant change from Cosmos 2. Prior versions generated video only. Cosmos 3 generates synchronized audio alongside video, which requires understanding the acoustic properties of the environments it generates. A warehouse scene needs to sound like a warehouse. An underwater scene needs the acoustic signature of water. This is not post-production audio; it is part of the model's forward pass.

A World Foundation Model for Physical AI

Cosmos 3 is built to understand and generate physical environments. Its stated purpose is generating synthetic training data for robots and autonomous systems, producing video output that accurately represents how objects behave under physical constraints. The model natively processes and generates text, images, video, ambient sound, and actions within a single architecture.

NVIDIA trained it on 20 trillion tokens, including approximately 1 billion images, 400 million real and synthetic videos, and ambient audio data alongside text and action sequences. Including audio in the training pipeline means the model develops an implicit understanding of how environments sound, not just how they look.

The Training Data at Scale

20 trillion tokens is the total training volume. For comparison, large language models typically train on 1 to 10 trillion tokens of text. The difference reflects the information density gap between text and video: a single second of video at standard resolution contains more raw data than pages of text.
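The information density gap can be made concrete with rough arithmetic. The sketch below compares one second of uncompressed video with a page of plain text. The resolution (1280x720), frame rate (24 fps), 8-bit RGB encoding, and 3,000 characters per page are assumptions chosen for illustration, not figures from NVIDIA. Raw bytes are also not the same thing as training tokens, since video tokenizers compress frames heavily.

```python
# Illustrative arithmetic only. Every constant here is an assumption,
# not a figure published by NVIDIA.
WIDTH, HEIGHT = 1280, 720   # assumed "standard" resolution (720p)
FPS = 24                    # assumed frame rate
BYTES_PER_PIXEL = 3         # uncompressed 8-bit RGB
CHARS_PER_PAGE = 3000       # assumed characters on one page of plain text
BYTES_PER_CHAR = 1          # single-byte characters (ASCII)


def raw_video_bytes_per_second(width, height, fps, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes in one second of uncompressed video."""
    return width * height * bytes_per_pixel * fps


def pages_of_text_equivalent(num_bytes, chars_per_page=CHARS_PER_PAGE,
                             bytes_per_char=BYTES_PER_CHAR):
    """How many pages of plain text occupy the same number of bytes."""
    return num_bytes / (chars_per_page * bytes_per_char)


video_bytes = raw_video_bytes_per_second(WIDTH, HEIGHT, FPS)
pages = pages_of_text_equivalent(video_bytes)

print(f"One second of raw video: {video_bytes / 1e6:.1f} MB")
print(f"Equivalent pages of plain text: {pages:,.0f}")
```

Under these assumptions, one second of raw video is about 66 MB, on the order of 22,000 pages of text.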

400 million videos is a training set assembled across years of collection. The real and synthetic mix is important: synthetic videos allow NVIDIA to include controlled examples of physical events that are rare in naturally captured footage, such as specific collision scenarios, unusual material interactions, and extended sequences of a single type of physical phenomenon.

The 1 billion image figure represents static scene understanding. Cosmos 3's world model needs to know what physical scenes look like before they move, establishing the baseline visual representation that the video generation system builds on.

Architecture and Variants

[Figure: Cosmos 3 architecture diagram showing the mixture of transformers design. Credit: NVIDIA Corporation.]

The four-variant release structure covers the major deployment scenarios the model addresses: video generation (Nano and Super), single-image animation (Image2Video), and text-to-image generation (Text2Image). The parameter range from 16B to 65B allows organizations with different hardware budgets to deploy the version that fits their compute constraints.
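To see how the 16B to 65B range maps onto hardware budgets, a back-of-the-envelope estimate of weight memory helps. The sketch below is a minimal illustration that assumes bf16 weights (2 bytes per parameter) and an 80 GB accelerator. It counts weights only and ignores activations, caches, and framework overhead, so real requirements will be higher. The function names are hypothetical and the numbers are not NVIDIA's published requirements.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gb(params_billions, dtype="bf16"):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    One billion parameters at N bytes each occupy N GB, so the result
    is simply params_billions * bytes_per_param.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billions, dtype="bf16", gpu_memory_gb=80.0):
    """Lower bound on devices needed just to hold the weights."""
    return math.ceil(weight_memory_gb(params_billions, dtype) / gpu_memory_gb)


for name, size_b in [("Nano", 16), ("Super", 65)]:
    gb = weight_memory_gb(size_b)
    gpus = min_gpus_for_weights(size_b)
    print(f"Cosmos 3 {name} ({size_b}B): ~{gb} GB of bf16 weights, "
          f"at least {gpus} x 80 GB GPU(s)")
```

Under these assumptions the 16B Nano weights take about 32 GB and fit on one 80 GB device, while the 65B Super weights take about 130 GB and need at least two.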

Cosmos 3 uses a mixture of transformers architecture. NVIDIA released four variants at launch:

Cosmos 3 Nano (16B): video generation and action reasoning, optimized for faster inference.
Cosmos 3 Super (65B): highest physics accuracy across generation tasks.
Cosmos 3 Super Image2Video (65B): image-to-video generation from a single input frame.
Cosmos 3 Super Text2Image (65B): text-to-image generation.

A fifth variant, Cosmos 3 Edge, has been announced for a future release and targets real-time inference on edge hardware. All currently released variants are available on HuggingFace under the OpenMDW1.1 license. Cloud access is available through build.nvidia.com for users without local H100-class hardware.
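For readers who want to pull the weights locally, the following is a minimal sketch using the `huggingface_hub` package's `snapshot_download` function. The repository IDs are the ones listed in this article's Sources and have not been independently verified here. The helper names `resolve_repo` and `download_variant` are hypothetical, and the repositories may require accepting license terms or logging in before a download succeeds.

```python
# Repository IDs as listed in this article's Sources section (unverified).
REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
    "image2video": "nvidia/Cosmos3-Super-Image2Video",
    "text2image": "nvidia/Cosmos3-Super-Text2Image",
}


def resolve_repo(variant):
    """Map a variant name such as 'Nano' to its HuggingFace repository ID."""
    key = variant.strip().lower()
    if key not in REPOS:
        raise ValueError(f"Unknown variant {variant!r}; expected one of {sorted(REPOS)}")
    return REPOS[key]


def download_variant(variant, local_dir=None):
    """Download every file in the variant's repository and return the local path.

    The import is done lazily so the mapping helper above works even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=resolve_repo(variant), local_dir=local_dir)


if __name__ == "__main__":
    # Only resolves the name; call download_variant("nano") to fetch the weights.
    print(resolve_repo("Nano"))
```

The full checkpoints are large, so it is worth confirming available disk space before calling `download_variant`.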

The OpenMDW1.1 License

OpenMDW1.1 is the license NVIDIA uses for its open AI model releases. It permits commercial use, which distinguishes it from research-only licenses. The key restrictions are that the model cannot be used to build competing AI model services and that derivative models must carry the same license terms.

For filmmakers and production companies, the commercial use permission is the operative detail. A production that generates environments, assets, or reference footage using Cosmos 3 can do so without licensing restrictions on the resulting work. The model is a tool in the production pipeline, not a licensed component that affects rights to the output.

The comparison to Apache 2.0 is useful. Apache 2.0 is more permissive on derivative work licensing. OpenMDW1.1 is more restrictive in that derivatives must stay under OpenMDW1.1. For most production uses, this distinction does not matter. For companies building AI products on top of Cosmos 3, it does.

How Cosmos 3 Compares to Prior Versions

Cosmos 1 launched in January 2025 at 4B and 12B parameter sizes under the same license family. Cosmos 2 expanded parameter counts and added the video-to-world simulation capability. Cosmos 3 is the first version to lead six benchmark categories at once against all open models, not just against earlier Cosmos versions.

The jump from Cosmos 2 to Cosmos 3 also introduces ambient audio as a native modality. Prior versions generated video only. Adding audio to the generation pipeline requires the model to learn how environments sound, which takes additional training data and a different architecture at the output layer. The 20 trillion token total includes the audio data needed for that capability.

Benchmark Results

Cosmos 3 ranks first among open models on six published benchmarks. On Physics-IQ and PAI-Bench, it scores highest for physics-accurate generation. On R-Bench, it leads open models for world generation quality. For robotics applications, it leads on RoboLab and RoboArena for action policy learning. On VANTAGE-Bench and TAR, it leads for vision understanding.

Physics-IQ and PAI-Bench measure whether generated video depicts physical events correctly: objects falling at appropriate rates, collisions producing correct outcomes, materials behaving according to their physical properties. These benchmarks did not exist before physical AI world models created a need for them. Standard video generation benchmarks like VBench measure visual quality and prompt fidelity but do not evaluate physical accuracy. Cosmos 3's leadership on these benchmarks reflects NVIDIA's design priority: physical accuracy over aesthetic quality, which is the correct priority for robotics training data.
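To illustrate what "objects falling at appropriate rates" could mean in practice, here is a toy check. It is not how Physics-IQ or PAI-Bench compute their scores. It assumes an object released from rest, with its height in meters already tracked in every frame, and fits the free-fall relation y(t) = y0 - 0.5 * g * t^2 by least squares to see whether the implied g is close to 9.81 m/s^2. The function names and the 10% tolerance are hypothetical choices.

```python
G_EARTH = 9.81  # m/s^2, standard gravitational acceleration near Earth's surface


def estimate_gravity(heights_m, fps):
    """Least-squares estimate of g from per-frame heights of a dropped object.

    Assumes the object starts at rest in frame 0, so the distance fallen at
    time t is 0.5 * g * t**2. With x = 0.5 * t**2 and d = y0 - y, the
    one-parameter least-squares solution is g = sum(x * d) / sum(x * x).
    """
    if len(heights_m) < 2:
        raise ValueError("need at least two frames")
    y0 = heights_m[0]
    num = 0.0
    den = 0.0
    for i, y in enumerate(heights_m):
        t = i / fps
        x = 0.5 * t * t
        num += x * (y0 - y)
        den += x * x
    return num / den


def falls_plausibly(heights_m, fps, tolerance=0.10):
    """True if the fitted g is within `tolerance` (relative) of Earth gravity."""
    g = estimate_gravity(heights_m, fps)
    return abs(g - G_EARTH) / G_EARTH <= tolerance


def synthetic_drop(g, fps=24, frames=12, start_height_m=2.0):
    """Generate ideal free-fall heights for testing the check."""
    return [start_height_m - 0.5 * g * (i / fps) ** 2 for i in range(frames)]


print(falls_plausibly(synthetic_drop(9.81), fps=24))  # realistic fall
print(falls_plausibly(synthetic_drop(4.0), fps=24))   # object falls too slowly
```

A real evaluation would also have to recover metric scale and object tracks from the generated pixels, which this sketch takes as given.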

These benchmarks measure the model's ability to generate environments that obey physical laws, which is the design target. The physics accuracy that makes Cosmos 3 useful for robot training is the same property that makes its video output distinct from generation models optimized for aesthetic quality alone. For context on how NVIDIA has approached open world models previously, see the earlier NVIDIA SANA-WM and NVIDIA Lyra 2 releases. As of July 2026, however, LingBot-Video from Ant Group scored 0.620 on R-Bench, above Cosmos 3 Super (0.581), and now holds the top open-source position on that benchmark.

The Physical AI Training Data Problem

The core problem Cosmos 3 addresses is data scarcity for physical AI systems. A robot learning to navigate a warehouse needs to see thousands of examples of specific physical interactions: boxes falling, surfaces with different friction properties, humans moving in tight spaces. Capturing all of that in real environments takes years and significant physical infrastructure.

Synthetic data generation solves the scarcity problem, but it requires a generation system that produces physically plausible video. Most generation models produce visually convincing output but get the physics wrong: objects fall at incorrect rates, collisions produce wrong outcomes, surfaces behave incorrectly. Cosmos 3 is designed to generate video where the physics are right, which makes it useful for training physical AI systems.

Open Source Strategy and What It Enables

NVIDIA's decision to publish Cosmos 3 weights under OpenMDW1.1 continues the pattern established by Cosmos 1 and Cosmos 2: major world model releases under licenses that permit commercial use. The contrast is with models like Sora, which is closed and available only through OpenAI's API, and Google Veo 3, which is also closed.

Open weights mean researchers and production companies can run Cosmos 3 locally, fine-tune it on proprietary datasets, and deploy it in production pipelines without per-call API costs. For a studio building a synthetic training data pipeline for visual effects or animation, that economic difference matters significantly over a multi-year deployment. The OpenMDW1.1 restriction on competing AI model services is the only limitation relevant to most production use cases.
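The economic argument can be framed as a simple break-even calculation: local deployment carries a fixed up-front cost but a lower cost per generation, so it pays off once enough generations have been run. The sketch below shows that calculation. All dollar figures in the example are hypothetical placeholders, not real pricing for any API or hardware, and the function name is made up for illustration.

```python
import math


def break_even_calls(fixed_local_cost, api_cost_per_call, local_cost_per_call):
    """Number of calls after which local deployment becomes cheaper than an API.

    Returns None when the local per-call cost is not lower than the API
    per-call cost, since the fixed cost is then never recovered.
    """
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        return None
    return math.ceil(fixed_local_cost / saving_per_call)


# Hypothetical inputs, chosen only to show how the function is used.
calls = break_even_calls(
    fixed_local_cost=100_000,   # hardware and setup, in dollars (placeholder)
    api_cost_per_call=0.50,     # per generated clip via a hosted API (placeholder)
    local_cost_per_call=0.25,   # power and operations per clip (placeholder)
)
print(f"Local deployment breaks even after {calls:,} calls")
```

The useful part is the shape of the calculation rather than the specific output: the larger the per-call saving and the longer the deployment, the sooner the fixed cost is recovered.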

Edge Variant and Future Deployment

NVIDIA announced a fifth Cosmos 3 variant, Cosmos 3 Edge, as a planned future release targeting real-time inference on edge hardware. Edge hardware in this context means devices without data-center-class compute: embedded processors in robots, autonomous vehicle compute units, and industrial systems where sending data to a central server for inference is too slow or too expensive.

An edge-capable world model is the deployment target that makes physical AI commercially viable at scale. Robots that require cloud connectivity for perception cannot operate in environments with network latency, disconnected sites, or bandwidth constraints. Cosmos 3 Edge moves the world modeling capability to the device, which is a different engineering problem from the cloud-inference variants and will require further work after the initial four-variant release.

Video Generation for Filmmakers

Cosmos 3 is not positioned as a filmmaking tool. NVIDIA describes it as infrastructure for physical AI. The video generation capabilities, however, produce output with properties that are useful in production contexts, specifically for generating environments that behave consistently under physical constraints rather than drifting between frames.

For character motion within those environments, ARDY from NVIDIA Research generates 3D human and humanoid motion from text in real time with kinematic constraints, published alongside Cosmos 3 and accepted to SIGGRAPH 2026.

The Image2Video variant (65B) generates video from a single input frame, a task directly applicable to extending a still image into a moving shot. The ambient audio training means Cosmos 3 generates footage with an implicit sense of environment, which is relevant to scene design and synthetic location work. These applications are a byproduct of the physical AI design goals, not the primary intent, but they are capabilities the model demonstrably has.

ARDY and the Cosmos 3 Research Cluster

NVIDIA published ARDY alongside Cosmos 3 in June 2026. ARDY generates 3D human and humanoid motion from text in real time with kinematic constraints. It was accepted to SIGGRAPH 2026, the primary venue for computer graphics research. The two releases together address separate layers of the physical AI pipeline: Cosmos 3 generates environments and object physics, while ARDY generates character motion within those environments. For production teams building synthetic training data, having both available under compatible licenses reduces the integration work needed to combine environment generation with character animation.

Sources

GitHub: nvidia/Cosmos
HuggingFace (Nano): nvidia/Cosmos3-Nano
HuggingFace (Super): nvidia/Cosmos3-Super
HuggingFace (Image2Video): nvidia/Cosmos3-Super-Image2Video
HuggingFace (Text2Image): nvidia/Cosmos3-Super-Text2Image
NVIDIA press release: NVIDIA Launches Cosmos 3
License: OpenMDW1.1
