NVIDIA Cosmos 3: Open World Foundation Model for Physics-Aware Video Generation
NVIDIA released Cosmos 3 on June 1, 2026, publishing model weights and code under the OpenMDW1.1 license, which permits commercial use. Cosmos 3 is a world foundation model designed primarily for physical AI and robotics training, with video generation capabilities that produce physics-accurate output. It comes in two parameter sizes, a 16B Nano variant and a 65B Super variant, and the 65B size is also offered in dedicated image-to-video and text-to-image configurations.
A World Foundation Model for Physical AI
Cosmos 3 is built to understand and generate physical environments. Its stated purpose is generating synthetic training data for robots and autonomous systems, producing video output that accurately represents how objects behave under physical constraints. The model natively processes and generates text, images, video, ambient sound, and actions within a single architecture.
The ambient sound modality distinguishes Cosmos 3 from prior open world models. NVIDIA trained it on 20 trillion tokens, including approximately 1 billion images, 400 million real and synthetic videos, and ambient audio data alongside text and action sequences. Including audio in the training pipeline means the model develops an implicit understanding of how environments sound, not just how they look.
Architecture and Variants
Cosmos 3 uses a mixture-of-transformers architecture. NVIDIA released four variants at launch:
- Cosmos 3 Nano (16B): video generation and action reasoning, optimized for faster inference
- Cosmos 3 Super (65B): highest physics accuracy across generation tasks
- Cosmos 3 Super Image2Video (65B): image-to-video generation from a single input frame
- Cosmos 3 Super Text2Image (65B): text-to-image generation
A fifth variant, Cosmos 3 Edge, has been announced for a future release and targets real-time inference on edge hardware. All four released variants are available on HuggingFace under the OpenMDW1.1 license. Users without local H100-class hardware can access the models in the cloud through build.nvidia.com.
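To give a rough sense of why the larger variants call for H100-class hardware, the sketch below estimates the memory needed to hold the weights alone. The parameter counts come from the list above. The 2 bytes per parameter (a 16-bit weight format) and the 80 GB accelerator size are assumptions made for illustration, not figures NVIDIA has published for Cosmos 3, and the estimate ignores activations, caches, and framework overhead.

```python
import math

# Parameter counts (in billions) as stated in the variant list above.
VARIANT_PARAMS_B = {
    "Cosmos 3 Nano": 16,
    "Cosmos 3 Super": 65,
    "Cosmos 3 Super Image2Video": 65,
    "Cosmos 3 Super Text2Image": 65,
}

# Assumption: weights stored in a 16-bit format, i.e. 2 bytes per parameter.
BYTES_PER_PARAM = 2
# Assumption: an accelerator with 80 GB of memory.
GPU_MEMORY_GB = 80


def weight_memory_gb(params_billion, bytes_per_param=BYTES_PER_PARAM):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion, gpu_memory_gb=GPU_MEMORY_GB):
    """Smallest number of GPUs whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(params_billion) / gpu_memory_gb)


if __name__ == "__main__":
    for name, params in VARIANT_PARAMS_B.items():
        print(f"{name}: ~{weight_memory_gb(params)} GB weights, "
              f">= {min_gpus_for_weights(params)} GPU(s) at {GPU_MEMORY_GB} GB")
```

Under these assumptions the 16B variant needs about 32 GB for weights and the 65B variants about 130 GB, so a 65B model would not fit on a single 80 GB device without quantization or sharding. Real requirements will be higher once inference-time memory is included.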
Benchmark Results
Cosmos 3 ranks first among open models on seven published benchmarks. On Physics-IQ and PAI-Bench, it scores highest for physics-accurate generation. On R-Bench, it leads open models for world generation quality. For robotics applications, it leads on RoboLab and RoboArena for action policy learning. On VANTAGE-Bench and TAR, it leads for vision understanding.
These benchmarks measure the model's ability to generate environments that obey physical laws, which is the design target. The physics accuracy that makes Cosmos 3 useful for robot training is the same property that makes its video output distinct from generation models optimized for aesthetic quality alone. For context on how NVIDIA has approached open world models previously, see the earlier NVIDIA SANA-WM and NVIDIA Lyra 2 releases.
Video Generation for Filmmakers
Cosmos 3 is not positioned as a filmmaking tool. NVIDIA describes it as infrastructure for physical AI. The video generation capabilities, however, produce output with properties that are useful in production contexts, specifically for generating environments that behave consistently under physical constraints rather than drifting between frames.
The Image2Video variant (65B) generates video from a single input frame, a task directly applicable to extending a still image into a moving shot. The ambient audio training means Cosmos 3 generates footage with an implicit sense of environment, which is relevant to scene design and synthetic location work. These applications are a byproduct of the physical AI design goals, not the primary intent, but they are capabilities the model demonstrably has.
Sources
GitHub: nvidia/Cosmos
HuggingFace (Nano): nvidia/Cosmos3-Nano
HuggingFace (Super): nvidia/Cosmos3-Super
HuggingFace (Image2Video): nvidia/Cosmos3-Super-Image2Video
HuggingFace (Text2Image): nvidia/Cosmos3-Super-Text2Image
NVIDIA press release: NVIDIA Launches Cosmos 3
License: OpenMDW1.1
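For readers who want to fetch one of the checkpoints listed above, here is a minimal sketch using the `snapshot_download` function from the `huggingface_hub` package. The repository IDs are copied from the Sources list. The short variant keys and the helper names `repo_id_for` and `download_variant` are hypothetical, introduced only for this example. Whether the repositories require accepting the license or logging in first is not stated in the article, so treat that as something to check.

```python
# Repository IDs as listed in the Sources section above.
# The short keys ("nano", "super", ...) are hypothetical labels for this sketch.
COSMOS3_REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
    "image2video": "nvidia/Cosmos3-Super-Image2Video",
    "text2image": "nvidia/Cosmos3-Super-Text2Image",
}


def repo_id_for(variant):
    """Map a short variant name to its HuggingFace repository ID."""
    key = variant.strip().lower()
    if key not in COSMOS3_REPOS:
        known = ", ".join(sorted(COSMOS3_REPOS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}")
    return COSMOS3_REPOS[key]


def download_variant(variant, local_dir):
    """Download every file in the variant's repository into local_dir.

    Requires the third-party huggingface_hub package and network access.
    Returns the local path reported by snapshot_download.
    """
    # Imported lazily so repo_id_for works without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(variant), local_dir=local_dir)
```

Calling `download_variant("nano", "./cosmos3-nano")` would pull the full Nano repository. The checkpoints are large, so it is worth confirming available disk space before starting a download.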


