InSpatio-World: Open Source 4D World Model From Video

March 19, 2026

Still from demo video by InSpatio-World


InSpatio-World is the first open source real time 4D world model that takes a video as input and turns it into a dynamic, navigable world. You can roam freely across viewpoints, control time forward and backward, and build on top of it.

InSpatio-World 4D world model demo showing a navigable real time scene from a reference video. Still from demo video by InSpatio-World.

What It Does

Existing 2D and static models capture a single perspective of a scene. The physical world has three spatial dimensions and one time dimension. InSpatio-World models all four, conditioning the output on a reference video and letting you explore the resulting world interactively.

The 1.3 billion parameter model runs at 24 FPS on a single GPU. It ranks first among all real time methods on the WorldScore-Dynamic leaderboard, a benchmark that evaluates 3D, 4D, and video generation systems on controllability, visual quality, and dynamic consistency.
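To put the 24 FPS figure in concrete terms, here is a minimal sketch of the per-frame time budget it implies. The frame rate comes from the article; the function names are hypothetical and the rest is plain arithmetic, not a measurement of InSpatio-World itself.

```python
# Frame-time budget implied by a target frame rate.
# The 24 FPS figure is from the article; the helper names are illustrative only.

def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(per_frame_ms: float, fps: float) -> bool:
    """True if a measured per-frame latency fits inside the budget for `fps`."""
    return per_frame_ms <= frame_budget_ms(fps)


budget = frame_budget_ms(24)
print(f"{budget:.2f} ms per frame")  # 41.67 ms per frame
```

Any system that claims real time output at 24 FPS has to finish each frame, end to end, inside roughly that window.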

The distinction between real time and offline methods matters in production workflows. Offline world models can produce higher quality output but require render times that break the interactive loop of previsualization. InSpatio-World's single GPU real time output means a director or cinematographer can explore the world during a meeting without specialized infrastructure.

The benchmark ranking is consistent with what the architecture implies: InSpatio-World performs best on scenes with strong physical dynamics in the reference video, where temporal control and physical realism have real material to work with. Slow moving or static reference footage produces a stable world but a less interactive one.

Generating from video rather than from scratch has practical implications for accuracy. The model is not imagining a scene. It is representing one that was captured. Physical relationships between objects, the way light falls, and the actual scale of the environment are all grounded in what the camera saw.

Four Core Capabilities

Free Spatial Roaming

Immerse yourself in the scene and experience the same event from diverse vantage points.

Free spatial roaming across different vantage points of the same scene
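As an illustration of what exploring vantage points can mean in practice, the sketch below generates evenly spaced camera positions on a circle around a point of interest. This is generic previsualization geometry, not InSpatio-World's input format: how the model accepts camera poses is not described here, and the function name, coordinate convention (y up, yaw measured in the x-z plane), and parameters are all assumptions.

```python
import math


def orbit_positions(center, radius, height, num_views):
    """Return (x, y, z, yaw_degrees) tuples evenly spaced on a circle around `center`.

    Hypothetical helper. Yaw is the heading in the x-z plane that points
    the camera back at the center of the orbit.
    """
    cx, cy, cz = center
    views = []
    for i in range(num_views):
        angle = 2.0 * math.pi * i / num_views
        x = cx + radius * math.cos(angle)
        z = cz + radius * math.sin(angle)
        # Direction from the camera to the center, projected onto the x-z plane.
        yaw = math.degrees(math.atan2(cz - z, cx - x))
        views.append((x, cy + height, z, yaw))
    return views


views = orbit_positions(center=(0.0, 0.0, 0.0), radius=2.0, height=1.5, num_views=8)
print(len(views))  # 8
```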

Temporal Control

Pause, slow down, or even reverse time to re-experience captured moments.

Temporal control: pause, reverse, or slow down time within the world
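One way to picture pause, slow motion, and reverse is as a playback rate applied to a scene clock. The sketch below is a hypothetical helper written for illustration, assuming a clip of known duration; it is not part of InSpatio-World and says nothing about how the model handles time internally.

```python
from dataclasses import dataclass


@dataclass
class SceneClock:
    """Maps playback controls to a timestamp inside a clip of known duration.

    Hypothetical helper for illustration; not an InSpatio-World API.
    """
    duration: float      # length of the reference clip in seconds
    t: float = 0.0       # current scene time in seconds
    rate: float = 1.0    # 1.0 normal, 0.5 slow motion, 0.0 paused, -1.0 reverse

    def tick(self, dt: float) -> float:
        """Advance by `dt` seconds of wall-clock time, clamped to the clip."""
        self.t = min(max(self.t + self.rate * dt, 0.0), self.duration)
        return self.t


clock = SceneClock(duration=10.0)
clock.tick(2.0)      # play forward: t = 2.0
clock.rate = 0.5
clock.tick(2.0)      # slow motion: t = 3.0
clock.rate = 0.0
clock.tick(5.0)      # paused: t stays at 3.0
clock.rate = -1.0
clock.tick(1.0)      # reverse: t = 2.0
print(clock.t)       # 2.0
```

A signed rate keeps pause and reverse as ordinary values of one control rather than separate modes.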

Physical Realism

Drawing from the natural dynamics of the reference video, the model preserves physically consistent and realistic dynamics.

Physical realism: consistent dynamics derived from the reference video

Long Horizon Stability

Even under extended exploration, the world remains anchored to the reference video, preventing drift and preserving consistency with the source scene.

Long horizon stability: the generated world stays consistent over extended exploration

The Technical Problem It Solves

Generative video models simulate pixels rather than persistent worlds. That leads to three failures: physical inconsistency (objects interpenetrate or float), spatial fragility (objects outside the frame become unstable), and temporal drift (world state degrades over long sequences).

InSpatio-World addresses all three with State-Anchored World Modeling. The reference video is anchored as a viewpoint-independent Local World State. All generated observations are sampled from this state rather than from frame history alone. Three components implement this:

World State Anchoring: builds a persistent world state ensuring spatial and physical constancy
Spatiotemporal Autoregression: performs precise sampling conditioned on the reference video, enabling free navigation across viewpoints and time
Joint Distribution Matching Distillation: balances real world fidelity with synthetic controllability, enabling stable generalization under user interaction

The result is spatiotemporally consistent sampling that mitigates long term drift. The model knows where it is in space and time at every step.

State-Anchored World Modeling works differently from frame history based approaches. Rather than conditioning on the most recent frames and hoping the accumulated context is sufficient, InSpatio-World samples from a viewpoint-independent representation of the entire reference video. The world state is always fully accessible, not just the last few frames of it.
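The difference between the two conditioning strategies can be shown with a deliberately simple toy. The sketch below is not InSpatio-World's algorithm: it assumes a single number stands in for the world, and that every generation step adds the same small error. Under those assumptions, chaining each output off the previous one lets error pile up, while sampling each output from a fixed anchor keeps error at one step's worth.

```python
# Toy 1D illustration of drift. Assumption: every generation step adds the
# same small error `step_error`; real models behave far less tidily.

def chained_rollout(true_values, step_error):
    """Each output builds on the previous output, so errors accumulate."""
    outputs = []
    accumulated = 0.0
    for value in true_values:
        accumulated += step_error
        outputs.append(value + accumulated)
    return outputs


def anchored_rollout(true_values, step_error):
    """Each output is sampled from the fixed anchor, so error never compounds."""
    return [value + step_error for value in true_values]


def max_abs_error(outputs, true_values):
    """Largest deviation from the true values over the whole rollout."""
    return max(abs(o - v) for o, v in zip(outputs, true_values))


truth = [0.0] * 240   # 10 seconds at 24 FPS of an unchanging quantity
eps = 0.01

chained = max_abs_error(chained_rollout(truth, eps), truth)
anchored = max_abs_error(anchored_rollout(truth, eps), truth)
print(chained > anchored)  # True
```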

The distillation step in Joint Distribution Matching Distillation trades some expressive range for inference speed. The model runs in real time because distillation has compressed the behavior of a larger, slower model into a smaller, faster one. The quality ceiling is lower than that of an unconstrained offline model, but high enough for previsualization and world exploration workflows.

For Filmmakers

The combination of spatial roaming and temporal control has direct applications in previsualization and shot planning. A director can feed in existing reference footage, explore the resulting world from any angle, and then scrub backward and forward in time to identify the best moment and framing for a shot.

This is a fundamentally different workflow from standard video generation. Instead of generating a clip and evaluating it, you inhabit the scene and direct from inside it. The model keeps the world stable so creative decisions made at one timestamp or viewpoint remain valid across the exploration.

The temporal control capability sets InSpatio-World apart from standard world generation tools. The ability to pause or reverse time within a generated scene is not primarily a visual effect. It is a directorial tool: a way to identify the exact frame where a performance or movement reads correctly, then freeze on it and adjust the camera angle without losing the moment.

Shot planning for action sequences or dynamic scenes becomes a different kind of work when the world is live and reversible. A director can run the same action forward and backward multiple times, stopping at any frame to test coverage from different angles without additional generation steps.

The anchor video can be any footage the production team already has. Shot on set or on location, the reference clip becomes the seed for a navigable world rather than a fixed cut. Teams can explore alternate framings of existing captured material without returning to the location.

Teams working with world simulators for virtual production can compare InSpatio-World against other open source approaches. Lingbot World from Ant Group focuses on camera pose control for cinematographic applications. The MIND benchmark documents six specific challenges that current systems face. InSpatio-World takes a different angle: it starts from a reference video rather than generating from scratch, which anchors the world to real captured dynamics rather than purely synthesized ones.

For production teams that need precise camera trajectories over long sequences without a reference video anchor, MosaicMem introduces a hybrid spatial memory design that lifts patches to 3D and reprojects them at target viewpoints, sustaining consistent scenes across 2 minute rollouts. AMAP-ML's DreamX-World 1.0 starts from the opposite end: it generates worlds from text prompts with six degree of freedom camera navigation, covering realistic and stylized environments under MIT and Apache 2.0 licenses.

InSpatio-World's real time inference speed means feedback on a shot angle or timeline moment arrives without waiting for a render queue. The 24 FPS output plays back at standard frame rate, so previsualization decisions can be made in the same visual register as the final footage.

The model is open source under the project's license. Filmmakers and production teams with access to a single GPU can run InSpatio-World on their own infrastructure. Inference requires no specialized hardware beyond a standard GPU with sufficient VRAM for the 1.3 billion parameter model.
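For a rough sense of what "sufficient VRAM" might mean, the sketch below estimates the memory needed for the weights alone. The 1.3 billion parameter count is from the article; the precision options are assumptions, since the numeric format InSpatio-World ships in is not stated here. Activations, caches, and framework overhead come on top of these figures, so treat them as a lower bound.

```python
# Rough weight-memory estimate. The parameter count is from the article;
# the bytes-per-parameter values are assumptions about numeric precision.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GiB (excludes activations and caches)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gib(1.3e9, precision):.2f} GiB")
```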

For reference video generation, AI FILMS Studio's video workspace lets you generate the source footage that InSpatio-World can then transform into a navigable 4D world.

AI FILMS Studio video generation workspace


Sources

InSpatio-World Project Page https://inspatio.github.io/inspatio-world/

InSpatio-World GitHub Repository https://github.com/inspatio/inspatio-world
