MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
MosaicMem is a spatial memory architecture for video world models from researchers at the University of Toronto, Vector Institute, Georgia Tech, and several collaborating institutions. The system generates camera-controlled video up to 2 minutes long by combining explicit 3D patch memory with the dynamic generation capabilities of a diffusion model.
The Problem With Current Approaches
Video world models use spatial memory to stay consistent when a camera revisits a scene. Two dominant approaches exist, and both fall short.
Explicit systems build 3D geometric caches from frames using point clouds or Gaussian splats. They achieve strong view consistency but cannot animate moving objects. Any dynamic content in the scene is frozen or absent.
Implicit systems store past frames as latent tokens and retrieve them through attention. They handle dynamic content well but struggle with precise camera trajectories. Camera pose is provided as conditioning, but its influence on frame-to-frame motion is weak, and error accumulates over long sequences.
Patch and Compose
MosaicMem introduces a "patch and compose" interface as the core memory mechanism. Rather than operating on whole frames or full 3D reconstructions, it treats patches as the fundamental memory unit.
Each patch is lifted to 3D using an off-the-shelf depth estimator. When the camera moves to a new viewpoint, the system reprojects the relevant patches to the target view and passes them as conditioning tokens to the video diffusion model. The model decides whether to follow the retrieved memory or generate new content from the text prompt.
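The lift-and-reproject step follows standard pinhole camera geometry. Below is a minimal sketch of the idea, assuming a pinhole intrinsics matrix `K` and a 4×4 source-to-target camera transform; the function names and conventions are illustrative, not the paper's API.

```python
import numpy as np

def lift_to_3d(uv, depth, K):
    """Unproject a pixel (u, v) with estimated depth into camera-space 3D."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def reproject(point_src, T_src_to_tgt, K):
    """Transform a camera-space point into the target view and project to pixels."""
    p = T_src_to_tgt[:3, :3] @ point_src + T_src_to_tgt[:3, 3]
    u = K[0, 0] * p[0] / p[2] + K[0, 2]
    v = K[1, 1] * p[1] / p[2] + K[1, 2]
    return np.array([u, v]), p[2]  # pixel position and depth in target view

# Toy intrinsics: focal length 500 px, principal point (320, 180).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 180.0], [0.0, 0.0, 1.0]])
point = lift_to_3d((320, 180), 4.0, K)     # principal point lies on the optical axis
uv, z = reproject(point, np.eye(4), K)     # identity pose: exact round trip
```

In the full system each memory patch carries a feature vector alongside its 3D position, and the reprojected locations determine where its conditioning tokens land in the target frame.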
Text driven dynamic generation: giraffe scene
Moving object with spatial consistency: wolf scene
Two alignment methods prevent geometric drift in the latent space. Warped RoPE adjusts the positional encodings of retrieved patches to match their reprojected coordinates in the target view. Warped Latent directly resamples the feature map at those projected positions using bilinear interpolation. Training with a mixture of both produces the strongest results.
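The Warped Latent path reduces to bilinear resampling of a feature map at continuous projected coordinates. A self-contained sketch of that operation (a generic bilinear sampler, not the paper's exact implementation):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample an (H, W, C) feature map at a continuous (x, y) position."""
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

# Toy 4x4 single-channel map with values 0..15; sampling at the center of a
# 2x2 cell averages its four neighbors.
feat = np.arange(16, dtype=float).reshape(4, 4, 1)
val = bilinear_sample(feat, 1.5, 1.5)  # mean of 5, 6, 9, 10 -> 7.5
```

Warped RoPE, by contrast, leaves the features untouched and instead rewrites their positional encodings to the reprojected coordinates, which is why mixing both during training gives the model complementary alignment signals.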
Camera Control via PRoPE
MosaicMem adds PRoPE (Projective Positional Encoding) as a dedicated camera conditioning module. PRoPE encodes the relative geometry between camera views by injecting projective transforms into the self-attention mechanism. This compensates for cases where the retrieved memory alone cannot specify fine-grained camera motion, such as large rotations or sparse patch coverage.
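The core quantity in a relative camera encoding is the transform that carries one view's frame into another's. The sketch below shows only that piece, with a deliberately simplified scalar "bias" standing in for how relative geometry might enter attention; the helper names, the world-to-camera convention, and the bias formulation are illustrative assumptions, not PRoPE's actual math.

```python
import numpy as np

def relative_projective_encoding(P_i, P_j):
    """4x4 transform carrying view j's camera frame into view i's.
    P_i, P_j are assumed to be world-to-camera extrinsics."""
    return P_i @ np.linalg.inv(P_j)

def geometry_bias(P_i, P_j, w):
    """Illustrative only: score the flattened relative transform with a
    weight vector `w` to produce a scalar attention bias."""
    return float(relative_projective_encoding(P_i, P_j).ravel() @ w)

# Identical poses yield the identity transform: no relative motion to encode.
rel = relative_projective_encoding(np.eye(4), np.eye(4))
```

The key property is that the encoding depends only on relative pose, so the same scene observed from a shifted pair of cameras produces the same conditioning signal.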
The combination of Warped RoPE conditioning and PRoPE separates MosaicMem from both explicit and implicit baselines. Explicit systems lack dynamic flexibility. Implicit systems lack camera accuracy. MosaicMem targets both shortcomings at once.
Long-Horizon Navigation Up to 2 Minutes
MosaicMem generates navigation sequences up to 2 minutes long through autoregressive rollout. Each 80-frame segment is generated, added to the patch memory pool, and used to condition the next segment. The final frame of each segment becomes the anchor for the next.
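The rollout loop described above can be sketched as follows. The model and patch extractor here are toy stand-ins (a string per camera pose), not the paper's API; only the control flow mirrors the description.

```python
def extract_patches(segment, cams):
    # Stand-in: pair each generated frame with its camera pose.
    return list(zip(segment, cams))

class ToyModel:
    def generate(self, prompt, cams, memory, anchor):
        # Stand-in generator: one "frame" per camera pose.
        return [f"frame@{c}" for c in cams]

def rollout(model, prompt, camera_path, frames_per_segment):
    memory, anchor, video = [], None, []
    for start in range(0, len(camera_path), frames_per_segment):
        cams = camera_path[start:start + frames_per_segment]
        segment = model.generate(prompt, cams, memory=memory, anchor=anchor)
        memory.extend(extract_patches(segment, cams))  # grow the patch pool
        anchor = segment[-1]                           # anchor the next segment
        video.extend(segment)
    return video, memory

# Two 80-frame segments over a 160-pose camera path.
video, memory = rollout(ToyModel(), "a forest trail", list(range(160)), 80)
```

Because every segment writes into the shared patch pool, a camera that revisits earlier territory retrieves the same patches it deposited minutes earlier, which is what keeps long rollouts from drifting.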
Long navigation with memory retrieval overlay
2 minute navigation: persistent memory across revisits
The comparison baseline, Context-as-Memory, deteriorates during long rollouts. Artifacts build up past the 30-second mark and generation coherence collapses. MosaicMem maintains consistent scene structure across the full 2-minute sequence.
Memory Manipulation for Scene Editing
Because patches are stored with their 3D spatial coordinates, MosaicMem supports direct manipulation of memory content. Patches can be deleted, duplicated, relocated, or concatenated. The model treats the modified memory as its authoritative record of the scene.
Scene stitching is the most practical application. Two distinct environments can be registered along a shared boundary by concatenating their patch memories. Navigating from one into the other, and back, leaves both intact.
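Because patches are plain records of features plus 3D coordinates, the manipulation operations are ordinary list and array edits. A minimal sketch, assuming a hypothetical `(feature, position)` tuple representation; the operation names are illustrative:

```python
import numpy as np

def relocate(patches, offset):
    """Shift every patch's 3D position by a constant offset."""
    return [(feat, pos + offset) for feat, pos in patches]

def delete_region(patches, center, radius):
    """Drop patches within `radius` of `center` (e.g. to erase an object)."""
    return [(f, p) for f, p in patches if np.linalg.norm(p - center) > radius]

def stitch(scene_a, scene_b, boundary_offset):
    """Register scene B alongside scene A by translating B past a shared
    boundary, then concatenating the two patch pools."""
    return scene_a + relocate(scene_b, boundary_offset)

a = [(np.zeros(4), np.array([0.0, 0.0, 1.0]))]   # toy one-patch scenes
b = [(np.ones(4), np.array([0.0, 0.0, 1.0]))]
stitched = stitch(a, b, np.array([10.0, 0.0, 0.0]))
```

After stitching, the model simply sees one larger memory pool; navigation across the boundary retrieves patches from whichever scene the camera is currently facing.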
Memory manipulation: scene stitching via patch operations
One example from the paper inverts a scene's memory and registers it in the sky region. Looking up reveals a second environment overhead, a spatial trick with applications in surreal or fantasy set design.
Mosaic Forcing: Real-Time Generation
The autoregressive "Mosaic Forcing" variant generates video at 16 FPS at 640×360 resolution. On the VBench quality benchmark it scores 81.11 on the total metric, versus 79.08 for RELIC and 75.11 for Matrix-Game 2.0. Unlike the base model, which runs offline for best quality, Mosaic Forcing is designed for interactive use.
Quantitative Results
MosaicMem was evaluated against explicit and implicit memory baselines on camera accuracy, visual quality, memory retrieval consistency, and motion dynamics. Full model results:
- Rotation error: 0.51° (vs. 1.42°–1.61° for explicit baselines, 4.65°–5.87° for implicit baselines)
- Translation error: 0.06
- FID: 65.67 (vs. 74.67–89.17 for all other methods)
- SSIM: 0.75 (vs. 0.47–0.66 for all baselines)
- Dynamic score: 2.58 (highest across all tested systems)
The dynamic score is the clearest indicator of the hybrid advantage. Explicit memory baselines score between 0.41 and 0.68. MosaicMem scores 2.58, showing that patch based conditioning preserves the motion generation capacity of the underlying diffusion model.
Training and Data
The model fine-tunes Wan 2.2 5B using AdamW for 250,000 steps with an effective batch size of 64 on 8 H100 GPUs. A new benchmark called MosaicMEM-World supports training and evaluation with data from four sources: Unreal Engine 5 scenes, commercial game environments including Cyberpunk 2077, real world first person captures, and selected sequences from the Sekai dataset.
Depth estimation uses Depth Anything V3. Scene annotations come from Gemini 3 with a "static + dynamic" labeling scheme that describes first frame content and temporal dynamics separately. Code and data are planned for release.
Related Research
The spatial memory design in MosaicMem builds on a wave of world model research from the past several months. InSpatio-World takes reference video as input and generates a navigable 4D environment with temporal control. Lingbot World from Ant Group released a commercially licensed world simulator with comparable camera control. For long form video generation at the diffusion model level, Helios from Peking University achieved 19.5 FPS at 14 billion parameters without KV-cache. The MIND benchmark provides open tools for testing memory consistency across all these systems on the same criteria.
Sources
arXiv: MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Project Page: mosaicmem.github.io/mosaicmem
