MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
MosaicMem is a spatial memory architecture for video world models from researchers at the University of Toronto, Vector Institute, Georgia Tech, and several collaborating institutions. The system generates camera-controlled video up to 2 minutes long by combining explicit 3D patch memory with the dynamic generation capabilities of a diffusion model.
The Problem With Current Approaches
Video world models use spatial memory to stay consistent when a camera revisits a scene. Two dominant approaches exist, and both fall short.
Explicit systems build 3D geometric caches from frames using point clouds or Gaussian splats. They achieve strong view consistency but cannot animate moving objects. Any dynamic content in the scene is frozen or absent.
Implicit systems store past frames as latent tokens and retrieve them through attention. They handle dynamic content well but struggle with precise camera trajectories. Camera pose is provided as conditioning, but its influence on frame-to-frame motion is weak, and error accumulates over long sequences.
Patch and Compose
MosaicMem introduces a "patch and compose" interface as the core memory mechanism. Rather than operating on whole frames or full 3D reconstructions, it treats patches as the fundamental memory unit.
Each patch is lifted to 3D using an off-the-shelf depth estimator. When the camera moves to a new viewpoint, the system reprojects the relevant patches to the target view and passes them as conditioning tokens to the video diffusion model. The model decides whether to follow the retrieved memory or generate new content from the text prompt.
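The lift-and-reproject step follows standard pinhole camera geometry. Below is a minimal sketch of the idea, assuming a pinhole intrinsics matrix `K` and a 4×4 source-to-target camera transform; the function names and conventions are illustrative, not the paper's API.

```python
import numpy as np

def lift_to_3d(uv, depth, K):
    """Unproject a pixel (u, v) with estimated depth into camera-space 3D."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def reproject(point_src, T_src_to_tgt, K):
    """Transform a camera-space point into the target view and project to pixels."""
    p = T_src_to_tgt[:3, :3] @ point_src + T_src_to_tgt[:3, 3]
    u = K[0, 0] * p[0] / p[2] + K[0, 2]
    v = K[1, 1] * p[1] / p[2] + K[1, 2]
    return np.array([u, v]), p[2]  # pixel position and depth in target view

# Toy intrinsics: focal length 500 px, principal point (320, 180).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 180.0], [0.0, 0.0, 1.0]])
point = lift_to_3d((320, 180), 4.0, K)     # principal point lies on the optical axis
uv, z = reproject(point, np.eye(4), K)     # identity pose: exact round trip
```

In the full system each memory patch carries a feature vector alongside its 3D position, and the reprojected locations determine where its conditioning tokens land in the target frame.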
Text driven dynamic generation: giraffe scene
Moving object with spatial consistency: wolf scene
Two alignment methods prevent geometric drift in the latent space. Warped RoPE adjusts the positional encodings of retrieved patches to match their reprojected coordinates in the target view. Warped Latent directly resamples the feature map at those projected positions using bilinear interpolation. Training with a mixture of both produces the strongest results.
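The Warped Latent path reduces to bilinear resampling of a feature map at continuous projected coordinates. A self-contained sketch of that operation (a generic bilinear sampler, not the paper's exact implementation):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample an (H, W, C) feature map at a continuous (x, y) position."""
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

# Toy 4x4 single-channel map with values 0..15; sampling at the center of a
# 2x2 cell averages its four neighbors.
feat = np.arange(16, dtype=float).reshape(4, 4, 1)
val = bilinear_sample(feat, 1.5, 1.5)  # mean of 5, 6, 9, 10 -> 7.5
```

Warped RoPE, by contrast, leaves the features untouched and instead rewrites their positional encodings to the reprojected coordinates, which is why mixing both during training gives the model complementary alignment signals.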
Camera Control via PRoPE
MosaicMem adds PRoPE (Projective Positional Encoding) as a dedicated camera conditioning module. PRoPE encodes the relative geometry between camera views by injecting projective transforms into the self-attention mechanism. This compensates for cases where the retrieved memory alone cannot specify fine-grained camera motion, such as large rotations or sparse patch coverage.
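The core quantity in a relative camera encoding is the transform that carries one view's frame into another's. The sketch below shows only that piece, with a deliberately simplified scalar "bias" standing in for how relative geometry might enter attention; the helper names, the world-to-camera convention, and the bias formulation are illustrative assumptions, not PRoPE's actual math.

```python
import numpy as np

def relative_projective_encoding(P_i, P_j):
    """4x4 transform carrying view j's camera frame into view i's.
    P_i, P_j are assumed to be world-to-camera extrinsics."""
    return P_i @ np.linalg.inv(P_j)

def geometry_bias(P_i, P_j, w):
    """Illustrative only: score the flattened relative transform with a
    weight vector `w` to produce a scalar attention bias."""
    return float(relative_projective_encoding(P_i, P_j).ravel() @ w)

# Identical poses yield the identity transform: no relative motion to encode.
rel = relative_projective_encoding(np.eye(4), np.eye(4))
```

The key property is that the encoding depends only on relative pose, so the same scene observed from a shifted pair of cameras produces the same conditioning signal.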
The combination of Warped RoPE conditioning and PRoPE separates MosaicMem from both explicit and implicit baselines. Explicit systems lack dynamic flexibility. Implicit systems lack camera accuracy. MosaicMem targets both shortcomings at once.
Long-Horizon Navigation Up to 2 Minutes
MosaicMem generates navigation sequences up to 2 minutes long through autoregressive rollout. Each 80-frame segment is generated, added to the patch memory pool, and used to condition the next segment. The final frame of each segment becomes the anchor for the next.
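The rollout loop described above can be sketched as follows. The model and patch extractor here are toy stand-ins (a string per camera pose), not the paper's API; only the control flow mirrors the description.

```python
def extract_patches(segment, cams):
    # Stand-in: pair each generated frame with its camera pose.
    return list(zip(segment, cams))

class ToyModel:
    def generate(self, prompt, cams, memory, anchor):
        # Stand-in generator: one "frame" per camera pose.
        return [f"frame@{c}" for c in cams]

def rollout(model, prompt, camera_path, frames_per_segment):
    memory, anchor, video = [], None, []
    for start in range(0, len(camera_path), frames_per_segment):
        cams = camera_path[start:start + frames_per_segment]
        segment = model.generate(prompt, cams, memory=memory, anchor=anchor)
        memory.extend(extract_patches(segment, cams))  # grow the patch pool
        anchor = segment[-1]                           # anchor the next segment
        video.extend(segment)
    return video, memory

# Two 80-frame segments over a 160-pose camera path.
video, memory = rollout(ToyModel(), "a forest trail", list(range(160)), 80)
```

Because every segment writes into the shared patch pool, a camera that revisits earlier territory retrieves the same patches it deposited minutes earlier, which is what keeps long rollouts from drifting.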
Long navigation with memory retrieval overlay
2 minute navigation: persistent memory across revisits
The comparison baseline, Context-as-Memory, deteriorates during long rollouts. Artifacts build up past the 30-second mark and generation coherence collapses. MosaicMem maintains consistent scene structure across the full 2-minute sequence.
Memory Manipulation for Scene Editing
Because patches are stored with their 3D spatial coordinates, MosaicMem supports direct manipulation of memory content. Patches can be deleted, duplicated, relocated, or concatenated. The model treats the modified memory as its authoritative record of the scene.
Scene stitching is the most practical application. Two distinct environments can be registered along a shared boundary by concatenating their patch memories. Navigating from one into the other, and back, leaves both intact.
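Because patches are plain records of features plus 3D coordinates, the manipulation operations are ordinary list and array edits. A minimal sketch, assuming a hypothetical `(feature, position)` tuple representation; the operation names are illustrative:

```python
import numpy as np

def relocate(patches, offset):
    """Shift every patch's 3D position by a constant offset."""
    return [(feat, pos + offset) for feat, pos in patches]

def delete_region(patches, center, radius):
    """Drop patches within `radius` of `center` (e.g. to erase an object)."""
    return [(f, p) for f, p in patches if np.linalg.norm(p - center) > radius]

def stitch(scene_a, scene_b, boundary_offset):
    """Register scene B alongside scene A by translating B past a shared
    boundary, then concatenating the two patch pools."""
    return scene_a + relocate(scene_b, boundary_offset)

a = [(np.zeros(4), np.array([0.0, 0.0, 1.0]))]   # toy one-patch scenes
b = [(np.ones(4), np.array([0.0, 0.0, 1.0]))]
stitched = stitch(a, b, np.array([10.0, 0.0, 0.0]))
```

After stitching, the model simply sees one larger memory pool; navigation across the boundary retrieves patches from whichever scene the camera is currently facing.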
Memory manipulation: scene stitching via patch operations
One example from the paper inverts a scene's memory and registers it in the sky region. Looking up reveals a second environment overhead, a spatial trick with applications in surreal or fantasy set design.
Mosaic Forcing: Real-Time Generation
The autoregressive "Mosaic Forcing" variant generates video at 16 FPS at 640×360 resolution. On the VBench quality benchmark it scores 81.11 on the total metric, versus 79.08 for RELIC and 75.11 for Matrix-Game 2.0. Unlike the base model, which runs offline for best quality, Mosaic Forcing is designed for interactive use.
Quantitative Results
MosaicMem was evaluated against explicit and implicit memory baselines on camera accuracy, visual quality, memory retrieval consistency, and motion dynamics. Full model results:
- Rotation error: 0.51° (vs. 1.42°–1.61° for explicit baselines, 4.65°–5.87° for implicit baselines)
- Translation error: 0.06
- FID: 65.67 (vs. 74.67–89.17 for all other methods)
- SSIM: 0.75 (vs. 0.47–0.66 for all baselines)
- Dynamic score: 2.58 (highest across all tested systems)
The dynamic score is the clearest indicator of the hybrid advantage. Explicit memory baselines score between 0.41 and 0.68. MosaicMem scores 2.58, showing that patch based conditioning preserves the motion generation capacity of the underlying diffusion model.
Training and Data
The model fine-tunes Wan 2.2 5B using AdamW for 250,000 steps with an effective batch size of 64 on 8 H100 GPUs. A new benchmark called MosaicMEM-World supports training and evaluation with data from four sources: Unreal Engine 5 scenes, commercial game environments including Cyberpunk 2077, real world first person captures, and selected sequences from the Sekai dataset.
Depth estimation uses Depth Anything V3. Scene annotations come from Gemini 3 with a "static + dynamic" labeling scheme that describes first frame content and temporal dynamics separately. Code and data are planned for release.
Related Research
The spatial memory design in MosaicMem builds on a wave of world model research from the past several months. InSpatio-World takes reference video as input and generates a navigable 4D environment with temporal control. Lingbot World from Ant Group released a commercially licensed world simulator with comparable camera control. For long form video generation at the diffusion model level, Helios from Peking University achieved 19.5 FPS at 14 billion parameters without KV-cache. The MIND benchmark provides open tools for testing memory consistency across all these systems on the same criteria.
Sources
arXiv: MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Project Page: mosaicmem.github.io/mosaicmem
