First Frame Go: Video Customization Using Memory Buffer Method

Researchers discovered video generation models implicitly use the first frame as a conceptual memory buffer. This insight enables customization across scenarios using minimal training data.
What First Frame Go Does
The method achieves video content customization by leveraging how models process initial frames. Traditional approaches view the first frame as a temporal starting point. This research shows models actually store visual entities in that frame for later reuse during generation.
The technique requires only 20-50 training examples, no architectural modifications, and no large-scale fine-tuning.
It is built on Wan2.2-I2V-A14B, a 14-billion-parameter image-to-video model. The researchers apply lightweight LoRA adapters to invoke the model's subject-mixing and scene-transition capabilities.
Technical Implementation
The system processes 81-frame videos at 1280×720 resolution. The first 4 frames serve as subject-mixing inputs, aligned with the 3D VAE's temporal compression ratio, and the remaining 77 frames comprise the generated output.
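To make the frame accounting concrete, here is a small sketch of the arithmetic behind the 81-frame clip and its 4-frame conditioning prefix. The 1 + (T − 1)/4 latent-frame formula is an assumption about the Wan-style 3D VAE, not a figure taken from the paper.

```python
# Sketch of the frame accounting described above. The latent-frame formula
# (1 + (T - 1) / stride) is an assumption about the Wan-style 3D VAE, not a
# value reported by the paper.

TOTAL_FRAMES = 81          # clip length used at full resolution
CONDITIONING_FRAMES = 4    # first frames that hold the composed entities
VAE_TEMPORAL_STRIDE = 4    # assumed temporal compression ratio of the 3D VAE

def latent_frames(num_pixel_frames: int, stride: int = VAE_TEMPORAL_STRIDE) -> int:
    """Latent time steps a stride-4 causal 3D VAE produces for a pixel-space clip."""
    assert (num_pixel_frames - 1) % stride == 0, "clip length should be 1 + k * stride"
    return 1 + (num_pixel_frames - 1) // stride

output_frames = TOTAL_FRAMES - CONDITIONING_FRAMES   # 77 generated frames
print(output_frames, latent_frames(TOTAL_FRAMES))    # 77, 21
```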
Example outputs include multi-character anime scene generation and a product demonstration scenario.
Hardware requirements: an H200 GPU with 141 GB of GPU memory for full-resolution inference. Lower resolutions (640×480) run on standard hardware but produce noticeably different output characteristics.
The method uses three components:
- Base Wan2.2-I2V-A14B model for image-to-video generation
- LoRA adapters trained on specific visual entities
- First frame composition system for entity placement
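As a rough illustration of how these components fit together, the sketch below loads the base model through the Diffusers WanImageToVideoPipeline, attaches one entity LoRA, and generates a clip from a composed first frame. The model ID, LoRA path, prompt, and settings are illustrative assumptions, not the authors' release code.

```python
# Minimal sketch (not the authors' release code): base I2V model + entity LoRA +
# composed first frame. Model id, LoRA path, and settings are assumptions.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",   # assumed Diffusers port of the base model
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("entity_lora.safetensors")  # hypothetical trained adapter
pipe.to("cuda")

first_frame = load_image("composed_first_frame.png")  # entities placed on a background
video = pipe(
    image=first_frame,
    prompt="The two characters walk toward each other and shake hands.",
    height=720,
    width=1280,
    num_frames=81,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```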
Application Scenarios
The research demonstrates 31 distinct use cases:
Animation and Characters: Five-character anime crossovers, historical figure meetings, character showdowns with precise pose control.
Product Showcasing: Multi-product comparisons, hand-off demonstrations, commercial-quality presentations with accurate brand representation.
Autonomous Driving: Multi-vehicle scenarios, rare edge cases with aircraft, emergency response simulations with proper vehicle physics.
Robotic Applications: Manipulation tasks, liquid pouring, pick-and-place operations with precise gripper control.
First-Person Perspectives: Driving simulations, underwater swimming and exploration with multiple entities, shooter gameplay with accurate FOV rendering.
Complex Interactions: Robot-human handshakes, human-animal coordination, character-to-character object transfers.
Baseline Comparisons
The research team compared against three systems built on the same 14B parameter architecture:
- Wan2.2-14B-I2V: Base model with standard I2V capabilities
- VACE: Specialized video customization approach
- SkyReels-A2: Alternative reference-based method
First Frame Go handles scenarios with 5-6 reference objects where the baseline methods fail, maintains consistency across longer sequences, and preserves entity details during complex motions.
Training Efficiency
The lightweight approach uses 20-50 examples per entity. Training produces LoRA adapters rather than full model weights. This reduces storage requirements and enables modular entity libraries.
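For a sense of why adapters rather than full weights keep storage manageable, here is an illustrative PEFT configuration; the rank, alpha, and target modules are assumptions for the sketch, not the paper's training recipe.

```python
# Illustrative LoRA setup (rank, alpha, and target modules are assumptions, not the
# paper's recipe). Only the small low-rank adapter matrices are trained and stored,
# so each entity adds megabytes-to-hundreds-of-megabytes, not a full 14B checkpoint.
from peft import LoraConfig

entity_lora = LoraConfig(
    r=32,                                                   # low-rank dimension
    lora_alpha=32,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],    # attention projections
)
```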
Test data preparation involves:
- Segment target entities as RGBA layers
- Compose with background scenes
- Generate captions describing desired motion
- Run inference with composed first frame
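A minimal sketch of the composition step using Pillow, with placeholder file names and positions: segmented RGBA entities are pasted onto a background to form the first frame that drives inference.

```python
# Sketch of first-frame composition with Pillow (file names and positions are placeholders).
from PIL import Image

background = Image.open("background_scene.png").convert("RGBA")
character = Image.open("character_rgba.png")   # segmented entity with alpha channel
product = Image.open("product_rgba.png")

# Paste each entity using its own alpha channel as the mask.
background.paste(character, (120, 300), mask=character)
background.paste(product, (820, 450), mask=product)

background.convert("RGB").save("composed_first_frame.png")  # input to the I2V model
```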
The team removed all test data involving personal portrait rights from public release.
Limitations and Requirements
Full-resolution outputs (1280×720, 81 frames) require H200 GPUs; A100 or H100 systems need CPU offload enabled. Lower resolutions produce significantly different content characteristics.
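With the Diffusers-style pipeline assumed in the earlier sketch, CPU offload is a one-line change; this shows the general mechanism, not the authors' documented setup.

```python
# Assuming the Diffusers pipeline from the earlier sketch: call this instead of
# pipe.to("cuda"). Submodules are moved to the GPU only while they run, trading
# speed for a lower peak memory footprint on A100/H100-class cards.
pipe.enable_model_cpu_offload()
```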
The method inherits limitations from the base Wan2.2 model. Motion quality depends on training data distribution. Entity interactions follow learned physical constraints.
Research Availability
- Code: GitHub repository
- Paper: Hugging Face
- LoRA adapters: available on Hugging Face
- Base model: Wan2.2-I2V-A14B on Hugging Face and ModelScope
The project includes example test data, material assets, and combination scripts for creating custom first frames. Setup requires a Python 3.11 environment.
What This Means for Video Production
The conceptual memory buffer insight changes how we approach video customization. Instead of training entire models for specific entities, creators can build LoRA adapter libraries.
This enables:
- Consistent character appearance across shots
- Product placement with brand accuracy
- Scenario testing for autonomous systems
- Mixed reality concept visualization
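One way to picture such an adapter library, assuming a Diffusers-style LoRA integration that supports multiple named adapters (an assumption, not something the release confirms): load each entity adapter by name and activate the subset needed for a given shot.

```python
# Hypothetical entity library built on the pipeline from the earlier sketches.
# Adapter names, paths, and weights are illustrative; multi-adapter activation
# depends on the pipeline's Diffusers LoRA support.
ENTITY_LIBRARY = {
    "hero_character": "loras/hero_character.safetensors",
    "brand_product": "loras/brand_product.safetensors",
}

for name, path in ENTITY_LIBRARY.items():
    pipe.load_lora_weights(path, adapter_name=name)

# Activate only the entities needed for this shot and weight their influence.
pipe.set_adapters(["hero_character", "brand_product"], adapter_weights=[1.0, 0.8])
```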
The 20-50 example requirement makes custom entity creation practical: no specialized ML infrastructure is needed for training, and standard video editing workflows can prepare the data.
The method shows that video models already contain sophisticated entity management capabilities. The first frame acts as an instruction set, telling the model which stored concepts to activate and how to compose them spatially.
Future work may reduce hardware requirements through quantization or architectural optimizations. The conceptual framework applies to any autoregressive video model using first frame conditioning.
Sources:
- First Frame Go Project Page: https://firstframego.github.io/
- Research Paper: https://huggingface.co/papers/2511.15700
- GitHub Repository: https://github.com/zli12321/FFGO-Video-Customization


