First Frame Go: Video Customization Using Memory Buffer Method

Researchers discovered video generation models implicitly use the first frame as a conceptual memory buffer. This insight enables customization across scenarios using minimal training data.
What First Frame Go Does
The method achieves video content customization by leveraging how models process initial frames. Traditional approaches view the first frame as a temporal starting point. This research shows models actually store visual entities in that frame for later reuse during generation.
The technique requires only 20-50 training examples, no architectural modifications, and no large-scale fine-tuning.
It is built on Wan2.2-I2V-A14B, a 14-billion-parameter image-to-video model. The researchers apply lightweight LoRA adapters to invoke the model's subject-mixing and scene-transition capabilities.
Technical Implementation
The system processes 81-frame videos at 1280×720 resolution. The first 4 frames serve as subject-mixing inputs, aligned with the 3D VAE's temporal compression ratio, and the remaining 77 frames comprise the generated output.
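To make the frame accounting concrete, here is a small sketch of the arithmetic behind the 81-frame clip and its 4-frame conditioning prefix. The 1 + (T − 1)/4 latent-frame formula is an assumption about the Wan-style 3D VAE, not a figure taken from the paper.

```python
# Sketch of the frame accounting described above. The latent-frame formula
# (1 + (T - 1) / stride) is an assumption about the Wan-style 3D VAE, not a
# value reported by the paper.

TOTAL_FRAMES = 81          # clip length used at full resolution
CONDITIONING_FRAMES = 4    # first frames that hold the composed entities
VAE_TEMPORAL_STRIDE = 4    # assumed temporal compression ratio of the 3D VAE

def latent_frames(num_pixel_frames: int, stride: int = VAE_TEMPORAL_STRIDE) -> int:
    """Latent time steps a stride-4 causal 3D VAE produces for a pixel-space clip."""
    assert (num_pixel_frames - 1) % stride == 0, "clip length should be 1 + k * stride"
    return 1 + (num_pixel_frames - 1) // stride

output_frames = TOTAL_FRAMES - CONDITIONING_FRAMES   # 77 generated frames
print(output_frames, latent_frames(TOTAL_FRAMES))    # 77, 21
```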
Example outputs include multi-character anime scene generation and a product demonstration scenario.
Hardware requirements: an H200 GPU with 141 GB of GPU memory for full-resolution inference. Lower resolutions (640×480) run on standard hardware but produce noticeably different output characteristics.
The method uses three components:
- Base Wan2.2-I2V-A14B model for image-to-video generation
- LoRA adapters trained on specific visual entities
- First frame composition system for entity placement
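As a rough illustration of how these components fit together, the sketch below loads the base model through the Diffusers WanImageToVideoPipeline, attaches one entity LoRA, and generates a clip from a composed first frame. The model ID, LoRA path, prompt, and settings are illustrative assumptions, not the authors' release code.

```python
# Minimal sketch (not the authors' release code): base I2V model + entity LoRA +
# composed first frame. Model id, LoRA path, and settings are assumptions.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",   # assumed Diffusers port of the base model
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("entity_lora.safetensors")  # hypothetical trained adapter
pipe.to("cuda")

first_frame = load_image("composed_first_frame.png")  # entities placed on a background
video = pipe(
    image=first_frame,
    prompt="The two characters walk toward each other and shake hands.",
    height=720,
    width=1280,
    num_frames=81,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```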
Application Scenarios
The research demonstrates 31 distinct use cases:
Animation and Characters: Five-character anime crossovers, historical figure meetings, character showdowns with precise pose control.
Product Showcasing: Multi-product comparisons, hand-off demonstrations, commercial-quality presentations with accurate brand representation.
Autonomous Driving: Multi-vehicle scenarios, rare edge cases with aircraft, emergency response simulations with proper vehicle physics.
Robotic Applications: Manipulation tasks, liquid pouring, pick-and-place operations with precise gripper control.
First-Person Perspectives: Driving simulations, underwater swimming and exploration with multiple entities, shooter gameplay with accurate FOV rendering.
Complex Interactions: Robot-human handshakes, human-animal coordination, character-to-character object transfers.
Baseline Comparisons
The research team compared against three systems built on the same 14B parameter architecture:
- Wan2.2-14B-I2V: Base model with standard I2V capabilities
- VACE: Specialized video customization approach
- SkyReels-A2: Alternative reference-based method
First Frame Go handles scenarios with 5-6 reference objects where the baseline methods fail, maintains consistency across longer sequences, and preserves entity details during complex motions.
Training Efficiency
The lightweight approach uses 20-50 examples per entity. Training produces LoRA adapters rather than full model weights. This reduces storage requirements and enables modular entity libraries.
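For a sense of why adapters rather than full weights keep storage manageable, here is an illustrative PEFT configuration; the rank, alpha, and target modules are assumptions for the sketch, not the paper's training recipe.

```python
# Illustrative LoRA setup (rank, alpha, and target modules are assumptions, not the
# paper's recipe). Only the small low-rank adapter matrices are trained and stored,
# so each entity adds megabytes-to-hundreds-of-megabytes, not a full 14B checkpoint.
from peft import LoraConfig

entity_lora = LoraConfig(
    r=32,                                                   # low-rank dimension
    lora_alpha=32,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],    # attention projections
)
```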
Test data preparation involves:
- Segment target entities as RGBA layers
- Compose with background scenes
- Generate captions describing desired motion
- Run inference with composed first frame
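A minimal sketch of the composition step using Pillow, with placeholder file names and positions: segmented RGBA entities are pasted onto a background to form the first frame that drives inference.

```python
# Sketch of first-frame composition with Pillow (file names and positions are placeholders).
from PIL import Image

background = Image.open("background_scene.png").convert("RGBA")
character = Image.open("character_rgba.png")   # segmented entity with alpha channel
product = Image.open("product_rgba.png")

# Paste each entity using its own alpha channel as the mask.
background.paste(character, (120, 300), mask=character)
background.paste(product, (820, 450), mask=product)

background.convert("RGB").save("composed_first_frame.png")  # input to the I2V model
```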
The team removed all test data involving personal portrait rights from public release.
Limitations and Requirements
Full-resolution outputs (1280×720, 81 frames) require H200 GPUs; A100 or H100 systems need CPU offload enabled. Lower resolutions produce significantly different content characteristics.
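With the Diffusers-style pipeline assumed in the earlier sketch, CPU offload is a one-line change; this shows the general mechanism, not the authors' documented setup.

```python
# Assuming the Diffusers pipeline from the earlier sketch: call this instead of
# pipe.to("cuda"). Submodules are moved to the GPU only while they run, trading
# speed for a lower peak memory footprint on A100/H100-class cards.
pipe.enable_model_cpu_offload()
```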
The method inherits limitations from the base Wan2.2 model. Motion quality depends on training data distribution. Entity interactions follow learned physical constraints.
Research Availability
- Code: GitHub repository
- Paper: Hugging Face
- LoRA adapters: available on Hugging Face
- Base model: Wan2.2-I2V-A14B on Hugging Face and ModelScope
The project includes example test data, material assets, and combination scripts for creating custom first frames. Setup requires a Python 3.11 environment.
What This Means for Video Production
The conceptual memory buffer insight changes how we approach video customization. Instead of training entire models for specific entities, creators can build LoRA adapter libraries.
This enables:
- Consistent character appearance across shots
- Product placement with brand accuracy
- Scenario testing for autonomous systems
- Mixed reality concept visualization
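One way to picture such an adapter library, assuming a Diffusers-style LoRA integration that supports multiple named adapters (an assumption, not something the release confirms): load each entity adapter by name and activate the subset needed for a given shot.

```python
# Hypothetical entity library built on the pipeline from the earlier sketches.
# Adapter names, paths, and weights are illustrative; multi-adapter activation
# depends on the pipeline's Diffusers LoRA support.
ENTITY_LIBRARY = {
    "hero_character": "loras/hero_character.safetensors",
    "brand_product": "loras/brand_product.safetensors",
}

for name, path in ENTITY_LIBRARY.items():
    pipe.load_lora_weights(path, adapter_name=name)

# Activate only the entities needed for this shot and weight their influence.
pipe.set_adapters(["hero_character", "brand_product"], adapter_weights=[1.0, 0.8])
```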
The 20-50 example requirement makes custom entity creation practical: no specialized ML infrastructure is needed for training, and standard video editing workflows can prepare the data.
The method shows that video models already contain sophisticated entity management capabilities. The first frame acts as an instruction set, telling the model which stored concepts to activate and how to compose them spatially.
Future work may reduce hardware requirements through quantization or architectural optimizations. The conceptual framework applies to any autoregressive video model using first frame conditioning.
Sources:
- First Frame Go Project Page: https://firstframego.github.io/
- Research Paper: https://huggingface.co/papers/2511.15700
- GitHub Repository: https://github.com/zli12321/FFGO-Video-Customization


