The AI That Thinks Before It Edits Videos

Video editing models have faced a fundamental problem: you either get precision through external masks or flexibility through unified models. Never both. VideoCoF just changed that equation.
Released as open-source research from the University of Technology Sydney and Zhejiang University, VideoCoF introduces a Chain-of-Frames approach that makes the model reason about edits before executing them. The result is mask-free precision editing trained on just 50,000 video pairs.
The Core Problem With Current Video Editing
Traditional video editing AI follows two paths. Expert models deliver precision but require users to provide masks marking exactly what to edit, and this dependency on external guidance limits their practical use. Unified in-context learning models remove the mask requirement but struggle with spatial accuracy because they lack explicit cues about where to apply changes.
VideoCoF addresses this trade-off through a three-step process: see, reason, edit. The model first identifies the editing region through reasoning tokens before generating the target video. This approach eliminates mask dependency while maintaining precise instruction-to-region alignment.
How VideoCoF Works
The Chain-of-Frames method forces the video diffusion model to predict reasoning tokens (edit region latents) before generating target video tokens. These reasoning tokens act as the model's internal representation of where and how to edit.
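To make that ordering concrete, the sketch below assembles a Chain-of-Frames style token sequence in which source-video latents serve as context and reasoning (edit-region) latents come before the target-video latents that the diffusion model denoises. The shapes, names, and flattening scheme are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative dimensions (assumptions, not VideoCoF's actual configuration).
frames, tokens_per_frame, dim = 9, 1560, 64

# Clean source-video latents act as conditioning context.
source_latents = torch.randn(frames, tokens_per_frame, dim)

# Noisy latents the diffusion model must denoise:
#   1) reasoning tokens that localize the edit region,
#   2) target-video tokens for the edited result.
reasoning_latents = torch.randn(frames, tokens_per_frame, dim)
target_latents = torch.randn(frames, tokens_per_frame, dim)

# Chain-of-Frames ordering: see (source) -> reason (edit region) -> edit (target).
# The target tokens are therefore generated with the predicted edit region
# already in context.
sequence = torch.cat(
    [
        source_latents.flatten(0, 1),     # see
        reasoning_latents.flatten(0, 1),  # reason
        target_latents.flatten(0, 1),     # edit
    ],
    dim=0,
)
print(sequence.shape)  # (3 * frames * tokens_per_frame, dim)
```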
The system also uses a RoPE (Rotary Position Embedding) alignment strategy that leverages the reasoning tokens to ensure motion alignment. This enables length extrapolation beyond the training duration: models trained on 33-frame sequences can generalize to videos four times longer.
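As a rough illustration of why frame-level position indices allow extrapolation, the snippet below assigns temporal RoPE indices per frame and lets the reasoning tokens reuse the indices of the frames they describe. This is one plausible indexing scheme, sketched here as an assumption; VideoCoF's exact RoPE alignment may differ.

```python
def temporal_positions(num_frames: int) -> dict[str, list[int]]:
    """Assign per-frame temporal RoPE indices (an illustrative scheme).

    Reasoning tokens for frame t reuse frame t's index, so the edit-region
    cues stay aligned with the motion of the corresponding target frames.
    Because indices depend only on the frame count, the same scheme applies
    unchanged to clips longer than those seen in training.
    """
    frame_ids = list(range(num_frames))
    return {"source": frame_ids, "reasoning": frame_ids, "target": frame_ids}

print(temporal_positions(9))   # a training-length clip (in latent frames)
print(temporal_positions(36))  # a roughly 4x longer clip uses the same per-frame indices
```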
Training Efficiency and Performance
VideoCoF achieves state-of-the-art performance on VideoCoF-Bench using only 50,000 video pairs for training. This data efficiency comes from the reasoning-first approach, which helps the model learn spatial relationships more effectively than traditional methods.
The base model builds on Wan-2.1-T2V-14B pretrained weights and supports multiple editing tasks: object removal, object addition, object swapping, and local style transfer.
Object Removal Capabilities
The model handles precise object removal based on natural language descriptions. Users can remove specific people by describing their appearance, location, and clothing, and the system maintains scene coherence while removing the targeted elements.
Object Addition Features
VideoCoF can add new elements to existing scenes while maintaining temporal consistency and natural motion. The added objects interact realistically with the existing environment and follow appropriate movement patterns.
Object Swapping and Style Transfer
The model supports swapping specific objects or modifying attributes while preserving the overall scene structure and motion patterns. This includes changing clothing, materials, and textures.
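For a sense of the instruction format these tasks take, here are made-up example prompts for each edit type; they illustrate the natural-language interface and are not drawn from the VideoCoF repository.

```python
# Hypothetical instructions, one per supported editing task.
EXAMPLE_INSTRUCTIONS = {
    "object_removal": "Remove the man in the red jacket walking on the left side of the street.",
    "object_addition": "Add a golden retriever running alongside the cyclist.",
    "object_swapping": "Replace the woman's denim jacket with a black leather jacket.",
    "local_style_transfer": "Give the car a brushed-copper paint finish while keeping the background unchanged.",
}

for task, prompt in EXAMPLE_INSTRUCTIONS.items():
    print(f"{task}: {prompt}")
```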
Reasoning Visualization
The reasoning tokens provide insight into how the model identifies edit regions before applying changes, offering a window into its internal decision-making process.
Availability and Licensing
VideoCoF is available as open-source software under the Apache 2.0 license, which permits commercial use. The model weights are hosted on Hugging Face, and the inference code is available on GitHub.
The project requires the Wan-2.1-T2V-14B pretrained weights and the VideoCoF checkpoint. Installation involves setting up a Python 3.10 environment with PyTorch and additional dependencies. The researchers recommend FlashAttention-3 for optimal performance on NVIDIA H100/H800 GPUs.
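A quick sanity check like the one below can confirm a machine roughly matches that recommended setup before downloading the checkpoints. The version targets mirror the recommendations above; the script itself is a generic check, not part of the VideoCoF repository.

```python
import sys
import torch

# Compare the local environment against the recommended setup
# (Python 3.10, PyTorch with CUDA 12.1, or PyTorch 2.8 + CUDA 12.8 on Hopper).
print(f"Python  : {sys.version.split()[0]}")
print(f"PyTorch : {torch.__version__}")
print(f"CUDA    : {torch.version.cuda}")

if not torch.cuda.is_available():
    print("No CUDA device detected; VideoCoF inference needs a CUDA-capable GPU.")
else:
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU     : {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    if major >= 9:  # Hopper-class GPUs (H100/H800) report compute capability 9.x
        try:
            import flash_attn  # FlashAttention; FA3 builds may expose a different module name
            print(f"flash-attn installed (version {flash_attn.__version__})")
        except ImportError:
            print("flash-attn not found; FlashAttention-3 is recommended on H100/H800.")
```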
Users can run inference as a single task or in parallel across multiple GPUs, depending on their hardware setup. The repository includes scripts for object removal, object addition, and local style transfer.
Try AI FILMS Studio's video generation tools to explore similar AI-powered video capabilities.
Technical Requirements
The model runs on standard GPU configurations with CUDA 12.1 support. For Hopper architecture GPUs like the H100 or H800, PyTorch 2.8 with CUDA 12.8 provides faster inference. The system supports both single-GPU and multi-GPU setups through PyTorch's distributed interface.
Memory requirements scale with video length and resolution. The base configuration handles 33-frame sequences, with length extrapolation through the RoPE alignment strategy enabling processing of longer videos.
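To see roughly why memory grows with clip length, the estimate below counts transformer tokens for a clip, assuming a video VAE with 4x temporal and 8x spatial downsampling and a 2x2 spatial patch per token, which is typical of Wan-style backbones. The compression ratios and the 480x832 resolution are assumptions for illustration, not figures from this article.

```python
def latent_token_count(frames: int, height: int, width: int) -> int:
    """Rough diffusion-transformer token count for one clip.

    Assumes (illustratively, not as VideoCoF specifications) a causal VAE with
    4x temporal and 8x spatial downsampling plus a 2x2 spatial patch per token.
    """
    latent_frames = (frames - 1) // 4 + 1          # e.g. 33 frames -> 9 latent frames
    latent_h, latent_w = height // 8, width // 8
    return latent_frames * (latent_h // 2) * (latent_w // 2)

base = latent_token_count(33, 480, 832)     # training-length clip
longer = latent_token_count(129, 480, 832)  # roughly 4x more frames via RoPE extrapolation
print(base, longer)  # token count grows about linearly with frames,
                     # while self-attention cost grows roughly quadratically
```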
Research Implications
VideoCoF demonstrates that explicit reasoning steps can resolve the trade-off between precision and unification in video editing models. The data efficiency (50,000 training pairs) suggests this approach could extend to other video understanding and generation tasks.
The Chain-of-Frames paradigm introduces a structured way to handle temporal consistency in video editing. By predicting reasoning tokens before output tokens, the model builds an internal representation of spatial relationships that guides the editing process.
The project's open-source release enables researchers to build on this foundation. The Apache 2.0 license removes barriers to both academic research and commercial applications.
Paper: arXiv:2512.07469
Project Page: videocof.github.io
GitHub: github.com/knightyxp/VideoCoF
Model Weights: huggingface.co/XiangpengYang/VideoCoF


