The AI That Thinks Before It Edits Videos

Video editing models have faced a fundamental problem: you either get precision through external masks or flexibility through unified models. Never both. VideoCoF just changed that equation.
Released as open-source research from the University of Technology Sydney and Zhejiang University, VideoCoF introduces a Chain-of-Frames approach that makes the model reason about edits before executing them. The result is mask-free precision editing trained on just 50,000 video pairs.
The Core Problem With Current Video Editing
Traditional video editing AI follows two paths. Expert models deliver precision but require users to provide masks marking exactly what to edit, and this dependency on external guidance limits their practical use. Unified in-context learning models remove the mask requirement but struggle with spatial accuracy because they lack explicit cues about where to apply changes.
VideoCoF addresses this trade-off through a three-step process: see, reason, edit. The model first identifies the editing region through reasoning tokens before generating the target video. This approach eliminates mask dependency while maintaining precise instruction-to-region alignment.
How VideoCoF Works
The Chain-of-Frames method forces the video diffusion model to predict reasoning tokens (edit region latents) before generating target video tokens. These reasoning tokens act as the model's internal representation of where and how to edit.
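To make that ordering concrete, the sketch below assembles a Chain-of-Frames style token sequence in which source-video latents serve as context and reasoning (edit-region) latents come before the target-video latents that the diffusion model denoises. The shapes, names, and flattening scheme are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative dimensions (assumptions, not VideoCoF's actual configuration).
frames, tokens_per_frame, dim = 9, 1560, 64

# Clean source-video latents act as conditioning context.
source_latents = torch.randn(frames, tokens_per_frame, dim)

# Noisy latents the diffusion model must denoise:
#   1) reasoning tokens that localize the edit region,
#   2) target-video tokens for the edited result.
reasoning_latents = torch.randn(frames, tokens_per_frame, dim)
target_latents = torch.randn(frames, tokens_per_frame, dim)

# Chain-of-Frames ordering: see (source) -> reason (edit region) -> edit (target).
# The target tokens are therefore generated with the predicted edit region
# already in context.
sequence = torch.cat(
    [
        source_latents.flatten(0, 1),     # see
        reasoning_latents.flatten(0, 1),  # reason
        target_latents.flatten(0, 1),     # edit
    ],
    dim=0,
)
print(sequence.shape)  # (3 * frames * tokens_per_frame, dim)
```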
The system also uses a RoPE (Rotary Position Embedding) alignment strategy that leverages the reasoning tokens to ensure motion alignment. This enables length extrapolation beyond the training duration: models trained on 33-frame sequences can generalize to videos four times longer.
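As a rough illustration of why frame-level position indices allow extrapolation, the snippet below assigns temporal RoPE indices per frame and lets the reasoning tokens reuse the indices of the frames they describe. This is one plausible indexing scheme, sketched here as an assumption; VideoCoF's exact RoPE alignment may differ.

```python
def temporal_positions(num_frames: int) -> dict[str, list[int]]:
    """Assign per-frame temporal RoPE indices (an illustrative scheme).

    Reasoning tokens for frame t reuse frame t's index, so the edit-region
    cues stay aligned with the motion of the corresponding target frames.
    Because indices depend only on the frame count, the same scheme applies
    unchanged to clips longer than those seen in training.
    """
    frame_ids = list(range(num_frames))
    return {"source": frame_ids, "reasoning": frame_ids, "target": frame_ids}

print(temporal_positions(9))   # a training-length clip (in latent frames)
print(temporal_positions(36))  # a roughly 4x longer clip uses the same per-frame indices
```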
Training Efficiency and Performance
VideoCoF achieves state-of-the-art performance on VideoCoF-Bench using only 50,000 video pairs for training. This data efficiency comes from the reasoning-first approach, which helps the model learn spatial relationships more effectively than traditional methods.
The base model builds on Wan-2.1-T2V-14B pretrained weights and supports multiple editing tasks: object removal, object addition, object swapping, and local style transfer.
Object Removal Capabilities
The model handles precise object removal based on natural language descriptions. Users can remove specific people by describing their appearance, location, and clothing, and the system maintains scene coherence while removing the targeted elements.
Object Addition Features
VideoCoF can add new elements to existing scenes while maintaining temporal consistency and natural motion. The added objects interact realistically with the existing environment and follow appropriate movement patterns.
Object Swapping and Style Transfer
The model supports swapping specific objects or modifying attributes while preserving the overall scene structure and motion patterns. This includes changing clothing, materials, and textures.
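For a sense of the instruction format these tasks take, here are made-up example prompts for each edit type; they illustrate the natural-language interface and are not drawn from the VideoCoF repository.

```python
# Hypothetical instructions, one per supported editing task.
EXAMPLE_INSTRUCTIONS = {
    "object_removal": "Remove the man in the red jacket walking on the left side of the street.",
    "object_addition": "Add a golden retriever running alongside the cyclist.",
    "object_swapping": "Replace the woman's denim jacket with a black leather jacket.",
    "local_style_transfer": "Give the car a brushed-copper paint finish while keeping the background unchanged.",
}

for task, prompt in EXAMPLE_INSTRUCTIONS.items():
    print(f"{task}: {prompt}")
```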
Reasoning Visualization
The reasoning tokens provide insight into how the model identifies edit regions before applying changes, offering a window into its internal decision-making process.
Availability and Licensing
VideoCoF is available as open-source software under the Apache 2.0 license, which permits commercial use. The model weights are hosted on Hugging Face, and the inference code is available on GitHub.
The project requires the Wan-2.1-T2V-14B pretrained weights and the VideoCoF checkpoint. Installation involves setting up a Python 3.10 environment with PyTorch and additional dependencies. The researchers recommend FlashAttention-3 for optimal performance on NVIDIA H100/H800 GPUs.
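A quick sanity check like the one below can confirm a machine roughly matches that recommended setup before downloading the checkpoints. The version targets mirror the recommendations above; the script itself is a generic check, not part of the VideoCoF repository.

```python
import sys
import torch

# Compare the local environment against the recommended setup
# (Python 3.10, PyTorch with CUDA 12.1, or PyTorch 2.8 + CUDA 12.8 on Hopper).
print(f"Python  : {sys.version.split()[0]}")
print(f"PyTorch : {torch.__version__}")
print(f"CUDA    : {torch.version.cuda}")

if not torch.cuda.is_available():
    print("No CUDA device detected; VideoCoF inference needs a CUDA-capable GPU.")
else:
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU     : {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    if major >= 9:  # Hopper-class GPUs (H100/H800) report compute capability 9.x
        try:
            import flash_attn  # FlashAttention; FA3 builds may expose a different module name
            print(f"flash-attn installed (version {flash_attn.__version__})")
        except ImportError:
            print("flash-attn not found; FlashAttention-3 is recommended on H100/H800.")
```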
Users can run inference as a single task or in parallel across multiple GPUs, depending on their hardware setup. The repository includes scripts for object removal, object addition, and local style transfer.
Try AI FILMS Studio's video generation tools to explore similar AI-powered video capabilities.
Technical Requirements
The model runs on standard GPU configurations with CUDA 12.1 support. For Hopper architecture GPUs like the H100 or H800, PyTorch 2.8 with CUDA 12.8 provides faster inference. The system supports both single-GPU and multi-GPU setups through PyTorch's distributed interface.
Memory requirements scale with video length and resolution. The base configuration handles 33-frame sequences, with length extrapolation through the RoPE alignment strategy enabling processing of longer videos.
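To see roughly why memory grows with clip length, the estimate below counts transformer tokens for a clip, assuming a video VAE with 4x temporal and 8x spatial downsampling and a 2x2 spatial patch per token, which is typical of Wan-style backbones. The compression ratios and the 480x832 resolution are assumptions for illustration, not figures from this article.

```python
def latent_token_count(frames: int, height: int, width: int) -> int:
    """Rough diffusion-transformer token count for one clip.

    Assumes (illustratively, not as VideoCoF specifications) a causal VAE with
    4x temporal and 8x spatial downsampling plus a 2x2 spatial patch per token.
    """
    latent_frames = (frames - 1) // 4 + 1          # e.g. 33 frames -> 9 latent frames
    latent_h, latent_w = height // 8, width // 8
    return latent_frames * (latent_h // 2) * (latent_w // 2)

base = latent_token_count(33, 480, 832)     # training-length clip
longer = latent_token_count(129, 480, 832)  # roughly 4x more frames via RoPE extrapolation
print(base, longer)  # token count grows about linearly with frames,
                     # while self-attention cost grows roughly quadratically
```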
Research Implications
VideoCoF demonstrates that explicit reasoning steps can resolve the trade-off between precision and unification in video editing models. The data efficiency (50,000 training pairs) suggests this approach could extend to other video understanding and generation tasks.
The Chain-of-Frames paradigm introduces a structured way to handle temporal consistency in video editing. By predicting reasoning tokens before output tokens, the model builds an internal representation of spatial relationships that guides the editing process.
The project's open-source release enables researchers to build on this foundation. The Apache 2.0 license removes barriers to both academic research and commercial applications.
Paper: arXiv:2512.07469
Project Page: videocof.github.io
GitHub: github.com/knightyxp/VideoCoF
Model Weights: huggingface.co/XiangpengYang/VideoCoF


