Kiwi-Edit: Open Source Video Editing Framework
Kiwi-Edit is a new open source video editing framework from ShowLab at the National University of Singapore. Released under the MIT license, it combines text instruction guidance with reference image control to handle a wide range of editing tasks at 720p resolution.
What Kiwi-Edit Does
The framework covers both global and local video edits. On the global side, it applies style transfers, including cartoon, sketch, watercolor, and other visual aesthetics. On the local side, it handles object removal, object addition, object replacement, and background swaps. Beyond text instructions, users can supply a reference image to guide the visual output, which is particularly useful when language alone cannot describe the intended result.
The paper, "Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance," identifies a core problem with existing tools. Instruction based systems lack precise visual control because natural language cannot fully capture complex appearance details. Reference guided methods solve that, but they historically suffered from scarce paired training data. Kiwi-Edit addresses both sides by building a scalable pipeline that converts existing video editing pairs into reference training quadruplets using image generative models.
Architecture
The system builds on two core components: a Qwen2.5-VL-3B vision-language model that extracts semantic guidance from prompts and reference images, and a Wan2.2-TI2V-5B video diffusion transformer that handles the actual generation. Source video latents are injected to preserve structure, and learnable queries bring in reference visual features without requiring architectural changes to the base model.
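The learnable-query mechanism can be pictured as a small cross-attention step: a fixed set of trainable query tokens attends over the reference-image features and pools them into a fixed-size summary the diffusion transformer can consume. The sketch below is a simplified NumPy illustration of that idea; the dimensions, random initialization, and softmax attention here are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64         # feature dimension (illustrative, not the paper's actual size)
n_queries = 8  # number of learnable query tokens (assumed)
n_ref = 256    # number of reference-image feature tokens (assumed)

# Learnable queries (randomly initialized here; trained in the real system).
queries = rng.standard_normal((n_queries, d))
# Stand-in for reference features, e.g. from the vision-language encoder.
ref_features = rng.standard_normal((n_ref, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: each query pools a weighted mix of
# reference features, yielding a fixed-size conditioning signal
# regardless of how many reference tokens come in.
scores = queries @ ref_features.T / np.sqrt(d)   # (n_queries, n_ref)
weights = softmax(scores, axis=-1)               # rows sum to 1
ref_summary = weights @ ref_features             # (n_queries, d)

print(ref_summary.shape)  # (8, 64)
```

Because the queries are extra inputs rather than new layers, conditioning on a reference image in this style leaves the base model's architecture untouched, which is consistent with the framework's claim of requiring no architectural changes.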
Training follows a three stage curriculum. Stage one covers image editing only. Stage two adds video editing at multiple resolutions. Stage three introduces reference guided video editing on the RefVIE dataset the team built for this work. The final training set contains 477,000 high quality quadruplets drawn from a raw pool of 3.7 million samples from Ditto-1M, ReCo, and OpenVE-3M.
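Distilling 477,000 quadruplets from a 3.7 million sample pool implies a quality-filtering step. The snippet below sketches what such a filter might look like in plain Python; the field names, scores, and threshold are hypothetical stand-ins, not the paper's actual selection criteria.

```python
# Hypothetical quality filter over reference training quadruplets:
# (source video, instruction, reference image, edited video), each with
# a quality score (e.g. from an automated judge). All names and the
# threshold below are assumptions for illustration.
raw_pool = [
    {"source": "clip_001.mp4", "instruction": "remove the bird",
     "reference": "ref_001.png", "edited": "clip_001_edit.mp4", "score": 0.91},
    {"source": "clip_002.mp4", "instruction": "cartoon style",
     "reference": "ref_002.png", "edited": "clip_002_edit.mp4", "score": 0.42},
    {"source": "clip_003.mp4", "instruction": "replace background",
     "reference": "ref_003.png", "edited": "clip_003_edit.mp4", "score": 0.77},
]

QUALITY_THRESHOLD = 0.7  # assumed cutoff

def filter_quadruplets(pool, threshold=QUALITY_THRESHOLD):
    """Keep only quadruplets whose quality score clears the threshold."""
    return [q for q in pool if q["score"] >= threshold]

kept = filter_quadruplets(raw_pool)
print(len(kept))  # 2 of the 3 toy samples survive
```

At the reported scale, a filter of roughly this shape would retain about 13 percent of the raw pool (477K of 3.7M), so the scoring signal, whatever its actual form, is doing most of the curation work.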
Benchmark Results
On OpenVE-Bench, evaluated by Gemini-2.5-Pro, the Stage-3 model scores 3.02 overall, the highest among open source methods. The breakdown: 3.83 for local changes and 3.64 for global style. On RefVIE-Bench, the system shows competitive results on identity consistency, temporal consistency, and reference similarity metrics.
Open Source and Commercial Use
The code, models, and datasets are all publicly available. The repository is licensed under MIT, which permits commercial use with no restrictions. Three Diffusers-compatible model variants are on Hugging Face: instruction-only, reference-only, and combined instruction-plus-reference. The full 5B parameter model runs at up to 81 frames and 1280x720 resolution.
For filmmakers evaluating open source video editing tools, the MIT license and Diffusers support make Kiwi-Edit straightforward to integrate. It builds on the same Wan2.2 base used by other recent open source video tools, which reduces infrastructure overhead if that model is already in use. Explore video editing capabilities at AI FILMS Studio.
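For orientation, an integration call might look roughly like the sketch below. The pipeline class, repository ID, and argument names are assumptions based on typical Diffusers conventions; consult the Kiwi-Edit repository for the actual API before use.

```python
# Hypothetical Diffusers integration sketch. Only the frame count (81)
# and resolution (1280x720) come from the project's stated limits; the
# model ID, pipeline class, and keyword names are assumptions.
generation_kwargs = {
    "prompt": "Apply the watercolor style to this video.",
    "num_frames": 81,   # maximum reported by the project
    "height": 720,
    "width": 1280,
}

def run_edit(source_video, reference_image=None):
    """Sketch of how a Diffusers-compatible call might look (untested)."""
    from diffusers import DiffusionPipeline  # deferred: diffusers may not be installed
    # "showlab/Kiwi-Edit" is a hypothetical repo ID, not a verified one.
    pipe = DiffusionPipeline.from_pretrained("showlab/Kiwi-Edit")
    kwargs = dict(generation_kwargs, video=source_video)
    if reference_image is not None:
        kwargs["reference_image"] = reference_image
    return pipe(**kwargs)

print(generation_kwargs["num_frames"])  # 81
```

The optional `reference_image` argument mirrors the framework's design: text-only instruction editing works on its own, and a reference image is layered on only when language alone cannot pin down the intended appearance.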
Video Examples
The following pairs show each edit alongside the original.
Global Changes
Prompt: "Apply the night aesthetic to this video."
Global Style
Prompt: "Apply the Cartoon animation style to this video."
Prompt: "Apply the sunny aesthetic to this video."
Background Replace
Prompt: "Replace the background with a lively urban garden scene during winter."
Local Remove
Prompt: "Remove the small bird."
Local Add
Prompt: "Add an floor lamp near the corner of the room in front of the desk behind the man."
Local Replace
Prompt: "Transform the entire bread surface into a miniature lush forest with green trees, moss, and tiny flowers shaped exactly like the loaf."
Reference Image Guided Editing
When a reference image is provided alongside the prompt, Kiwi-Edit transfers visual attributes from the reference into the edited video. This enables precise control over costume, texture, and background style without relying solely on text description.
Subject Edit
Prompt: "Replace the girl's cloth with a matte black techwear tactical jacket."
Background Replace With Reference
Prompt: "Replace the background with a whimsical fairytale landscape, featuring colorful mushroom houses and a floating castle in a magical forest."
How It Compares
Kiwi-Edit is not the first instruction guided video editor. Ditto, the dataset project behind the Editto editing model, generated 1 million synthetic video editing pairs, contributing to a broader push toward data driven video editing. VideoCoF took a different approach, using chain of frames reasoning to achieve mask free precision on just 50,000 training pairs. Kiwi-Edit distinguishes itself by combining both modalities, instruction text and reference images, within a single architecture, and by releasing a dedicated benchmark and dataset for reference guided video editing.
The Wan2.2 base model also appears in other recent tools covered on this site, including Wan2.2-Animate, which uses the same foundation for character animation and replacement. First Frame Go explored how first-frame conditioning in Wan-type models enables strong customization with minimal examples.
For automated scene segmentation before applying edits, Meta SAM 3 provides text-driven object tracking that can feed directly into editing pipelines.
Access and Resources
- Code: github.com/showlab/Kiwi-Edit (MIT license)
- Models: Hugging Face collection
- Dataset: kiwi_edit_training_data
- Paper: arXiv:2603.02175
Generate and edit videos directly in AI FILMS Studio.
Sources
ShowLab, National University of Singapore | arXiv | Hugging Face

