
Kiwi-Edit: Open Source Video Editing Framework

March 3, 2026

Kiwi-Edit is a new open source video editing framework from ShowLab at the National University of Singapore. Released under the MIT license, it combines text instruction guidance with reference image control to handle a wide range of editing tasks at 720p resolution.

What Kiwi-Edit Does

The framework covers both global and local video edits. On the global side, it applies style transfers, including cartoon, sketch, watercolor, and other visual aesthetics. On the local side, it handles object removal, object addition, object replacement, and background swaps. Beyond text instructions, users can supply a reference image to guide the visual output, which is particularly useful when language alone cannot describe the intended result.

The paper, "Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance," identifies a core problem with existing tools. Instruction-based systems lack precise visual control because natural language cannot fully capture complex appearance details. Reference-guided methods solve that, but they have historically suffered from scarce paired training data. Kiwi-Edit addresses both sides by building a scalable pipeline that converts existing video editing pairs into reference training quadruplets using image generative models.

Architecture

The system builds on two core components: a Qwen2.5-VL-3B vision-language model that extracts semantic guidance from prompts and reference images, and a Wan2.2-TI2V-5B video diffusion transformer that handles the actual generation. Source video latents are injected to preserve structure, and learnable queries bring in reference visual features without requiring architectural changes to the base model.

Training follows a three-stage curriculum. Stage one covers image editing only. Stage two adds video editing at multiple resolutions. Stage three introduces reference-guided video editing on the RefVIE dataset the team built for this work. The final training set contains 477,000 high-quality quadruplets drawn from a raw pool of 3.7 million samples from Ditto-1M, ReCo, and OpenVE-3M.
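Each RefVIE sample pairs four elements: the source video, the text instruction, the reference image, and the edited target video. A minimal sketch of such a record, with illustrative field names and paths (not taken from the released dataset), might look like this:

```python
from dataclasses import dataclass


@dataclass
class RefVIEQuadruplet:
    """One reference-guided editing sample (field names are illustrative)."""
    source_video: str      # path to the unedited source clip
    instruction: str       # natural-language edit instruction
    reference_image: str   # path to the image guiding the visual output
    edited_video: str      # path to the edited target clip


# Hypothetical record mirroring the subject-edit prompt shown later in the post
sample = RefVIEQuadruplet(
    source_video="clips/girl_walking.mp4",
    instruction="Replace the girl's cloth with a matte black techwear tactical jacket.",
    reference_image="refs/techwear_jacket.jpg",
    edited_video="clips/girl_walking_edited.mp4",
)
```

The scalable pipeline described above produces records of this shape by generating the reference image from existing (source, instruction, edited) triplets with an image generative model.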

Benchmark Results

On OpenVE-Bench, evaluated by Gemini-2.5-Pro, the Stage-3 model scores 3.02 overall, the highest among open source methods. The breakdown: 3.83 for local changes and 3.64 for global style. On RefVIE-Bench, the system shows competitive results on identity consistency, temporal consistency, and reference similarity metrics.

Open Source and Commercial Use

The code, models, and datasets are all publicly available. The repository is licensed under MIT, which permits commercial use with no restrictions. Three Diffusers-compatible model variants are on Hugging Face: instruction-only, reference-only, and combined instruction-plus-reference. The full 5B parameter model runs at up to 81 frames and 1280x720 resolution.

For filmmakers evaluating open source video editing tools, the MIT license and Diffusers support make Kiwi-Edit straightforward to integrate. It builds on the same Wan2.2 base used by other recent open source video tools, which reduces infrastructure overhead if that model is already in use. Explore video editing capabilities at AI FILMS Studio.
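Because the variants are Diffusers-compatible, loading one should follow the standard Diffusers pattern. The repository id below is a placeholder, not the published model name; check the ShowLab Hugging Face page for the actual ids and the model card for the pipeline's call signature.

```python
def load_kiwi_edit(variant: str = "instruction-plus-reference"):
    """Load a Kiwi-Edit variant with Diffusers (sketch, not verified).

    The repo id is hypothetical -- substitute the real id from the
    ShowLab Hugging Face page.
    """
    import torch
    from diffusers import DiffusionPipeline

    repo_id = f"showlab/Kiwi-Edit-{variant}"  # placeholder id
    pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    return pipe.to("cuda")
```

At 5B parameters the model fits comfortably on a single high-memory consumer GPU in bfloat16, which is part of what makes the shared Wan2.2 base attractive for teams already running it.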

Video Examples

The following pairs show each edit alongside the original. Videos autoplay in a loop.

Global Changes

Edited

Original

Prompt: "Apply the night aesthetic to this video."

Global Style

Edited

Original

Prompt: "Apply the Cartoon animation style to this video."

Edited

Original

Prompt: "Apply the sunny aesthetic to this video."

Background Replace

Edited

Original

Prompt: "Replace the background with a lively urban garden scene during winter."

Local Remove

Edited

Original

Prompt: "Remove the small bird."

Local Add

Edited

Original

Prompt: "Add an floor lamp near the corner of the room in front of the desk behind the man."

Local Replace

Edited

Original

Prompt: "Transform the entire bread surface into a miniature lush forest with green trees, moss, and tiny flowers shaped exactly like the loaf."

Reference Image Guided Editing

When a reference image is provided alongside the prompt, Kiwi-Edit transfers visual attributes from the reference into the edited video. This enables precise control over costume, texture, and background style without relying solely on text description.

Subject Edit

Flat lay reference photo of a matte black techwear tactical jacket
Reference image

Edited

Original

Prompt: "Replace the girl's cloth with a matte black techwear tactical jacket."

Background Replace With Reference

Reference image of a fairytale storybook style forest background
Reference image

Edited

Original

Prompt: "Replace the background with a whimsical fairytale landscape, featuring colorful mushroom houses and a floating castle in a magical forest."

How It Compares

Kiwi-Edit is not the first instruction-guided video editor. Ditto, the dataset project behind the Ditto and Editto editing models, generated 1 million synthetic video editing pairs to train its editing model, contributing to a broader push toward data-driven video editing. VideoCoF took a different approach, using chain-of-frames reasoning to achieve mask-free precision on just 50,000 training pairs. Kiwi-Edit distinguishes itself by combining both modalities, instruction text and reference images, within a single architecture, and by releasing a dedicated benchmark and dataset for reference-guided video editing.

The Wan2.2 base model also appears in other recent tools covered on this site, including Wan2.2-Animate, which uses the same foundation for character animation and replacement. First Frame Go explored how first-frame conditioning in Wan-type models enables strong customization with minimal examples.

For automated scene segmentation before applying edits, Meta SAM 3 provides text-driven object tracking that can feed directly into editing pipelines.

Access and Resources

Generate and edit videos directly in AI FILMS Studio.

Sources

ShowLab, National University of Singapore | arXiv | Hugging Face