Kiwi-Edit: Open Source Video Editing Framework
Kiwi-Edit is a new open source video editing framework from ShowLab at the National University of Singapore. Released under the MIT license, it combines text instruction guidance with reference image control to handle a wide range of editing tasks at 720p resolution.
What Kiwi-Edit Does
The framework covers both global and local video edits. On the global side, it applies style transfers, including cartoon, sketch, watercolor, and other visual aesthetics. On the local side, it handles object removal, object addition, object replacement, and background swaps. Beyond text instructions, users can supply a reference image to guide the visual output, which is particularly useful when language alone cannot describe the intended result.
The paper, "Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance," identifies a core problem with existing tools. Instruction based systems lack precise visual control because natural language cannot fully capture complex appearance details. Reference guided methods solve that, but they historically suffered from scarce paired training data. Kiwi-Edit addresses both sides by building a scalable pipeline that converts existing video editing pairs into reference training quadruplets using image generative models.
Architecture
The system builds on two core components: a Qwen2.5-VL-3B vision-language model that extracts semantic guidance from prompts and reference images, and a Wan2.2-TI2V-5B video diffusion transformer that handles the actual generation. Source video latents are injected to preserve structure, and learnable queries bring in reference visual features without requiring architectural changes to the base model.
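The learnable-query mechanism can be pictured as a small cross-attention step: a fixed set of trainable query tokens attends over the reference-image features and pools them into a fixed-size summary the diffusion transformer can consume. The sketch below is a simplified NumPy illustration of that idea; the dimensions, random initialization, and softmax attention here are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64         # feature dimension (illustrative, not the paper's actual size)
n_queries = 8  # number of learnable query tokens (assumed)
n_ref = 256    # number of reference-image feature tokens (assumed)

# Learnable queries (randomly initialized here; trained in the real system).
queries = rng.standard_normal((n_queries, d))
# Stand-in for reference features, e.g. from the vision-language encoder.
ref_features = rng.standard_normal((n_ref, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: each query pools a weighted mix of
# reference features, yielding a fixed-size conditioning signal
# regardless of how many reference tokens come in.
scores = queries @ ref_features.T / np.sqrt(d)   # (n_queries, n_ref)
weights = softmax(scores, axis=-1)               # rows sum to 1
ref_summary = weights @ ref_features             # (n_queries, d)

print(ref_summary.shape)  # (8, 64)
```

Because the queries are extra inputs rather than new layers, conditioning on a reference image in this style leaves the base model's architecture untouched, which is consistent with the framework's claim of requiring no architectural changes.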
Training follows a three stage curriculum. Stage one covers image editing only. Stage two adds video editing at multiple resolutions. Stage three introduces reference guided video editing on the RefVIE dataset the team built for this work. The final training set contains 477,000 high quality quadruplets drawn from a raw pool of 3.7 million samples from Ditto-1M, ReCo, and OpenVE-3M.
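Distilling 477,000 quadruplets from a 3.7 million sample pool implies a quality-filtering step. The snippet below sketches what such a filter might look like in plain Python; the field names, scores, and threshold are hypothetical stand-ins, not the paper's actual selection criteria.

```python
# Hypothetical quality filter over reference training quadruplets:
# (source video, instruction, reference image, edited video), each with
# a quality score (e.g. from an automated judge). All names and the
# threshold below are assumptions for illustration.
raw_pool = [
    {"source": "clip_001.mp4", "instruction": "remove the bird",
     "reference": "ref_001.png", "edited": "clip_001_edit.mp4", "score": 0.91},
    {"source": "clip_002.mp4", "instruction": "cartoon style",
     "reference": "ref_002.png", "edited": "clip_002_edit.mp4", "score": 0.42},
    {"source": "clip_003.mp4", "instruction": "replace background",
     "reference": "ref_003.png", "edited": "clip_003_edit.mp4", "score": 0.77},
]

QUALITY_THRESHOLD = 0.7  # assumed cutoff

def filter_quadruplets(pool, threshold=QUALITY_THRESHOLD):
    """Keep only quadruplets whose quality score clears the threshold."""
    return [q for q in pool if q["score"] >= threshold]

kept = filter_quadruplets(raw_pool)
print(len(kept))  # 2 of the 3 toy samples survive
```

At the reported scale, a filter of roughly this shape would retain about 13 percent of the raw pool (477K of 3.7M), so the scoring signal, whatever its actual form, is doing most of the curation work.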
Benchmark Results
On OpenVE-Bench, evaluated by Gemini-2.5-Pro, the Stage-3 model scores 3.02 overall, the highest among open source methods. The breakdown: 3.83 for local changes and 3.64 for global style. On RefVIE-Bench, the system shows competitive results on identity consistency, temporal consistency, and reference similarity metrics.
Open Source and Commercial Use
The code, models, and datasets are all publicly available. The repository is licensed under MIT, which permits commercial use with no restrictions. Three Diffusers-compatible model variants are on Hugging Face: instruction-only, reference-only, and combined instruction-plus-reference. The full 5B parameter model runs at up to 81 frames and 1280x720 resolution.
For filmmakers evaluating open source video editing tools, the MIT license and Diffusers support make Kiwi-Edit straightforward to integrate. It builds on the same Wan2.2 base used by other recent open source video tools, which reduces infrastructure overhead if that model is already in use. Explore video editing capabilities at AI FILMS Studio.
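For orientation, an integration call might look roughly like the sketch below. The pipeline class, repository ID, and argument names are assumptions based on typical Diffusers conventions; consult the Kiwi-Edit repository for the actual API before use.

```python
# Hypothetical Diffusers integration sketch. Only the frame count (81)
# and resolution (1280x720) come from the project's stated limits; the
# model ID, pipeline class, and keyword names are assumptions.
generation_kwargs = {
    "prompt": "Apply the watercolor style to this video.",
    "num_frames": 81,   # maximum reported by the project
    "height": 720,
    "width": 1280,
}

def run_edit(source_video, reference_image=None):
    """Sketch of how a Diffusers-compatible call might look (untested)."""
    from diffusers import DiffusionPipeline  # deferred: diffusers may not be installed
    # "showlab/Kiwi-Edit" is a hypothetical repo ID, not a verified one.
    pipe = DiffusionPipeline.from_pretrained("showlab/Kiwi-Edit")
    kwargs = dict(generation_kwargs, video=source_video)
    if reference_image is not None:
        kwargs["reference_image"] = reference_image
    return pipe(**kwargs)

print(generation_kwargs["num_frames"])  # 81
```

The optional `reference_image` argument mirrors the framework's design: text-only instruction editing works on its own, and a reference image is layered on only when language alone cannot pin down the intended appearance.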
Video Examples
The following pairs show each edit alongside the original.
Global Changes
Prompt: "Apply the night aesthetic to this video."
Global Style
Prompt: "Apply the Cartoon animation style to this video."
Prompt: "Apply the sunny aesthetic to this video."
Background Replace
Prompt: "Replace the background with a lively urban garden scene during winter."
Local Remove
Prompt: "Remove the small bird."
Local Add
Prompt: "Add an floor lamp near the corner of the room in front of the desk behind the man."
Local Replace
Prompt: "Transform the entire bread surface into a miniature lush forest with green trees, moss, and tiny flowers shaped exactly like the loaf."
Reference Image Guided Editing
When a reference image is provided alongside the prompt, Kiwi-Edit transfers visual attributes from the reference into the edited video. This enables precise control over costume, texture, and background style without relying solely on text description.
Subject Edit
Prompt: "Replace the girl's cloth with a matte black techwear tactical jacket."
Background Replace With Reference
Prompt: "Replace the background with a whimsical fairytale landscape, featuring colorful mushroom houses and a floating castle in a magical forest."
How It Compares
Kiwi-Edit is not the first instruction guided video editor. Ditto, the dataset project behind the Editto editing model, generated 1 million synthetic video editing pairs, contributing to a broader push toward data driven video editing. VideoCoF took a different approach, using chain of frames reasoning to achieve mask free precision on just 50,000 training pairs. Kiwi-Edit distinguishes itself by combining both modalities, instruction text and reference images, within a single architecture, and by releasing a dedicated benchmark and dataset for reference guided video editing.
The Wan2.2 base model also appears in other recent tools covered on this site, including Wan2.2-Animate, which uses the same foundation for character animation and replacement. First Frame Go explored how first-frame conditioning in Wan-type models enables strong customization with minimal examples.
For automated scene segmentation before applying edits, Meta SAM 3 provides text-driven object tracking that can feed directly into editing pipelines.
Access and Resources
- Code: github.com/showlab/Kiwi-Edit (MIT license)
- Models: Hugging Face collection
- Dataset: kiwi_edit_training_data
- Paper: arXiv:2603.02175
Generate and edit videos directly in AI FILMS Studio.
Sources
ShowLab, National University of Singapore | arXiv | Hugging Face

