Aurora: Unified Video Editing Using a VLM Agent

May 30, 2026

Aurora is an open-source framework from the University of Rochester, MIT-IBM Watson AI Lab, and NVIDIA that applies a vision-language model (VLM) agent to video editing. Instead of requiring structured model inputs, Aurora accepts plain-language edit requests and converts them into precise editing plans before passing them to a diffusion transformer.

The paper was submitted to arXiv in May 2026 by Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, and Jiebo Luo.

Aurora pipeline: the VLM agent translates a plain language request into a structured edit plan before the diffusion transformer executes it

The Problem with Existing Video Editing Models

Current video editing transformers require precise, structured inputs: reference images, spatial grounding coordinates, and carefully formatted text that matches the model's conditioning channels. That precision requirement creates a gap between what a filmmaker wants to say and what the model can accept.

If you describe an edit in natural language, such as "remove the person on the left and replace the jacket with a red coat", most editing models cannot process that without additional human work to produce structured inputs first. Aurora adds an agent layer that does that work automatically.

How the Agent Works

The VLM agent sits between the user's instruction and the diffusion transformer. When it receives a request, it uses tool calls to identify what reference images are needed, determines where spatial grounding is required, and produces a structured edit plan that aligns with the transformer's conditioning channels.
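As a rough illustration of what a structured edit plan could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `EditPlan` and `GroundingBox` types, their field names, and the two stub tools (`find_references`, `ground_targets`) are hypothetical and are not taken from the Aurora codebase or paper.

```python
from dataclasses import dataclass, field


@dataclass
class GroundingBox:
    """Hypothetical normalized box (0..1) locating an edit target in a frame."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class EditPlan:
    """Hypothetical structured plan; field names are illustrative only."""
    prompt: str
    reference_images: list[str] = field(default_factory=list)
    grounding: list[GroundingBox] = field(default_factory=list)


def plan_edit(request: str, tools: dict) -> EditPlan:
    """Toy planner: call each tool once and collect the results into one plan."""
    plan = EditPlan(prompt=request)
    plan.reference_images = tools["find_references"](request)
    plan.grounding = tools["ground_targets"](request)
    return plan


# Stub tools standing in for the agent's real tool calls (assumed, not Aurora's).
def find_references(request: str) -> list[str]:
    return ["red_coat.png"] if "red coat" in request else []


def ground_targets(request: str) -> list[GroundingBox]:
    if "person on the left" in request:
        return [GroundingBox("person", 0.0, 0.1, 0.4, 0.9)]
    return []


TOOLS = {"find_references": find_references, "ground_targets": ground_targets}
plan = plan_edit(
    "remove the person on the left and replace the jacket with a red coat",
    TOOLS,
)
print(plan)
```

In a real system the tools would be driven by the VLM rather than by keyword checks; the point of the sketch is only that the agent's output is a typed object with one slot per conditioning channel, not free text.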

The agent was trained on supervised data for complete edit planning and on preference pairs for reliable tool use. The preference-based part resembles the RLHF approach that made large language models responsive to human intent rather than just competent at prediction. Flash-GRPO applied similar reinforcement learning logic to video diffusion training, approaching the alignment challenge from the training side, whereas Aurora approaches it from the interface side.

Object replacement: Aurora receives a plain language request and identifies the correct reference image for the substitution

Object addition: the agent grounds the spatial placement from context in the original video

The agent handles underspecification (requests that are vague or incomplete) by inferring the missing structure from the visual context of the input video. This differs from instruction-only editing models, which fail when the user's request does not contain all the information the model needs.
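The idea of filling in missing structure can be sketched as a small completion step. This is a toy model under stated assumptions: the `REQUIRED` table, the field names, and the `visual_context` dictionary are hypothetical stand-ins, and a real agent would infer these values with a VLM rather than look them up.

```python
# Hypothetical: which fields each edit type needs before it can be executed.
REQUIRED = {
    "replace": {"target", "reference"},
    "add": {"reference", "location"},
    "remove": {"target"},
}


def missing_fields(edit_type: str, request_fields: dict) -> set:
    """Return the required fields that the user's request did not supply."""
    return REQUIRED[edit_type] - set(request_fields)


def complete_request(edit_type: str, request_fields: dict, visual_context: dict):
    """Fill missing fields from visual context; report what stays unresolved."""
    completed = dict(request_fields)
    unresolved = set()
    for name in missing_fields(edit_type, request_fields):
        if name in visual_context:
            completed[name] = visual_context[name]
        else:
            unresolved.add(name)
    return completed, unresolved


# The user names an object to add but not where to put it.
completed, unresolved = complete_request(
    "add",
    {"reference": "lamp.png"},
    {"location": (0.5, 0.5)},  # assumed to come from analysing the video
)
print(completed, unresolved)
```

An instruction-only model corresponds to skipping `complete_request` entirely: any request with a non-empty `missing_fields` result would simply fail.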

Referential reasoning: Aurora identifies the correct visual referent from a description without explicit bounding box input

AgentEdit-Bench

Aurora introduces a new benchmark, AgentEdit-Bench, designed specifically to test video editing systems under textual and visual underspecification. Existing benchmarks assume well-formed inputs; AgentEdit-Bench tests performance when the user's instruction is incomplete.

Aurora outperforms instruction-only baselines on all three evaluated benchmarks. The VLM agent also transfers to compatible frozen video editing models, meaning the agent frontend is not locked to a single backbone.

The Modular Architecture

The key architectural decision in Aurora is the separation between the edit planner and the executor. The VLM agent produces the edit plan; the diffusion transformer executes it. These are two distinct components.

This separation means the Aurora agent can be wired to future, better video editing transformers without retraining the agent. As video diffusion models improve, the same agent interface remains valid: you upgrade the executor, not the planner. That modularity is not covered in mainstream press reporting on the paper, and it is what makes Aurora a framework rather than just a model.
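The planner/executor split can be sketched with two small interfaces. The names below (`Planner`, `Executor`, `StubPlanner`, `ExecutorV1`, `ExecutorV2`, `edit`) and the dictionary-shaped plan are hypothetical and chosen only to show the pattern; they are not Aurora's actual API.

```python
from typing import Protocol


class Planner(Protocol):
    """Anything that turns a video and a request into a structured plan."""
    def plan(self, video: str, request: str) -> dict: ...


class Executor(Protocol):
    """Anything that applies a structured plan to a video."""
    def execute(self, video: str, plan: dict) -> str: ...


class StubPlanner:
    """Stand-in for the VLM agent; returns a fixed-shape plan."""
    def plan(self, video: str, request: str) -> dict:
        return {"prompt": request, "reference_images": [], "grounding": []}


class ExecutorV1:
    """Stand-in for one diffusion backbone."""
    def execute(self, video: str, plan: dict) -> str:
        return f"v1:{video}:{plan['prompt']}"


class ExecutorV2:
    """Stand-in for a newer backbone that accepts the same plan format."""
    def execute(self, video: str, plan: dict) -> str:
        return f"v2:{video}:{plan['prompt']}"


def edit(planner: Planner, executor: Executor, video: str, request: str) -> str:
    """Plan once, then hand the plan to whichever executor is plugged in."""
    return executor.execute(video, planner.plan(video, request))


planner = StubPlanner()
out_v1 = edit(planner, ExecutorV1(), "clip.mp4", "make the coat red")
out_v2 = edit(planner, ExecutorV2(), "clip.mp4", "make the coat red")
print(out_v1)
print(out_v2)
```

The same `planner` object is reused across both calls; only the executor changes. The design relies on the plan format staying stable, which is why the source describes transfer to compatible models rather than to arbitrary ones.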

For Iterative Video Editing

The use case for filmmakers is iterative editing: generating a clip, then making targeted changes to it without restarting from scratch. Describe what needs to change and Aurora handles the translation into model inputs.

The broader shift toward AI filmmaking as a production tool in 2026 has increased demand for this kind of iteration. Generating raw footage is increasingly accessible; editing that footage precisely, without re-prompting from zero, is the gap Aurora targets.

Sources

arXiv: Aurora: Unified Video Editing with a Tool-Using Agent
Project Page: yongshengyu.com/Aurora-Page
