SAMA: Instruction Guided Video Editing with 14B Model

Researchers at Baidu, Tsinghua University, City University of Hong Kong, and Zhejiang University released SAMA on March 19, 2026. The 14B parameter model edits video from plain text instructions, covering four tasks: object replacement, object addition, object removal, and style transfer. Model weights and code are publicly available.
The Problem: Semantics vs. Motion
Instruction based video editing forces a direct conflict. An edit must change the visual content of a scene while preserving how everything in that scene moves. Models that optimize for semantic accuracy tend to distort motion. Models built to preserve motion often fail to apply edits consistently across frames.
SAMA separates the two. The architecture factors video editing into independent components for semantic planning and motion modeling, solving each without forcing a compromise between them.
Two Stage Training
The first stage runs without paired editing data at all. Three motion centric pretext tasks teach the model temporal dynamics directly from raw video:
- Cube inpainting: reconstructing masked video regions
- Speed perturbation: predicting frames at altered temporal rates
- Tube shuffle: restoring randomly reordered temporal segments
This equips the model with an internal representation of motion before it ever sees a labeled edit pair. The second stage then runs supervised fine tuning on SAMA-edit-filtered-1M, a curated dataset of one million video editing examples.
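The three pretext tasks amount to simple transforms on raw video tensors. A minimal NumPy sketch of each (the clip shape, cube size, and segment count here are illustrative choices, not the paper's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((16, 3, 64, 64))  # toy clip: (frames, channels, H, W)

def cube_inpaint_mask(clip, t=4, h=16, w=16):
    """Zero out a random spatio-temporal cube; the model must reconstruct it."""
    T, _, H, W = clip.shape
    t0 = rng.integers(0, T - t + 1)
    y0 = rng.integers(0, H - h + 1)
    x0 = rng.integers(0, W - w + 1)
    masked = clip.copy()
    masked[t0:t0 + t, :, y0:y0 + h, x0:x0 + w] = 0.0
    return masked

def speed_perturb(clip, rate=2):
    """Subsample frames to simulate an altered temporal rate."""
    return clip[::rate]

def tube_shuffle(clip, segments=4):
    """Split the clip into temporal segments and shuffle their order;
    the model must restore the original ordering."""
    parts = np.array_split(clip, segments, axis=0)
    order = rng.permutation(segments)
    return np.concatenate([parts[i] for i in order], axis=0), order
```

Each transform leaves the raw pixels as the only supervision signal, which is why no paired editing data is needed at this stage.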
Semantic Anchoring
Rather than editing every frame in sequence, SAMA identifies sparse anchor frames and establishes semantic tokens and video latents at those keypoints. These anchors define the structure of the intended edit. Surrounding frames fill in based on the motion representations learned in stage one.
This sidesteps a common failure mode in video editing models: instruction following that degrades mid clip, producing flickering or inconsistent object appearance across the sequence.
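As a rough intuition for the anchoring scheme, picture sparse keyframes spread across the clip, with every intermediate frame filled in relative to its closest anchor. The sketch below is a deliberate simplification of that idea, not SAMA's actual anchor selection policy, which the post does not detail:

```python
import numpy as np

def select_anchors(num_frames: int, num_anchors: int = 4) -> list[int]:
    # Evenly spaced anchor indices across the clip (illustrative stand-in
    # for whatever selection policy the model actually learns).
    return np.linspace(0, num_frames - 1, num_anchors).round().astype(int).tolist()

def nearest_anchor(frame_idx: int, anchors: list[int]) -> int:
    # In this toy scheme, a non-anchor frame inherits the edit's semantics
    # from its closest anchor and keeps its own motion.
    return min(anchors, key=lambda a: abs(a - frame_idx))
```

Because every frame is tied to a nearby anchor rather than to its immediate predecessor, errors do not accumulate frame by frame, which is the intuition behind the reduced flicker.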
Object Replacement
The pair below shows the original clip on the left and the edited result on the right.
Source
Replace the black swan with a white cat.
SAMA replaces the subject while keeping the water movement, background, and camera motion intact. The swan's motion is carried through the cat, rather than replaced with a static insertion.
Object Addition
Source
Add a white cartoon cat in the right of the boy.
Object addition inserts new elements at a specified spatial position. The added cat tracks the scene's motion and perspective without a separate mask or reference image as input.
Style Transfer
Source
Turn into watercolor style.
Style transfers apply aesthetic transformations globally across the clip. Motion and scene structure carry through the watercolor render, rather than appearing as a flat overlay.
Appearance Editing
Source
Change all person's clothes color into black.
The model applies localized appearance changes to multiple subjects simultaneously. Clothing across all people in the frame converts to black while body motion and background remain unchanged.
Object Removal
Source
Remove the football.
Object removal fills the area previously occupied by the target with plausible background content. The pitch surface, players, and motion continue without visible seams where the ball was. For a model that also removes an object's cast shadows and reflections alongside it, see EffectErase.
Accessory Addition
Source
Add a brown hat on the man's head.
Fine grained additions like accessories on a moving subject require the model to track head position across frames and maintain consistent hat placement as the subject moves. No mask or manual tracking is required as input.
Text Replacement in Video
Source
Replace the subtitles at the bottom with "Her fingers trace fragile film - soft light, a life unspooling." in white text without border style.
SAMA can target and replace text overlays in video. The instruction specifies both the replacement string and visual formatting. The model removes the original subtitle burn and renders the new text in its place across the clip's duration.
Editors working on more localized AI assisted workflows can also explore Kiwi-Edit, another open source framework that combines text instructions with reference image guidance for video editing.
Benchmark Results
SAMA achieves state of the art results among open source instruction guided video editing models on VIE-Bench, OpenVE-Bench, and ReCo-Bench. The paper reports competitive performance against commercial systems including Kling-Omni.
The authors attribute a portion of the benchmark gains to the zero shot editing capability that emerges from stage one pretraining alone. The model performs reasonable edits before any supervised fine tuning, which the authors argue demonstrates the quality of the learned motion representations.
For comparison, Ditto and Editto reported similar state of the art claims on instruction based video editing, using a synthetic dataset of one million training pairs generated through a different pipeline.
Access
The 14B model is available on Hugging Face at syxbb/SAMA-14B. Code is open source on GitHub. The training dataset SAMA-edit-filtered-1M is listed as under review for release. The paper is available on arXiv under a CC BY-SA 4.0 license. Try text-to-video generation and image-to-video with AI FILMS Studio.
Sources
- arXiv: SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
- GitHub: Cynthiazxy123/SAMA
- Hugging Face: syxbb/SAMA-14B
- Project Page: cynthiazxy123.github.io/SAMA
