SAMA: Instruction Guided Video Editing with 14B Model

Researchers at Baidu, Tsinghua University, City University of Hong Kong, and Zhejiang University released SAMA on March 19, 2026. The 14B parameter model edits video from plain text instructions, covering four tasks: object replacement, object addition, object removal, and style transfer. Model weights and code are publicly available.
The Problem: Semantics vs. Motion
Instruction based video editing forces a direct conflict. An edit must change the visual content of a scene while preserving how everything in that scene moves. Models that optimize for semantic accuracy tend to distort motion. Models built to preserve motion often fail to apply edits consistently across frames.
SAMA separates the two. The architecture factors video editing into independent components for semantic planning and motion modeling, solving each without forcing a compromise between them.
Two Stage Training
The first stage runs without paired editing data at all. Three motion centric pretext tasks teach the model temporal dynamics directly from raw video:
- Cube inpainting: reconstructing masked video regions
- Speed perturbation: predicting frames at altered temporal rates
- Tube shuffle: restoring randomly reordered temporal segments
This equips the model with an internal representation of motion before it ever sees a labeled edit pair. The second stage then runs supervised fine tuning on SAMA-edit-filtered-1M, a curated dataset of one million video editing examples.
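The three pretext tasks amount to simple transforms on raw video tensors. A minimal NumPy sketch of each (the clip shape, cube size, and segment count here are illustrative choices, not the paper's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((16, 3, 64, 64))  # toy clip: (frames, channels, H, W)

def cube_inpaint_mask(clip, t=4, h=16, w=16):
    """Zero out a random spatio-temporal cube; the model must reconstruct it."""
    T, _, H, W = clip.shape
    t0 = rng.integers(0, T - t + 1)
    y0 = rng.integers(0, H - h + 1)
    x0 = rng.integers(0, W - w + 1)
    masked = clip.copy()
    masked[t0:t0 + t, :, y0:y0 + h, x0:x0 + w] = 0.0
    return masked

def speed_perturb(clip, rate=2):
    """Subsample frames to simulate an altered temporal rate."""
    return clip[::rate]

def tube_shuffle(clip, segments=4):
    """Split the clip into temporal segments and shuffle their order;
    the model must restore the original ordering."""
    parts = np.array_split(clip, segments, axis=0)
    order = rng.permutation(segments)
    return np.concatenate([parts[i] for i in order], axis=0), order
```

Each transform leaves the raw pixels as the only supervision signal, which is why no paired editing data is needed at this stage.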
Semantic Anchoring
Rather than editing every frame in sequence, SAMA identifies sparse anchor frames and establishes semantic tokens and video latents at those keypoints. These anchors define the structure of the intended edit. Surrounding frames fill in based on the motion representations learned in stage one.
This sidesteps a common failure mode in video editing models: instruction following that degrades mid clip, producing flickering or inconsistent object appearance across the sequence.
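As a rough intuition for the anchoring scheme, picture sparse keyframes spread across the clip, with every intermediate frame filled in relative to its closest anchor. The sketch below is a deliberate simplification of that idea, not SAMA's actual anchor selection policy, which the post does not detail:

```python
import numpy as np

def select_anchors(num_frames: int, num_anchors: int = 4) -> list[int]:
    # Evenly spaced anchor indices across the clip (illustrative stand-in
    # for whatever selection policy the model actually learns).
    return np.linspace(0, num_frames - 1, num_anchors).round().astype(int).tolist()

def nearest_anchor(frame_idx: int, anchors: list[int]) -> int:
    # In this toy scheme, a non-anchor frame inherits the edit's semantics
    # from its closest anchor and keeps its own motion.
    return min(anchors, key=lambda a: abs(a - frame_idx))
```

Because every frame is tied to a nearby anchor rather than to its immediate predecessor, errors do not accumulate frame by frame, which is the intuition behind the reduced flicker.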
Object Replacement
The pair below shows the original clip on the left and the edited result on the right.
Source
Replace the black swan with a white cat.
SAMA replaces the subject while keeping the water movement, background, and camera motion intact. The swan's motion is carried through the cat, rather than replaced with a static insertion.
Object Addition
Source
Add a white cartoon cat in the right of the boy.
Object addition inserts new elements at a specified spatial position. The added cat tracks the scene's motion and perspective without a separate mask or reference image as input.
Style Transfer
Source
Turn into watercolor style.
Style transfers apply aesthetic transformations globally across the clip. Motion and scene structure carry through the watercolor render, rather than appearing as a flat overlay.
Appearance Editing
Source
Change all person's clothes color into black.
The model applies localized appearance changes to multiple subjects simultaneously. Clothing across all people in the frame converts to black while body motion and background remain unchanged.
Object Removal
Source
Remove the football.
Object removal fills the area previously occupied by the target with plausible background content. The pitch surface, players, and motion continue without visible seams where the ball was. For a model that also removes an object's cast shadows and reflections alongside it, see EffectErase.
Accessory Addition
Source
Add a brown hat on the man's head.
Fine grained additions like accessories on a moving subject require the model to track head position across frames and maintain consistent hat placement as the subject moves. No mask or manual tracking is required as input.
Text Replacement in Video
Source
Replace the subtitles at the bottom with "Her fingers trace fragile film - soft light, a life unspooling." in white text without border style.
SAMA can target and replace text overlays in video. The instruction specifies both the replacement string and visual formatting. The model removes the original subtitle burn and renders the new text in its place across the clip's duration.
Editors working on more localized AI assisted workflows can also explore Kiwi-Edit, another open source framework that combines text instructions with reference image guidance for video editing.
Benchmark Results
SAMA achieves state of the art results among open source instruction guided video editing models on VIE-Bench, OpenVE-Bench, and ReCo-Bench. The paper reports competitive performance against commercial systems including Kling-Omni.
The authors attribute a portion of the benchmark gains to the zero shot editing capability that emerges from stage one pretraining alone. The model performs reasonable edits before any supervised fine tuning, which the authors argue demonstrates the quality of the learned motion representations.
For comparison, Ditto and Editto reported similar state of the art claims on instruction based video editing, using a synthetic dataset of one million training pairs generated through a different pipeline.
Access
The 14B model is available on Hugging Face at syxbb/SAMA-14B. Code is open source on GitHub. The training dataset SAMA-edit-filtered-1M is listed as under review for release. The paper is available on arXiv under a CC BY-SA 4.0 license. Try text-to-video generation and image-to-video with AI FILMS Studio.
Sources
- arXiv: SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
- GitHub: Cynthiazxy123/SAMA
- Hugging Face: syxbb/SAMA-14B
- Project Page: cynthiazxy123.github.io/SAMA
