Lance: ByteDance's Unified Video and Image Generation Model (Apache 2.0)

May 21, 2026

ByteDance Research released Lance on May 18, 2026: a 3-billion-parameter open-source model that handles text-to-video, text-to-image, video editing, image editing, and multimodal understanding within a single unified architecture. It is licensed under Apache 2.0, which permits commercial use.

One Model, Eight Tasks

Most AI production pipelines stack separate specialist models, one per task: one for text-to-video, another for image generation, a third for editing. Lance handles eight distinct tasks in a single architecture: text-to-video generation, video editing, sequential video editing across multiple turns, structured video planning, video understanding (including visual question answering and captioning), text-to-image generation, instruction-based image editing, and image understanding.

For a filmmaker or solo creator assembling a production workflow, that means one model covers the full pipeline from concept images to edited video clips, under a single license.

Benchmark Results

On VBench, the standard benchmark for video generation quality, Lance scores 85.11 overall, the highest among unified models in its published comparison. Subject consistency reaches 94.52, background consistency 94.28, and temporal flicker 99.66. Its semantic score of 84.96 indicates strong alignment between text prompts and generated output.

On MVBench, which tests video understanding rather than generation, Lance scores 62.0, again the highest among unified models in the comparison.

Video Editing

Lance supports video editing guided by text instructions, covering background transformation, object manipulation, subject replacement, and style transfer. Its sequential editing capability chains multiple modifications across linked edits, changing subject, appearance, background, and motion in sequence without regenerating the clip from scratch.

How It Works

Lance uses a dual-stream Mixture-of-Experts design that separates semantic understanding from visual generation while processing shared multimodal sequences. Positional encoding is handled by Modality Aware Rotary Positional Encoding (MaPE), which reduces interference between the different types of visual tokens the model processes simultaneously.
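To make the positional encoding idea concrete, here is a minimal sketch of standard rotary positional encoding with one hypothetical modality-aware twist. This is not Lance's MaPE implementation, whose details are not covered here. The per-modality offsets, the function names, and the MODALITY_OFFSET table are all assumptions made for illustration. The sketch shows only the general principle: rotary encoding makes attention scores depend on relative position, so shifting each modality into its own position range leaves same-modality relationships unchanged while changing how tokens of different modalities relate to each other.

```python
import math

def rope(vec, position, base=10000.0):
    """Standard rotary encoding: rotate each consecutive pair of dimensions
    by an angle proportional to the token position."""
    dim = len(vec)
    assert dim % 2 == 0, "rotary encoding needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Hypothetical offsets, chosen only for this illustration (not from the paper).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}

def modality_aware_rope(vec, position, modality):
    """Illustrative variant: shift positions into a per-modality range
    before applying the standard rotation."""
    return rope(vec, position + MODALITY_OFFSET[modality])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same modality: the offset cancels, so the score matches plain rotary encoding.
same_modality = dot(modality_aware_rope(q, 3, "video"), modality_aware_rope(k, 1, "video"))
plain = dot(rope(q, 3), rope(k, 1))

# Different modalities: the effective relative distance is shifted by the offset gap.
cross_modality = dot(modality_aware_rope(q, 3, "video"), modality_aware_rope(k, 1, "image"))
```

In this sketch, same_modality equals plain up to floating point error, while cross_modality generally differs, which is one simple way a positional scheme can keep token types from being confused with one another.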

The model was trained from scratch using no more than 128 A100 GPUs. A staged multitask training approach with capability-oriented objectives and adaptive data scheduling drives the separation between semantic comprehension and visual generation across all eight tasks.

What It Means for Filmmakers

The practical case for a unified model is workflow compression. A production that needs AI-generated backgrounds, character motion clips, and edited footage currently routes work through multiple separate tools and interfaces. Lance consolidates those steps into a single model under a single Apache 2.0 license, removing per-task licensing complexity for commercial productions.

The video understanding capability covers visual question answering and captioning on footage, adding a function specialist generation models typically cannot provide: automated analysis of existing video, useful for continuity checking, scene description, and archival tagging.

ByteDance Research followed Lance with Bernini, a separate 14B model focused on reference-guided video editing using a two-stage semantic planning architecture, released June 1, 2026.

Lance joins a growing set of open source video generation tools available to filmmakers. MOVA addresses audio synchronized video generation with a different architecture approach, and LTX 2.3 targets high-resolution latent diffusion video output. Lance's differentiation is the unified architecture across generation, editing, and understanding in one model. Filmmakers can run AI generated video workflows through AI FILMS Studio's video workspace.

Sources

arXiv: Lance: Unified Multimodal Modeling by Multi-Task Synergy
GitHub: bytedance/Lance
Hugging Face: bytedance-research/Lance
Project Page: lance-project.github.io
