Lance: ByteDance's Unified Video and Image Generation Model (Apache 2.0)
On May 18, 2026, ByteDance Research released Lance, a 3-billion-parameter open-source model that handles text-to-video, text-to-image, video editing, image editing, and multimodal understanding within a single unified architecture. The license is Apache 2.0, which permits commercial use.
One Model, Eight Tasks
Most AI production pipelines stack separate specialist models for each task: one for text-to-video, another for image generation, a third for editing. Lance handles eight distinct tasks in a single architecture: text-to-video generation, video editing, sequential video editing across multiple turns, structured video planning, video understanding (including visual question answering and captioning), text-to-image generation, instruction-based image editing, and image understanding.
For a filmmaker or solo creator assembling a production workflow, that means one model covers the full pipeline from concept images to edited video clips, under a single license.
Benchmark Results
On VBench, the standard benchmark for video generation quality, Lance scores 85.11 overall, the highest among unified models in its published comparison. Subject consistency reaches 94.52, background consistency 94.28, and temporal flicker 99.66. Its semantic score of 84.96 indicates strong alignment between text prompts and generated output.
On MVBench, which tests video understanding rather than generation, Lance scores 62.0, again the highest among unified models in the comparison.
Text-to-Video Examples
Text-to-video generation
Video Editing
Lance supports video editing guided by text instructions, covering background transformation, object manipulation, subject replacement, and style transfer. Sequential editing chains several modifications together, so subject, appearance, background, and motion can be changed one after another without regenerating the clip from scratch.
Instruction guided video editing
How It Works
Lance uses a dual stream Mixture of Experts design that separates semantic understanding from visual generation while processing shared multimodal sequences. Positional encoding is handled by Modality Aware Rotary Positional Encoding (MaPE), which reduces interference between the different types of visual tokens the model processes simultaneously.
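The article does not spell out how MaPE works internally, so the following is only a minimal sketch of one way a rotary encoding could be made modality aware, not Lance's actual implementation. It applies standard rotary positional encoding and adds a per-modality position offset so that tokens of different modalities occupy disjoint position ranges. The offset table, the function names, and the offset values are all assumptions made for illustration.

```python
import math

# Hypothetical per-modality position offsets (illustrative values only,
# not taken from the Lance paper or code).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rotate(vec, position, base=10000.0):
    """Apply standard rotary positional encoding to an even-length vector."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair (i, i + 1) is rotated by its own frequency.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def modality_aware_rotate(vec, position, modality):
    """Shift the position by an assumed modality offset, then rotate."""
    return rotate(vec, MODALITY_OFFSET[modality] + position)


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Because rotary scores depend only on the difference between positions, a constant offset leaves attention within one modality unchanged, while tokens from different modalities see each other at a large relative distance. Whether MaPE uses offsets, separate frequency bands, or another mechanism is not stated in the article.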
The model was trained from scratch using no more than 128 A100 GPUs. A staged multitask training approach with capability oriented objectives and adaptive data scheduling drives the separation between semantic comprehension and visual generation across all eight tasks.
What It Means for Filmmakers
The practical case for a unified model is workflow compression. A production that needs AI generated backgrounds, character motion clips, and edited footage currently routes work through multiple separate tools and interfaces. Lance consolidates those steps into a single model under a single Apache 2.0 license, removing per-task licensing complexity for commercial productions.
The video understanding capability covers visual question answering and captioning on footage. This adds a function that specialist generation models typically cannot provide: automated analysis of existing video, which is useful for continuity checking, scene description, and archival tagging.
Lance joins a growing set of open source video generation tools available to filmmakers. MOVA addresses audio-synchronized video generation with a different architectural approach, and LTX 2.3 targets high-resolution latent diffusion video output. Lance's differentiation is the unified architecture across generation, editing, and understanding in one model. Filmmakers can run AI-generated video workflows through AI FILMS Studio's video workspace.
Sources
arXiv: Lance: Unified Multimodal Modeling by Multi-Task Synergy
GitHub: bytedance/Lance
Hugging Face: bytedance-research/Lance
Project Page: lance-project.github.io