Kandinsky 5.0: Open-Source Video Generation Up to 10 Seconds

On November 18, 2025, Kandinsky Lab released Kandinsky 5.0, a family of open-source models for image and video generation. The release includes three model lineups: Video Lite (2B parameters) for fast 10-second video generation, Video Pro (19B parameters) for maximum quality, and Image Lite (6B parameters) for high-resolution images.
All models use the Apache 2.0 license, enabling commercial use without restrictions. The Video Lite model ranks #1 among open-source models in its parameter class, outperforming larger alternatives while maintaining faster generation speeds.
Kandinsky 5.0 Video Lite: 2B Parameters
The Video Lite models generate up to 10 seconds of video at 768×512 or 512×512 resolution and 24 fps. The 2B-parameter architecture provides fast generation while maintaining quality competitive with significantly larger models.
[Video: Kandinsky 5.0 Video Lite generation example]
The system uses a Diffusion Transformer (DiT) backbone with cross-attention to text embeddings from Qwen2.5-VL and CLIP. A HunyuanVideo 3D VAE handles encoding and decoding between pixel space and latent space. Flow Matching replaces traditional diffusion training for improved stability and generation quality.
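To make the Flow Matching idea concrete, here is a minimal sketch of the common linear-interpolation (rectified-flow) formulation: a sample on the straight path between data and noise, and the constant velocity the network is trained to predict. The exact schedule Kandinsky 5.0 uses is not specified here, so treat this as an illustration of the general technique, not the model's implementation.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolation flow matching:
    x0 = data latent, x1 = Gaussian noise, t in [0, 1].
    Returns the noisy sample at time t and the velocity target."""
    x_t = (1.0 - t) * x0 + t * x1   # straight-line path from data to noise
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target

# Toy check on a 2-element "latent"
x0 = np.array([1.0, -2.0])   # data
x1 = np.array([0.5, 0.5])    # noise
x_t, v = flow_matching_target(x0, x1, t=0.25)
```

Training then regresses the network's output toward `v_target`, and sampling integrates the learned velocity field from noise back to data, which tends to be more stable than classic noise-prediction diffusion.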
[Video: complex motion and camera-movement generation]
Eight model variants offer different optimization trade-offs: SFT models provide the highest generation quality, pretrain models enable fine-tuning, distilled models reduce inference steps from 50 to 16, and no-CFG models eliminate classifier-free guidance overhead.
Video Pro: 19B Parameters for Maximum Quality
The Video Pro models use 19 billion parameters and target applications requiring maximum generation quality. These models generate HD video at 1280×768 resolution and 24 fps, delivering richer motion dynamics and precise camera control.
[Video: high-quality generation from Video Pro models]
Video Pro handles complex prompts in both English and Russian with strong understanding of cultural concepts. The models underwent supervised fine-tuning on data manually selected by expert artists to enhance aesthetic quality.
NABLA: Efficient Long Video Generation
Kandinsky 5.0's 10 second generation capability uses NABLA (Neighborhood Adaptive Block-Level Attention), a sparse attention algorithm that reduces computational requirements for longer sequences.
[Video: extended temporal coherence across 10-second generations]
Traditional attention mechanisms scale quadratically with sequence length, making 10-second video generation computationally prohibitive. NABLA applies block-level attention that focuses computation on relevant spatiotemporal neighborhoods, dramatically reducing memory and processing requirements while maintaining quality.
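The core block-sparsity idea can be sketched in a few lines: partition the token sequence into blocks and let each query block attend only to key blocks within a neighborhood. Note that NABLA's actual neighborhood selection is adaptive and data-dependent; the fixed radius below is a simplification to show why the computation shrinks.

```python
import numpy as np

def block_neighborhood_mask(num_blocks, radius):
    """Boolean (num_blocks x num_blocks) mask: query block i may attend
    to key block j only if |i - j| <= radius. True = compute this pair."""
    idx = np.arange(num_blocks)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

mask = block_neighborhood_mask(num_blocks=16, radius=2)
density = mask.mean()   # fraction of block pairs actually computed (~0.29 here)
```

Dense attention would compute all 16×16 block pairs; the masked version computes under a third of them, and the saving grows as sequences get longer while the radius stays fixed.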
Multi-Stage Training Pipeline
Kandinsky 5.0's quality results from a comprehensive training approach spanning multiple stages with different data and objectives.
Pretraining uses large datasets covering diverse visual domains and concepts. The Kandinsky T2I and Kandinsky T2V datasets provide broad coverage enabling the model to understand wide ranging prompts.
Supervised finetuning (SFT) uses carefully curated data selected by expert artists. This stage significantly boosts visual quality and aesthetic appeal compared to pretrain only models.
Reinforcement learning post-training further refines outputs based on human preference modeling. The system learns to generate content that better aligns with user expectations and artistic standards.
[Image: sophisticated material rendering and lighting from multi-stage training]
Classifier-free guidance (CFG) distillation and diffusion distillation enable faster inference. The distilled models reduce generation from 50 steps to 16 while maintaining comparable quality, enabling near-real-time generation for applications requiring quick iteration.
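For context, standard classifier-free guidance runs the network twice per step, once with the text condition and once without, and extrapolates between the two predictions. CFG distillation trains a student to produce the guided result in a single pass. The formula below is the standard CFG combination, shown as a sketch:

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance_scale."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

u = np.array([0.0, 1.0])   # unconditional prediction (toy values)
c = np.array([1.0, 1.0])   # text-conditional prediction
g = cfg_combine(u, c, guidance_scale=5.0)
```

A CFG-distilled model skips the second forward pass entirely, so combined with step distillation (50 steps down to 16) the number of network evaluations per video drops substantially.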
Image Lite: 6B Parameter Image Generation
The Image Lite models generate high-resolution images at 1280×768 or 1024×1024 from text prompts. The 6B-parameter architecture balances quality with efficiency for both text-to-image and image-to-image tasks.
The models handle precise details and accurate text rendering, a challenge for many generation systems. Support for both English and Russian prompts with strong cultural understanding makes the system accessible to broader audiences.
Image-to-image capabilities enable style transfer, image variation, blending, and editing workflows. Users provide source images and modification instructions, and the system generates results maintaining core composition while applying requested changes.
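A common way diffusion and flow models implement image-to-image editing is to noise the source image partway along the schedule and denoise from there, so a "strength" parameter controls how much of the trajectory runs. Whether Kandinsky 5.0 exposes exactly this parameter is an assumption; the sketch shows the widespread convention:

```python
def img2img_steps(num_steps, strength):
    """How many denoising steps actually run in image-to-image mode.
    strength=0.0 returns the input nearly unchanged; strength=1.0
    is equivalent to full text-to-image generation."""
    skipped = int(num_steps * (1.0 - strength))  # replaced by noising the source
    return num_steps - skipped

steps = img2img_steps(num_steps=50, strength=0.6)   # 30 of 50 steps run
```

Low strength preserves the source composition (few steps to diverge); high strength follows the text prompt more freely.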
Performance and Speed
Video Lite generates 5-second videos in approximately 30 seconds on H100 GPUs after compilation. The 10-second models, using NABLA attention, maintain reasonable generation times despite doubled output length.
Distilled 16-step models reduce generation time significantly compared to the 50-step base models. For production workflows requiring rapid iteration, distilled models provide practical trade-offs between quality and speed.
The models run on consumer GPUs with appropriate VRAM, though generation speeds scale with hardware capabilities. The open source release enables deployment flexibility from cloud infrastructure to local workstations.
Applications for AI Filmmakers
Kandinsky 5.0's capabilities enable several production workflows:
Concept Visualization: Generate 10 second sequences showing scene concepts, character actions, or camera movements for previsualization and creative exploration.
Stock Footage Generation: Create custom stock clips matching specific requirements impossible to source through traditional libraries.
Iteration Speed: The 2B parameter Video Lite models enable rapid generation cycles for testing different approaches and refining creative direction.
Cultural Content: Strong Russian language and concept understanding enables creators working with diverse cultural content or international audiences.
Commercial Use: Apache 2.0 licensing permits commercial applications without usage restrictions or royalty payments.
Open Source and Community
Kandinsky Lab released full code, model weights, and training checkpoints through GitHub and Hugging Face. The Apache 2.0 license enables commercial use, modification, and distribution without restrictions.
Integration with Diffusers library provides standardized APIs compatible with existing workflows. Community contributions already include ComfyUI nodes and optimization implementations.
The research team published comprehensive technical documentation covering architecture details, training procedures, and implementation notes. This transparency enables researchers to build on Kandinsky's foundations and adapt the models for specialized applications.
Limitations and Considerations
Video Lite models target SD resolution (768×512, 512×512). For HD output, Video Pro models provide 1280×768 generation at higher computational cost.
Generation quality improves with SFT models over pretrain-only versions. Production use should prefer SFT or distilled-SFT checkpoints for best results.
The 10-second generation limit constrains certain applications. Longer sequences require generating and concatenating multiple clips, which may introduce consistency challenges at boundaries.
Text rendering, while improved, remains imperfect for complex typography or extensive text content. Applications requiring pixel perfect text should combine generation with traditional compositing.
Implementation
Installation requires Python 3.11+, PyTorch, and appropriate CUDA drivers. The GitHub repository provides setup instructions and dependencies.
Multiple attention backends (FlashAttention 2/3, SDPA, SAGE) offer different performance characteristics. The system automatically selects optimal backends based on available hardware.
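Automatic backend selection typically reduces to a preference-ordered fallback: try the fastest kernel the hardware supports, then fall back to a universally available one. The ordering and names below are illustrative assumptions, not Kandinsky's actual selection policy.

```python
def pick_attention_backend(available):
    """Return the first available backend in a preference order.
    `available` is a set of backend names detected on the host.
    The preference order here is an illustrative assumption."""
    preference = ["flash_attention_3", "flash_attention_2", "sage", "sdpa"]
    for name in preference:
        if name in available:
            return name
    return "sdpa"  # PyTorch scaled_dot_product_attention as the safe default

backend = pick_attention_backend({"sdpa", "sage"})
```

All backends compute the same attention result; the choice only affects speed and memory, so falling back never changes outputs beyond numerical precision.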
Generation parameters include guidance scale, negative prompts, resolution, frame count, and inference steps. Experimentation with these parameters enables tuning for specific quality or speed requirements.
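Resolution and frame count matter because they set the size of the latent tensor the DiT attends over. The sketch below estimates that size assuming a HunyuanVideo-style causal VAE with 4× temporal and 8× spatial compression and 16 latent channels; these factors, and the "4k+1" frame-count convention, are assumptions for illustration rather than confirmed Kandinsky 5.0 values.

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, channels=16):
    """Estimate the latent-tensor shape for an assumed causal video VAE
    with t_down temporal and s_down spatial compression."""
    t = (frames - 1) // t_down + 1   # causal VAE encodes the first frame separately
    return (channels, t, height // s_down, width // s_down)

# ~10 s at 24 fps (241 frames under a 4k+1 convention) at 768x512
shape = latent_shape(frames=241, height=512, width=768)
tokens = shape[1] * shape[2] * shape[3]   # sequence length the DiT attends over
```

Even at SD resolution the sequence runs to hundreds of thousands of latent positions, which is exactly why quadratic attention becomes prohibitive and NABLA's block sparsity is needed for 10-second clips.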
You can explore AI video generation models at AI FILMS Studio.
Resources:
- Project Page: https://kandinskylab.ai/
- GitHub Repository: https://github.com/kandinskylab/kandinsky-5
- Research Paper: https://arxiv.org/abs/2511.14993
- Hugging Face: https://huggingface.co/papers/2511.14993


