BlockVid: Semi-Autoregressive Framework Generates Minute-Long Videos With Coherence

Alibaba DAMO Academy has released BlockVid, a semi-autoregressive block diffusion framework that addresses coherence challenges in minute-long video generation. The system combines semantic sparse KV caching, block forcing, and noise scheduling to maintain consistency across extended sequences.
Block Diffusion Architecture
BlockVid generates video in chunks rather than frame by frame. Each chunk contains multiple frames processed together through diffusion, and the system conditions each new chunk on the previous one while drawing retrieved semantic context from earlier sections.
[Demo: extended sequence maintaining visual consistency]
[Demo: long-form narrative with coherent scene transitions]
Architecture components:
- Semi-autoregressive chunk generation
- Semantic sparse KV cache for context
- Dynamic retrieval of relevant prior chunks
- Block level diffusion denoising
- 3D causal VAE compression
The semi-autoregressive approach bridges diffusion and autoregressive methods: diffusion operates within each block, while autoregressive conditioning connects blocks sequentially. This combination produces more coherent sequences than pure diffusion while maintaining quality.
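As a rough, non-authoritative sketch of that loop (every function and the toy "denoiser" here are stand-ins, not the BlockVid API), generation might proceed like this:

```python
import torch
import torch.nn.functional as F

# Toy sketch of semi-autoregressive block generation. All names are
# hypothetical stand-ins, not the actual BlockVid implementation.

def retrieve_topk(cache, query_emb, k=2):
    """Return the k cached chunk latents whose prompt embeddings are most
    similar to the current prompt embedding."""
    if not cache:
        return []
    sims = torch.stack([F.cosine_similarity(query_emb, emb, dim=0)
                        for emb, _ in cache])
    idx = sims.topk(min(k, len(cache))).indices
    return [cache[i][1] for i in idx]

def generate(num_chunks, prompt_embs, denoiser, latent_shape, steps=4):
    cache, prev, chunks = [], None, []
    for i in range(num_chunks):
        z = torch.randn(latent_shape)               # each block starts from noise
        ctx = retrieve_topk(cache, prompt_embs[i])  # global semantic context
        for _ in range(steps):                      # diffusion within the block
            z = z - 0.25 * denoiser(z, prev, ctx)   # toy denoising update
        cache.append((prompt_embs[i], z))           # remember this chunk
        prev = z                                    # local AR conditioning
        chunks.append(z)
    return torch.cat(chunks, dim=1)                 # concatenate along time

# Trivial stand-in "denoiser" that ignores its conditioning inputs.
denoiser = lambda z, prev, ctx: 0.1 * z
latents = generate(3, [torch.randn(8) for _ in range(3)], denoiser,
                   latent_shape=(16, 4, 32, 32))
print(latents.shape)  # torch.Size([16, 12, 32, 32])
```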
Semantic Sparse KV Caching
BlockVid maintains a compact memory of salient information from previous chunks. Rather than storing the complete video history, the system extracts semantically important key/value pairs.
When generating chunk c+1, the framework retrieves the top-k semantically similar prior chunks based on prompt-embedding similarity. This global context supplements the local conditioning from the immediately preceding chunk.
[Demo: character consistency across a minute-long sequence]
[Demo: environment continuity in extended generation]
KV cache mechanism:
- Extracts salient tokens from each generated chunk
- Stores semantic representations rather than raw frames
- Updates dynamically as generation progresses
- Retrieves relevant context via embedding similarity
- Manages memory efficiently for long sequences
This approach prevents the information dilution common in naive long-video generation: early chunks remain accessible through semantic retrieval without requiring complete history retention.
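A minimal sketch of such a cache, assuming key-vector norm as a stand-in saliency score and cosine similarity for retrieval (both are assumptions, not the paper's exact criteria):

```python
import torch

# Toy semantic sparse KV cache. Saliency and retrieval criteria here are
# assumed for illustration, not taken from the BlockVid paper.

class SparseKVCache:
    def __init__(self, keep_ratio=0.1):
        self.keep_ratio = keep_ratio
        self.entries = []  # list of (prompt_emb, keys, values)

    def add(self, prompt_emb, keys, values):
        n_keep = max(1, int(keys.shape[0] * self.keep_ratio))
        saliency = keys.norm(dim=-1)            # proxy for token importance
        idx = saliency.topk(n_keep).indices
        self.entries.append((prompt_emb, keys[idx], values[idx]))

    def retrieve(self, query_emb, k=2):
        if not self.entries:
            return None, None
        sims = torch.stack([torch.dot(query_emb, e) /
                            (query_emb.norm() * e.norm() + 1e-8)
                            for e, _, _ in self.entries])
        top = sims.topk(min(k, len(self.entries))).indices
        keys = torch.cat([self.entries[i][1] for i in top])
        values = torch.cat([self.entries[i][2] for i in top])
        return keys, values

cache = SparseKVCache(keep_ratio=0.1)
cache.add(torch.randn(8), torch.randn(1024, 64), torch.randn(1024, 64))
k, v = cache.retrieve(torch.randn(8))
print(k.shape)  # only ~10% of tokens are retained per chunk
```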
Block Forcing Training Strategy
Block forcing stabilizes long video generation by combining two objectives: velocity forcing and self forcing.
Velocity forcing aligns predicted dynamics with semantic history to prevent drift. The model learns consistency between motion patterns and established narrative context.
Self forcing closes the training-inference gap by exposing the model to its own rollouts during training. Rather than training exclusively on ground-truth sequences, the system experiences its own generation artifacts and learns to maintain realism despite them.
Together, the two objectives prevent the common failure modes of autoregressive generation: semantic drift, where the narrative loses coherence, and quality degradation, where visual fidelity decays over time.
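One plausible reading of the combined objective, sketched as a training step (the loss forms and weights are assumptions, not the paper's equations):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a block-forcing style training step -- my reading of the
# two objectives, not the paper's exact losses. `model` maps a sequence of
# chunks (plus a history embedding) to predictions for the next chunks.

def block_forcing_step(model, gt_chunks, history_emb, alpha=1.0, beta=0.5):
    # Velocity forcing: align predicted chunk-to-chunk dynamics with the
    # ground-truth dynamics, conditioned on semantic history.
    pred = model(gt_chunks[:-1], history_emb)         # predict chunks 1..n-1
    gt_velocity = gt_chunks[1:] - gt_chunks[:-1]
    pred_velocity = pred - gt_chunks[:-1]
    loss_velocity = F.mse_loss(pred_velocity, gt_velocity)

    # Self forcing: feed the model its OWN (detached) predictions so training
    # sees the distribution it will face at inference time.
    rollout = model(pred[:-1].detach(), history_emb)  # predict chunks 2..n-1
    loss_self = F.mse_loss(rollout, gt_chunks[2:])

    return alpha * loss_velocity + beta * loss_self

# Toy stand-in model: identity plus small noise.
model = lambda x, h: x + 0.05 * torch.randn_like(x)
loss = block_forcing_step(model, torch.randn(4, 16, 32), torch.randn(8))
print(loss.item())
```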
Noise Scheduling Strategy
BlockVid applies noise scheduling during both training and inference.
Progressive noise scheduling (training): Noise levels increase gradually across chunks. Early chunks train with lower noise; later chunks experience higher noise. This prepares the model for the uncertainty that accumulates over long sequences.
Noise shuffling (inference): Introduces local randomness at chunk boundaries to smooth transitions. Rather than letting deterministic noise patterns create visible seams, shuffling maintains a natural flow between sections.
[Demo: smooth temporal coherence across extended duration]
The combined strategy suppresses chunk-boundary artifacts while maintaining per-chunk quality, keeping transitions between sections largely imperceptible despite block-based generation.
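A toy rendering of the two strategies (the functional forms are assumptions, not the paper's exact schedules):

```python
import torch

# Illustrative sketch of the two noise strategies described above.

def progressive_sigma(chunk_idx, num_chunks, sigma_min=0.4, sigma_max=1.0):
    """Training: later chunks are diffused at higher noise levels, preparing
    the model for uncertainty that accumulates over long sequences."""
    t = chunk_idx / max(1, num_chunks - 1)
    return sigma_min + t * (sigma_max - sigma_min)

def shuffled_boundary_noise(shape, boundary_frames=2):
    """Inference: permute the noise of the frames nearest a chunk boundary so
    adjacent blocks do not share a deterministic seam pattern."""
    noise = torch.randn(shape)                 # (C, T, H, W)
    t_idx = torch.arange(shape[1])
    perm = torch.randperm(boundary_frames)
    t_idx[:boundary_frames] = t_idx[:boundary_frames][perm]
    return noise[:, t_idx]

for i in range(4):
    print(f"chunk {i}: sigma = {progressive_sigma(i, 4):.2f}")
print(shuffled_boundary_noise((16, 8, 32, 32)).shape)
```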
LV-Bench: Minute Long Video Benchmark
The research also contributes LV-Bench, a curated benchmark of 1,000 minute-long videos for evaluating long-horizon generation.
Dataset curation:
- Sourced from DanceTrack, GOT-10k, HD-VILA-100M, ShareGPT4V
- Minimum duration of 50 seconds per video
- GPT-4o generates fine-grained captions every 2-3 seconds
- Human-in-the-loop validation at every stage
- 8:2 train/evaluation split
Quality standards: Human annotators validate data sourcing, chunk splitting, and captioning. At least two annotators review each stage to establish inter-rater reliability, and visual checks confirm correct transitions and semantic accuracy.
Video Drift Error (VDE) Metrics
LV-Bench introduces five metrics measuring temporal consistency degradation.
VDE Clarity: Measures temporal drift in image sharpness. Creeping blur increases the score; low values indicate stable clarity.
VDE Motion: Measures drift in motion smoothness. Low scores indicate consistent dynamics without jitter or freezing.
VDE Aesthetic: Measures drift in visual appeal. Low scores indicate sustained coherent aesthetics.
VDE Background: Measures background stability. Low scores indicate consistent setting without drift or flicker.
VDE Subject: Tracks identity drift. Low scores indicate the subject remains consistently recognizable.
[Demo: long-range coherence across a full minute]
The metrics derive from the Mean Absolute Percentage Error (MAPE), dividing long videos into segments that are each evaluated on a specific quality dimension. This provides a granular assessment of where and how quality degrades.
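The official metric scripts ship with LV-Bench; the sketch below only illustrates the MAPE-style shape of such a drift score, with a toy sharpness proxy standing in for the real clarity estimator:

```python
import torch

# Hedged sketch of a MAPE-style drift metric in the spirit of VDE. This is
# an approximation of the idea; the LV-Bench scripts define the exact form.

def vde_drift(per_segment_scores):
    """Mean absolute percentage deviation of each segment's quality score
    from the first segment's score. Lower = less drift."""
    s = torch.as_tensor(per_segment_scores, dtype=torch.float32)
    ref = s[0]
    return (torch.abs(s[1:] - ref) / (torch.abs(ref) + 1e-8)).mean().item()

def clarity_score(frames):
    """Toy clarity proxy: variance of a Laplacian-like response.
    Real VDE-Clarity would use a proper sharpness estimator."""
    # frames: (T, H, W) grayscale
    lap = (frames[:, 1:-1, 1:-1] * 4
           - frames[:, :-2, 1:-1] - frames[:, 2:, 1:-1]
           - frames[:, 1:-1, :-2] - frames[:, 1:-1, 2:])
    return lap.var().item()

segments = [torch.rand(16, 64, 64) for _ in range(6)]   # 6 video segments
scores = [clarity_score(seg) for seg in segments]
print(f"VDE-Clarity (toy): {vde_drift(scores):.3f}")
```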
Inferix Integration
The BlockVid research also produced Inferix, a next-generation inference engine for world simulation. The engine specifically optimizes semi-autoregressive decoding.
Inferix capabilities:
- Advanced KV cache management for persistent world simulation
- Distributed world synthesis for large scale environment generation
- Video streaming with RTMP and WebRTC protocols
- Seamless model integration API
- Performance monitoring and profiling
- Continuous prompt support for dynamic narrative control
- 8-bit quantization support (INT8/FP8)
Inferix differs from high-concurrency serving systems (vLLM, SGLang) and classic video diffusion engines (xDiT). It focuses on immersive world synthesis rather than serving many users simultaneously.
Technical Specifications
Video compression: A 3D causal VAE compresses the spatiotemporal dimensions to (1 + T/4) × H/8 × W/8 while expanding the channel dimension to 16. The first frame is compressed spatially only, to better handle image guidance.
Chunk structure: Each video chunk V_i ∈ R^((1+T) × H × W × 3) contains T frames plus an initial guidance frame. Chunk-level prompts y_i condition the generation.
Latent processing: The block diffusion denoiser processes the latent representation Z to produce a denoised latent Z̃. The semantic sparse KV cache is constructed during this procedure, preserving salient keys and values.
Memory efficiency: Semantic retrieval eliminates the need for complete history retention; the system accesses relevant prior context without storing all intermediate states.
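The shape arithmetic implied by those ratios is easy to check; a minimal sketch (the example resolution is an assumption, not a documented setting):

```python
# Shape arithmetic for the 3D causal VAE compression described above,
# derived from the stated ratios: time /4 after the first frame,
# space /8, channels expanded to 16.

def latent_shape(T, H, W):
    """A video of (1+T) frames at H x W -> latent of shape (C, T', H', W')."""
    return (16, 1 + T // 4, H // 8, W // 8)

# For example, a 1+80 frame chunk at an assumed 480x832 resolution:
print(latent_shape(T=80, H=480, W=832))   # (16, 21, 60, 104)
```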
Training Process
BlockVid trains on single-shot long videos V = {V_1, V_2, ..., V_n} with corresponding chunk-level prompts Y = {y_i}.
Stage 1: Learn block diffusion denoising on individual chunks.
Stage 2: Introduce block forcing to stabilize multi-chunk sequences. Velocity forcing prevents semantic drift; self forcing reduces the training-inference gap.
Stage 3: Apply progressive noise scheduling to prepare model for long sequence uncertainty accumulation.
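Condensed into a schedule description, the curriculum might look like the sketch below (stage names follow the text; every numeric value is invented purely for illustration):

```python
# Hedged sketch of the three-stage curriculum. All numbers are invented
# placeholders, not the paper's hyperparameters.
TRAINING_STAGES = [
    {"stage": 1, "objective": "block_diffusion",    # per-chunk denoising
     "chunks_per_sample": 1},
    {"stage": 2, "objective": "block_forcing",      # velocity + self forcing
     "chunks_per_sample": 4, "self_forcing_weight": 0.5},
    {"stage": 3, "objective": "progressive_noise",  # rising sigma across chunks
     "chunks_per_sample": 8, "sigma_range": (0.4, 1.0)},
]
for stage in TRAINING_STAGES:
    print(stage)
```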
The training strategy requires less data than naive approaches: rather than relying on massive minute-long video datasets, the system learns from shorter sequences and extends its capability through architectural design.
Use Cases and Applications
World modeling: Persistent simulation for gaming, embodied AI, and agentic systems requiring extended coherent environments.
Long-form content: Minute-long narratives maintaining character and setting consistency throughout.
Interactive generation: Dynamic prompt control enables narrative branching and user-guided story development.
Video streaming: Real-time generation with RTMP/WebRTC delivery for interactive applications.
Research applications: LV-Bench provides standardized evaluation for long video generation methods.
Licensing and Availability
The Apache 2.0 License covers the Inferix inference engine. This permissive open-source license allows:
- Commercial use
- Modification and distribution
- Private deployment
- Patent grant
Code availability:
- Inferix engine: GitHub (alibaba-damo-academy/Inferix)
- Model integration guide included
- Example configurations for multiple models
- Installation documentation
LV-Bench dataset:
- Available on Hugging Face (heyuanyu/LV-Bench)
- 1,000 minute-long videos
- Fine-grained annotations
- VDE metric computation scripts
Performance Characteristics
BlockVid strikes a favorable balance between generation quality and temporal coherence for minute-long sequences.
Advantages over standard diffusion:
- Maintains consistency across extended duration
- Prevents semantic drift through velocity forcing
- Reduces quality degradation via self forcing
- Manages memory efficiently through sparse KV caching
Comparison to naive autoregressive:
- Higher per-frame quality through block diffusion
- Smoother transitions via noise scheduling
- Better long range coherence through semantic retrieval
Technical Limitations
Chunk boundary artifacts: Despite noise shuffling, attentive viewers may still detect subtle transitions between blocks under careful examination.
Memory requirements: The semantic KV cache demands substantial memory for very long (multi-minute) generations.
Computational cost: Denoising an entire chunk is more expensive than generating a single frame, trading speed for quality.
Training complexity: Block forcing and progressive noise scheduling increase training difficulty compared to standard diffusion.
Research Team
A collaboration between:
- Zhejiang University
- Hong Kong University of Science and Technology
- Alibaba DAMO Academy
- Alibaba TRE
Key contributors: Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang
Future Directions
Planned features:
- Complex KV management enhancements
- Support for fine-tuning pretrained models into semi-AR form
- Model distillation into fewer steps
- High-concurrency deployment
- Advanced real-time streaming
- Interactive world models
- Enhanced simulation capabilities
- Persistent world state management
The roadmap emphasizes world simulation over simple video generation, with the goal of enabling persistent, interactive environments for gaming and embodied AI.
Implementation Guide
Installation: Detailed instructions are in the Inferix repository. A standard deep learning stack is required: PyTorch, CUDA, and appropriate GPU hardware.
Model integration: Multiple base models are supported; Wan-1.3B is provided as a widely used foundation. Custom models integrate through a defined pipeline interface.
Configuration: YAML/JSON configs cover model parameters, generation settings, KV cache management, and streaming options.
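The exact schema lives in the repository; as a purely hypothetical illustration (every field name below is invented, not the real Inferix schema), such a config might carry entries like:

```python
# Purely hypothetical config sketch -- field names are invented for
# illustration and do NOT reflect the actual Inferix schema.
config = {
    "model": {"name": "Wan-1.3B", "precision": "fp8"},        # INT8/FP8 supported
    "generation": {"num_chunks": 15, "frames_per_chunk": 16,
                   "resolution": [480, 832]},
    "kv_cache": {"strategy": "semantic_sparse", "top_k": 2},  # sparse retrieval
    "streaming": {"protocol": "rtmp",                         # or "webrtc"
                  "url": "rtmp://localhost/live/demo"},
}
```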
Example workflows:
- Self Forcing configuration
- CausVid setup
- MAGI-1 integration
- Custom model templates
Getting Started
- Clone Inferix repository
- Follow installation guide for dependencies
- Download or integrate desired model
- Configure generation parameters
- Run example scripts to verify setup
- Customize for specific use case
Documentation includes model integration guide, example configurations, and pipeline implementation details.
Evaluation Methodology
LV-Bench provides a standardized assessment for minute-long generation:
Metrics suite:
- 5 VDE metrics (clarity, motion, aesthetic, background, subject)
- 5 complementary VBench metrics
- Granular temporal analysis
- Drift quantification
Usage: Evaluate generated videos against the benchmark and compare VDE scores across methods; lower VDE values indicate better temporal consistency.
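In practice, that comparison reduces to a lower-is-better ranking per metric; a trivial sketch with made-up scores:

```python
# Toy comparison of VDE scores across two methods. All numbers are
# made-up placeholders, not measured results.
vde_scores = {
    "method_a": {"clarity": 0.12, "motion": 0.08, "subject": 0.15},
    "method_b": {"clarity": 0.21, "motion": 0.11, "subject": 0.09},
}
for metric in ("clarity", "motion", "subject"):
    winner = min(vde_scores, key=lambda m: vde_scores[m][metric])
    print(f"VDE-{metric}: best (lowest drift) -> {winner}")
```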
Broader Context
BlockVid addresses a fundamental challenge in video generation: maintaining coherence over extended durations. Most methods generate clips only seconds long; minute-long sequences introduce qualitatively different challenges.
Key innovations:
- Semi-AR architecture bridges diffusion and autoregressive benefits
- Semantic retrieval prevents information dilution
- Block forcing stabilizes training for long sequences
- LV-Bench enables standardized long-form evaluation
The research advances the state of the art in extended video generation while providing infrastructure (Inferix) and benchmarks (LV-Bench) for community development.
Sources:
- BlockVid Project Page: https://ziplab.co/BlockVid/
- Inferix GitHub Repository: https://github.com/alibaba-damo-academy/Inferix
- Technical Report (arXiv): https://arxiv.org/abs/2511.22973
- Inferix Technical Report (arXiv): https://arxiv.org/abs/2511.20714
- LV-Bench Dataset: https://huggingface.co/datasets/heyuanyu/LV-Bench


