BlockVid: Semi-Autoregressive Framework Generates Minute-Long Videos With Coherence

Alibaba DAMO Academy has released BlockVid, a semi-autoregressive block diffusion framework that addresses coherence challenges in minute-long video generation. The system combines semantic sparse KV caching, block forcing, and noise scheduling to maintain consistency across extended sequences.
Block Diffusion Architecture
BlockVid generates video in chunks rather than frame by frame. Each chunk contains multiple frames processed together through diffusion, and the system conditions each new chunk on the previous one while drawing retrieved semantic context from earlier sections.
[Demo: extended sequence maintaining visual consistency]
[Demo: long-form narrative with coherent scene transitions]
Architecture components:
- Semi-autoregressive chunk generation
- Semantic sparse KV cache for context
- Dynamic retrieval of relevant prior chunks
- Block level diffusion denoising
- 3D causal VAE compression
The semi-autoregressive approach bridges diffusion and autoregressive methods: diffusion operates within each block, while autoregressive conditioning connects blocks sequentially. This combination produces more coherent sequences than pure diffusion while maintaining quality.
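As a rough, non-authoritative sketch of that loop (every function and the toy "denoiser" here are stand-ins, not the BlockVid API), generation might proceed like this:

```python
import torch
import torch.nn.functional as F

# Toy sketch of semi-autoregressive block generation. All names are
# hypothetical stand-ins, not the actual BlockVid implementation.

def retrieve_topk(cache, query_emb, k=2):
    """Return the k cached chunk latents whose prompt embeddings are most
    similar to the current prompt embedding."""
    if not cache:
        return []
    sims = torch.stack([F.cosine_similarity(query_emb, emb, dim=0)
                        for emb, _ in cache])
    idx = sims.topk(min(k, len(cache))).indices
    return [cache[i][1] for i in idx]

def generate(num_chunks, prompt_embs, denoiser, latent_shape, steps=4):
    cache, prev, chunks = [], None, []
    for i in range(num_chunks):
        z = torch.randn(latent_shape)               # each block starts from noise
        ctx = retrieve_topk(cache, prompt_embs[i])  # global semantic context
        for _ in range(steps):                      # diffusion within the block
            z = z - 0.25 * denoiser(z, prev, ctx)   # toy denoising update
        cache.append((prompt_embs[i], z))           # remember this chunk
        prev = z                                    # local AR conditioning
        chunks.append(z)
    return torch.cat(chunks, dim=1)                 # concatenate along time

# Trivial stand-in "denoiser" that ignores its conditioning inputs.
denoiser = lambda z, prev, ctx: 0.1 * z
latents = generate(3, [torch.randn(8) for _ in range(3)], denoiser,
                   latent_shape=(16, 4, 32, 32))
print(latents.shape)  # torch.Size([16, 12, 32, 32])
```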
Semantic Sparse KV Caching
BlockVid maintains a compact memory of salient information from previous chunks. Rather than storing the complete video history, the system extracts semantically important key/value pairs.
When generating chunk c+1, the framework retrieves the top-k semantically similar prior chunks based on prompt-embedding similarity. This global context supplements the local conditioning from the immediately preceding chunk.
[Demo: character consistency across a minute-long sequence]
[Demo: environment continuity in extended generation]
KV cache mechanism:
- Extracts salient tokens from each generated chunk
- Stores semantic representations rather than raw frames
- Updates dynamically as generation progresses
- Retrieves relevant context via embedding similarity
- Manages memory efficiently for long sequences
This approach prevents the information dilution common in naive long-video generation: early chunks remain accessible through semantic retrieval without requiring complete history retention.
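A minimal sketch of such a cache, assuming key-vector norm as a stand-in saliency score and cosine similarity for retrieval (both are assumptions, not the paper's exact criteria):

```python
import torch

# Toy semantic sparse KV cache. Saliency and retrieval criteria here are
# assumed for illustration, not taken from the BlockVid paper.

class SparseKVCache:
    def __init__(self, keep_ratio=0.1):
        self.keep_ratio = keep_ratio
        self.entries = []  # list of (prompt_emb, keys, values)

    def add(self, prompt_emb, keys, values):
        n_keep = max(1, int(keys.shape[0] * self.keep_ratio))
        saliency = keys.norm(dim=-1)            # proxy for token importance
        idx = saliency.topk(n_keep).indices
        self.entries.append((prompt_emb, keys[idx], values[idx]))

    def retrieve(self, query_emb, k=2):
        if not self.entries:
            return None, None
        sims = torch.stack([torch.dot(query_emb, e) /
                            (query_emb.norm() * e.norm() + 1e-8)
                            for e, _, _ in self.entries])
        top = sims.topk(min(k, len(self.entries))).indices
        keys = torch.cat([self.entries[i][1] for i in top])
        values = torch.cat([self.entries[i][2] for i in top])
        return keys, values

cache = SparseKVCache(keep_ratio=0.1)
cache.add(torch.randn(8), torch.randn(1024, 64), torch.randn(1024, 64))
k, v = cache.retrieve(torch.randn(8))
print(k.shape)  # only ~10% of tokens are retained per chunk
```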
Block Forcing Training Strategy
Block forcing stabilizes long video generation by combining two objectives: velocity forcing and self forcing.
Velocity forcing aligns predicted dynamics with semantic history to prevent drift. The model learns consistency between motion patterns and established narrative context.
Self forcing closes the training-inference gap by exposing the model to its own rollouts during training. Rather than training exclusively on ground-truth sequences, the system experiences its own generation artifacts and learns to maintain realism despite them.
Together, the two objectives prevent the common failure modes of autoregressive generation: semantic drift, where the narrative loses coherence, and quality degradation, where visual fidelity decays over time.
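One plausible reading of the combined objective, sketched as a training step (the loss forms and weights are assumptions, not the paper's equations):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a block-forcing style training step -- my reading of the
# two objectives, not the paper's exact losses. `model` maps a sequence of
# chunks (plus a history embedding) to predictions for the next chunks.

def block_forcing_step(model, gt_chunks, history_emb, alpha=1.0, beta=0.5):
    # Velocity forcing: align predicted chunk-to-chunk dynamics with the
    # ground-truth dynamics, conditioned on semantic history.
    pred = model(gt_chunks[:-1], history_emb)         # predict chunks 1..n-1
    gt_velocity = gt_chunks[1:] - gt_chunks[:-1]
    pred_velocity = pred - gt_chunks[:-1]
    loss_velocity = F.mse_loss(pred_velocity, gt_velocity)

    # Self forcing: feed the model its OWN (detached) predictions so training
    # sees the distribution it will face at inference time.
    rollout = model(pred[:-1].detach(), history_emb)  # predict chunks 2..n-1
    loss_self = F.mse_loss(rollout, gt_chunks[2:])

    return alpha * loss_velocity + beta * loss_self

# Toy stand-in model: identity plus small noise.
model = lambda x, h: x + 0.05 * torch.randn_like(x)
loss = block_forcing_step(model, torch.randn(4, 16, 32), torch.randn(8))
print(loss.item())
```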
Noise Scheduling Strategy
BlockVid applies noise scheduling during both training and inference.
Progressive noise scheduling (training): Noise levels increase gradually across chunks. Early chunks train with lower noise; later chunks experience higher noise. This prepares the model for the uncertainty that accumulates over long sequences.
Noise shuffling (inference): Introduces local randomness at chunk boundaries to smooth transitions. Rather than letting deterministic noise patterns create visible seams, shuffling maintains a natural flow between sections.
[Demo: smooth temporal coherence across extended duration]
The combined strategy suppresses chunk-boundary artifacts while maintaining per-chunk quality, keeping transitions between sections largely imperceptible despite block-based generation.
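A toy rendering of the two strategies (the functional forms are assumptions, not the paper's exact schedules):

```python
import torch

# Illustrative sketch of the two noise strategies described above.

def progressive_sigma(chunk_idx, num_chunks, sigma_min=0.4, sigma_max=1.0):
    """Training: later chunks are diffused at higher noise levels, preparing
    the model for uncertainty that accumulates over long sequences."""
    t = chunk_idx / max(1, num_chunks - 1)
    return sigma_min + t * (sigma_max - sigma_min)

def shuffled_boundary_noise(shape, boundary_frames=2):
    """Inference: permute the noise of the frames nearest a chunk boundary so
    adjacent blocks do not share a deterministic seam pattern."""
    noise = torch.randn(shape)                 # (C, T, H, W)
    t_idx = torch.arange(shape[1])
    perm = torch.randperm(boundary_frames)
    t_idx[:boundary_frames] = t_idx[:boundary_frames][perm]
    return noise[:, t_idx]

for i in range(4):
    print(f"chunk {i}: sigma = {progressive_sigma(i, 4):.2f}")
print(shuffled_boundary_noise((16, 8, 32, 32)).shape)
```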
LV-Bench: Minute Long Video Benchmark
The research also contributes LV-Bench, a curated benchmark of 1,000 minute-long videos for evaluating long-horizon generation.
Dataset curation:
- Sourced from DanceTrack, GOT-10k, HD-VILA-100M, ShareGPT4V
- Minimum duration of 50 seconds per video
- GPT-4o generates fine-grained captions every 2-3 seconds
- Human-in-the-loop validation at every stage
- 8:2 train/evaluation split
Quality standards: Human annotators validate data sourcing, chunk splitting, and captioning. At least two annotators review each stage to establish inter-rater reliability, and visual checks confirm correct transitions and semantic accuracy.
Video Drift Error (VDE) Metrics
LV-Bench introduces five metrics measuring temporal consistency degradation.
VDE Clarity: Measures temporal drift in image sharpness. Creeping blur increases the score; low values indicate stable clarity.
VDE Motion: Measures drift in motion smoothness. Low scores indicate consistent dynamics without jitter or freezing.
VDE Aesthetic: Measures drift in visual appeal. Low scores indicate sustained coherent aesthetics.
VDE Background: Measures background stability. Low scores indicate consistent setting without drift or flicker.
VDE Subject: Tracks identity drift. Low scores indicate the subject remains consistently recognizable.
[Demo: long-range coherence across a full minute]
The metrics derive from the Mean Absolute Percentage Error (MAPE), dividing long videos into segments that are each evaluated on a specific quality dimension. This provides a granular assessment of where and how quality degrades.
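The official metric scripts ship with LV-Bench; the sketch below only illustrates the MAPE-style shape of such a drift score, with a toy sharpness proxy standing in for the real clarity estimator:

```python
import torch

# Hedged sketch of a MAPE-style drift metric in the spirit of VDE. This is
# an approximation of the idea; the LV-Bench scripts define the exact form.

def vde_drift(per_segment_scores):
    """Mean absolute percentage deviation of each segment's quality score
    from the first segment's score. Lower = less drift."""
    s = torch.as_tensor(per_segment_scores, dtype=torch.float32)
    ref = s[0]
    return (torch.abs(s[1:] - ref) / (torch.abs(ref) + 1e-8)).mean().item()

def clarity_score(frames):
    """Toy clarity proxy: variance of a Laplacian-like response.
    Real VDE-Clarity would use a proper sharpness estimator."""
    # frames: (T, H, W) grayscale
    lap = (frames[:, 1:-1, 1:-1] * 4
           - frames[:, :-2, 1:-1] - frames[:, 2:, 1:-1]
           - frames[:, 1:-1, :-2] - frames[:, 1:-1, 2:])
    return lap.var().item()

segments = [torch.rand(16, 64, 64) for _ in range(6)]   # 6 video segments
scores = [clarity_score(seg) for seg in segments]
print(f"VDE-Clarity (toy): {vde_drift(scores):.3f}")
```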
Inferix Integration
The BlockVid research also produced Inferix, a next-generation inference engine for world simulation. The engine specifically optimizes semi-autoregressive decoding.
Inferix capabilities:
- Advanced KV cache management for persistent world simulation
- Distributed world synthesis for large scale environment generation
- Video streaming with RTMP and WebRTC protocols
- Seamless model integration API
- Performance monitoring and profiling
- Continuous prompt support for dynamic narrative control
- 8-bit quantization support (INT8/FP8)
Inferix differs from high-concurrency serving systems (vLLM, SGLang) and classic video diffusion engines (xDiT). It focuses on immersive world synthesis rather than serving many users simultaneously.
Technical Specifications
Video compression: A 3D causal VAE compresses the spatiotemporal dimensions to (1 + T/4) × H/8 × W/8 while expanding the channel dimension to 16. The first frame is compressed spatially only, to better handle image guidance.
Chunk structure: Each video chunk V_i ∈ R^((1+T) × H × W × 3) contains T frames plus an initial guidance frame. Chunk-level prompts y_i condition the generation.
Latent processing: The block diffusion denoiser processes the latent representation Z to produce a denoised latent Z̃. The semantic sparse KV cache is constructed during this procedure, preserving salient keys and values.
Memory efficiency: Semantic retrieval eliminates the need for complete history retention; the system accesses relevant prior context without storing all intermediate states.
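The shape arithmetic implied by those ratios is easy to check; a minimal sketch (the example resolution is an assumption, not a documented setting):

```python
# Shape arithmetic for the 3D causal VAE compression described above,
# derived from the stated ratios: time /4 after the first frame,
# space /8, channels expanded to 16.

def latent_shape(T, H, W):
    """A video of (1+T) frames at H x W -> latent of shape (C, T', H', W')."""
    return (16, 1 + T // 4, H // 8, W // 8)

# For example, a 1+80 frame chunk at an assumed 480x832 resolution:
print(latent_shape(T=80, H=480, W=832))   # (16, 21, 60, 104)
```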
Training Process
BlockVid trains on single-shot long videos V = {V_1, V_2, ..., V_n} with corresponding chunk-level prompts Y = {y_i}.
Stage 1: Learn block diffusion denoising on individual chunks.
Stage 2: Introduce block forcing to stabilize multi-chunk sequences. Velocity forcing prevents semantic drift; self forcing reduces the training-inference gap.
Stage 3: Apply progressive noise scheduling to prepare model for long sequence uncertainty accumulation.
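Condensed into a schedule description, the curriculum might look like the sketch below (stage names follow the text; every numeric value is invented purely for illustration):

```python
# Hedged sketch of the three-stage curriculum. All numbers are invented
# placeholders, not the paper's hyperparameters.
TRAINING_STAGES = [
    {"stage": 1, "objective": "block_diffusion",    # per-chunk denoising
     "chunks_per_sample": 1},
    {"stage": 2, "objective": "block_forcing",      # velocity + self forcing
     "chunks_per_sample": 4, "self_forcing_weight": 0.5},
    {"stage": 3, "objective": "progressive_noise",  # rising sigma across chunks
     "chunks_per_sample": 8, "sigma_range": (0.4, 1.0)},
]
for stage in TRAINING_STAGES:
    print(stage)
```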
The training strategy requires less data than naive approaches: rather than relying on massive minute-long video datasets, the system learns from shorter sequences and extends its capability through architectural design.
Use Cases and Applications
World modeling: Persistent simulation for gaming, embodied AI, and agentic systems requiring extended coherent environments.
Long-form content: Minute-long narratives maintaining character and setting consistency throughout.
Interactive generation: Dynamic prompt control enables narrative branching and user-guided story development.
Video streaming: Real-time generation with RTMP/WebRTC delivery for interactive applications.
Research applications: LV-Bench provides standardized evaluation for long video generation methods.
Licensing and Availability
The Apache 2.0 License covers the Inferix inference engine. This permissive open-source license allows:
- Commercial use
- Modification and distribution
- Private deployment
- Patent grant
Code availability:
- Inferix engine: GitHub (alibaba-damo-academy/Inferix)
- Model integration guide included
- Example configurations for multiple models
- Installation documentation
LV-Bench dataset:
- Available on Hugging Face (heyuanyu/LV-Bench)
- 1,000 minute-long videos
- Fine-grained annotations
- VDE metric computation scripts
Performance Characteristics
BlockVid strikes a favorable balance between generation quality and temporal coherence for minute-long sequences.
Advantages over standard diffusion:
- Maintains consistency across extended duration
- Prevents semantic drift through velocity forcing
- Reduces quality degradation via self forcing
- Manages memory efficiently through sparse KV caching
Comparison to naive autoregressive:
- Higher per-frame quality through block diffusion
- Smoother transitions via noise scheduling
- Better long range coherence through semantic retrieval
Technical Limitations
Chunk boundary artifacts: Despite noise shuffling, attentive viewers may still detect subtle transitions between blocks under careful examination.
Memory requirements: The semantic KV cache demands substantial memory for very long (multi-minute) generations.
Computational cost: Denoising an entire chunk is more expensive than generating a single frame, trading speed for quality.
Training complexity: Block forcing and progressive noise scheduling increase training difficulty compared to standard diffusion.
Research Team
A collaboration between:
- Zhejiang University
- Hong Kong University of Science and Technology
- Alibaba DAMO Academy
- Alibaba TRE
Key contributors: Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang
Future Directions
Planned features:
- Complex KV management enhancements
- Support for fine-tuning pretrained models into semi-AR form
- Model distillation into fewer steps
- High-concurrency deployment
- Advanced real-time streaming
- Interactive world models
- Enhanced simulation capabilities
- Persistent world state management
The roadmap emphasizes world simulation over simple video generation, with the goal of enabling persistent, interactive environments for gaming and embodied AI.
Implementation Guide
Installation: Detailed instructions are in the Inferix repository. A standard deep learning stack is required: PyTorch, CUDA, and appropriate GPU hardware.
Model integration: Multiple base models are supported; Wan-1.3B is provided as a widely used foundation. Custom models integrate through a defined pipeline interface.
Configuration: YAML/JSON configs cover model parameters, generation settings, KV cache management, and streaming options.
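The exact schema lives in the repository; as a purely hypothetical illustration (every field name below is invented, not the real Inferix schema), such a config might carry entries like:

```python
# Purely hypothetical config sketch -- field names are invented for
# illustration and do NOT reflect the actual Inferix schema.
config = {
    "model": {"name": "Wan-1.3B", "precision": "fp8"},        # INT8/FP8 supported
    "generation": {"num_chunks": 15, "frames_per_chunk": 16,
                   "resolution": [480, 832]},
    "kv_cache": {"strategy": "semantic_sparse", "top_k": 2},  # sparse retrieval
    "streaming": {"protocol": "rtmp",                         # or "webrtc"
                  "url": "rtmp://localhost/live/demo"},
}
```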
Example workflows:
- Self Forcing configuration
- CausVid setup
- MAGI-1 integration
- Custom model templates
Getting Started
- Clone Inferix repository
- Follow installation guide for dependencies
- Download or integrate desired model
- Configure generation parameters
- Run example scripts to verify setup
- Customize for specific use case
Documentation includes model integration guide, example configurations, and pipeline implementation details.
Evaluation Methodology
LV-Bench provides a standardized assessment for minute-long generation:
Metrics suite:
- 5 VDE metrics (clarity, motion, aesthetic, background, subject)
- 5 complementary VBench metrics
- Granular temporal analysis
- Drift quantification
Usage: Evaluate generated videos against the benchmark and compare VDE scores across methods; lower VDE values indicate better temporal consistency.
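In practice, that comparison reduces to a lower-is-better ranking per metric; a trivial sketch with made-up scores:

```python
# Toy comparison of VDE scores across two methods. All numbers are
# made-up placeholders, not measured results.
vde_scores = {
    "method_a": {"clarity": 0.12, "motion": 0.08, "subject": 0.15},
    "method_b": {"clarity": 0.21, "motion": 0.11, "subject": 0.09},
}
for metric in ("clarity", "motion", "subject"):
    winner = min(vde_scores, key=lambda m: vde_scores[m][metric])
    print(f"VDE-{metric}: best (lowest drift) -> {winner}")
```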
Broader Context
BlockVid addresses a fundamental challenge in video generation: maintaining coherence over extended durations. Most methods generate clips only seconds long; minute-long sequences introduce qualitatively different challenges.
Key innovations:
- Semi-AR architecture bridges diffusion and autoregressive benefits
- Semantic retrieval prevents information dilution
- Block forcing stabilizes training for long sequences
- LV-Bench enables standardized long-form evaluation
The research advances the state of the art in extended video generation while providing infrastructure (Inferix) and benchmarks (LV-Bench) for community development.
Sources:
- BlockVid Project Page: https://ziplab.co/BlockVid/
- Inferix GitHub Repository: https://github.com/alibaba-damo-academy/Inferix
- Technical Report (arXiv): https://arxiv.org/abs/2511.22973
- Inferix Technical Report (arXiv): https://arxiv.org/abs/2511.20714
- LV-Bench Dataset: https://huggingface.co/datasets/heyuanyu/LV-Bench


