LongLive-RAG: Open Source Long Video Generation with Retrieval Memory

June 5, 2026

Researchers published LongLive-RAG on arXiv on June 1, 2026, releasing code and weights under the Apache 2.0 license, which permits commercial use. The framework addresses a specific failure mode in autoregressive long video generation: the longer the sequence, the more subject appearance, background details, and scene continuity deteriorate. LongLive-RAG attacks that problem by giving the model access to its own generation history as a searchable memory.

The Identity Drift Problem

Autoregressive video generation builds sequences frame by frame, with each step conditioning on a recent window of previous frames. That sliding-window approach works well for short clips. Over longer sequences, small errors compound at each step and push the output progressively further from the original subject appearance and scene layout.

Helios and similar long video models address duration at the architecture level, building models that can sustain coherent generation for longer windows. LongLive-RAG takes a different approach: rather than redesigning the architecture, it adds a retrieval layer on top of existing autoregressive models that continuously corrects for drift by referencing what the model already generated.

How the Retrieval Works

At each generation step, LongLive-RAG stores the newly generated latent in a growing history pool. A lightweight latent encoder then maps the current latent to a query embedding and retrieves the top K most relevant historical latents by cosine similarity. Those retrieved latents are injected as additional context, providing implicit error correction before the next frame is generated.
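
The store-then-retrieve loop described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the `HistoryPool` class, its method names, and the identity `encode` function are all assumptions made for the example, and plain Python lists stand in for latent tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


class HistoryPool:
    """Hypothetical growing store of (embedding, latent) pairs."""

    def __init__(self, encode):
        self.encode = encode  # stand-in for the lightweight latent encoder
        self.embeddings = []
        self.latents = []

    def add(self, latent):
        """Store a newly generated latent together with its embedding."""
        self.embeddings.append(self.encode(latent))
        self.latents.append(latent)

    def retrieve(self, current_latent, k):
        """Return indices of the k stored latents most similar to the query."""
        query = self.encode(current_latent)
        scores = [cosine(query, e) for e in self.embeddings]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]


# Identity encoder, for illustration only.
pool = HistoryPool(encode=lambda latent: latent)
for latent in ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]):
    pool.add(latent)

top = pool.retrieve([1.0, 0.0], k=2)
print(top)  # indices of the two closest stored latents
```

In this toy run the query matches the first and third stored latents most closely, so those are the ones that would be passed along as extra context. A real pool would hold high-dimensional latents and would likely use a batched similarity search rather than a Python loop.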

The mechanism works like a DP reviewing dailies between shooting days. Rather than relying only on the most recent setup to maintain consistency, the model can pull from any point in the generation history where the relevant element (a face, a background object, a lighting condition) appeared clearly. That non-local reference prevents the accumulation of small errors that would otherwise compound into obvious drift.

Self-Forcing baseline vs. LongLive-RAG, clip 97

Self-Forcing baseline vs. LongLive-RAG, clip 105

Window Temporal Delta Loss

The retrieval mechanism depends on the latent encoder producing meaningfully different embeddings for frames that are close together in time. Standard encoders fail here: adjacent video latents are nearly identical, so temporal proximity ends up dominating the similarity score. That makes the retrieved "history" useless, because neighboring frames score just as high as the distant, relevant ones.

LongLive-RAG solves this with the Window Temporal Delta Loss. The loss function explicitly suppresses local similarity within a short temporal window while preserving it across longer distances, forcing the encoder to learn a search space where non-local context can actually be distinguished from recent noise. The result is a retrieval system that finds the right reference frame rather than the nearest one.
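
To make the idea concrete, here is a toy penalty in the same spirit. It is not the paper's loss: the function name `local_similarity_penalty`, the choice of clamped cosine similarity, and the averaging are all assumptions made for illustration. It only penalizes similarity between embeddings that sit within `window` steps of each other and leaves more distant pairs alone.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_similarity_penalty(embeddings, window):
    """Mean positive cosine similarity over pairs at most `window` steps apart.

    Hypothetical stand-in for a window-based loss: pairs farther apart
    than `window` contribute nothing, so their similarity is preserved.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            total += max(0.0, cosine(embeddings[i], embeddings[j]))
            count += 1
    return total / count if count else 0.0


# Every frame maps to the same embedding: neighbors are indistinguishable.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Neighbors differ, while frames 0 and 2 (outside the window) still match.
separated = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(local_similarity_penalty(collapsed, window=1))
print(local_similarity_penalty(separated, window=1))
```

The collapsed sequence receives the maximum penalty, while the separated one receives none even though its first and third embeddings are identical. That is the property the paragraph above describes: local look-alikes are pushed apart without discouraging long-range matches.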

Three Backbone Options

LongLive-RAG works as a plug-in addition to three existing autoregressive video generation backbones: Causal-Forcing, Self-Forcing, and LongLive. All three can run either with their native sliding window or with LongLive-RAG's retrieval system added on top, creating six total inference configurations.
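
The six configurations are simply every backbone paired with each memory mode. The snippet below enumerates them; the backbone names come from the text above, while the mode labels `"sliding-window"` and `"retrieval"` are placeholder strings chosen for this example, not flags from the repository.

```python
from itertools import product

BACKBONES = ("Causal-Forcing", "Self-Forcing", "LongLive")
MEMORY_MODES = ("sliding-window", "retrieval")  # hypothetical labels

# Every backbone can run with either memory mode.
configs = list(product(BACKBONES, MEMORY_MODES))

for backbone, mode in configs:
    print(f"{backbone} + {mode}")
```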

NVIDIA LongLive is one of those supported backbones: a frame-level autoregressive model built for real-time interactive generation at roughly 20.7 FPS on a single H100. LongLive-RAG adds the retrieval layer without modifying the backbone, so the real-time properties of the base model are preserved.

LongLive baseline vs. LongLive-RAG, clip 9

Benchmark Results

LongLive-RAG achieves the best average rank on VBench-Long across 30-second, 60-second, and 120-second generation horizons, evaluated against baseline configurations for all three backbones. The gains appear across four quality dimensions: subject consistency, background consistency, motion smoothness, and overall imaging quality. Performance holds at longer horizons, which is where sliding window baselines degrade most sharply.

What Filmmakers Can Do With It

The practical case for LongLive-RAG is sequence length without sacrifice. A 120-second AI-generated sequence is roughly the length of a feature film scene. Maintaining consistent subject appearance, wardrobe details, and background layout across that duration is what separates usable footage from footage that requires continuous hand-correction.

The framework's plug-in design means it can be layered on top of whichever autoregressive backbone already fits the production workflow. For previsualization work, where the goal is a consistent rough cut across multiple beats rather than a single polished short clip, retrieval-based consistency correction addresses the most common failure point in current open-source long video generation.

The model runs via the standard inference scripts in the GitHub repository. Weights for all six configurations are available for download from HuggingFace. Prerequisites are Python 3.10 or higher and compatible CUDA hardware. For text-to-video and image-to-video generation in a production workspace, AI FILMS Studio provides access to the latest video generation models without local setup.


Sources

GitHub: qixinhu11/LongLive-RAG
HuggingFace: qixinhu11/LongLive-RAG
arXiv: LongLive-RAG: Retrieval-Augmented Long-form Video Generation
Project page: longlive-rag.github.io
License: Apache 2.0 (commercial use permitted)