LongLive-RAG: Open Source Long Video Generation with Retrieval Memory

June 5, 2026

Updated: July 19, 2026

Researchers published LongLive-RAG on arXiv on June 1, 2026, releasing code and weights under Apache 2.0 for commercial use. The framework addresses a specific failure mode in autoregressive long video generation: the longer the sequence, the more subject appearance, background details, and scene continuity deteriorate. LongLive-RAG attacks that problem by giving the model access to its own generation history as a searchable memory.

The Identity Drift Problem

Autoregressive video generation builds sequences frame by frame, with each step conditioning on a recent window of previous frames. That sliding-window approach works well for short clips. Over longer sequences, small errors compound at each step and push the output progressively further from the original subject appearance and scene layout.

Helios and similar long video models address duration at the architecture level, building models that can sustain coherent generation for longer windows. LongLive-RAG takes a different approach. Rather than redesigning the architecture, it adds a retrieval layer on top of existing autoregressive models that continuously corrects for drift by referencing what the model already generated.

How the Retrieval Works

At each generation step, LongLive-RAG stores the newly generated latent in a growing history pool. A lightweight latent encoder then maps the current latent to a query embedding and retrieves the top K most relevant historical latents by cosine similarity. Those retrieved latents are injected as additional context, providing implicit error correction before the next frame is generated.

The mechanism works like a director of photography (DP) reviewing dailies between shooting days. Rather than relying only on the most recent setup to maintain consistency, the model can pull from any point in the generation history where the relevant element (a face, a background object, a lighting condition) appeared clearly. That non-local reference prevents the accumulation of small errors that would otherwise compound into obvious drift.
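The store-then-retrieve step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the `HistoryPool` class and its method names are hypothetical, the latent encoder is skipped (embeddings are passed in directly as short lists), and the pool keeps embeddings keyed by step index rather than full latents.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class HistoryPool:
    """Hypothetical growing store of (step, embedding) pairs."""
    entries: list = field(default_factory=list)

    def add(self, step, embedding):
        # Called once per generation step with that step's embedding.
        self.entries.append((step, list(embedding)))

    def retrieve(self, query, k):
        """Return the step indices of the k entries most similar to the query."""
        scored = [(cosine_similarity(query, emb), step) for step, emb in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [step for _, step in scored[:k]]


pool = HistoryPool()
pool.add(0, [1.0, 0.0, 0.0])
pool.add(1, [0.0, 1.0, 0.0])
pool.add(2, [0.9, 0.1, 0.0])
pool.add(3, [0.0, 0.0, 1.0])

# The query is closest to steps 0 and 2, even though step 3 is the most recent.
top_steps = pool.retrieve(query=[1.0, 0.05, 0.0], k=2)
print(top_steps)  # [0, 2]
```

In a real pipeline the retrieved entries would be passed to the backbone as extra context; how that injection happens is backbone-specific and is not shown here.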

Self-Forcing baseline vs. LongLive-RAG, clip 97

Self-Forcing baseline vs. LongLive-RAG, clip 105

Window Temporal Delta Loss

The retrieval mechanism depends on the latent encoder producing meaningfully different embeddings for frames that are far apart in time. Standard encoders collapse this distinction. Adjacent video latents are nearly identical, so the encoder learns to treat temporal proximity as the dominant similarity signal. That makes the retrieved "history" useless, because neighboring frames score just as high as the distant, relevant ones.

LongLive-RAG solves this with the Window Temporal Delta Loss. The loss function explicitly suppresses local similarity within a short temporal window while preserving it across longer distances, forcing the encoder to learn a search space where non-local context can actually be distinguished from recent noise. The result is a retrieval system that finds the right reference frame rather than the nearest one.
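The article does not give the loss formula, so the following is only a toy illustration of the general idea of penalizing similarity inside a short temporal window. The function name `local_similarity_penalty`, the clamp at zero, and the averaging are all assumptions made for this sketch; the actual Window Temporal Delta Loss is defined in the paper and repository and may differ substantially.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_similarity_penalty(embeddings, window):
    """Mean positive cosine similarity over pairs at most `window` steps apart.

    Hypothetical stand-in for a loss term: pairs farther apart than `window`
    contribute nothing, so long-range similarity is left untouched.
    """
    total, count = 0.0, 0
    last = len(embeddings) - 1
    for i in range(len(embeddings)):
        for j in range(i + 1, min(i + window, last) + 1):
            total += max(0.0, cosine_similarity(embeddings[i], embeddings[j]))
            count += 1
    return total / count if count else 0.0


# Neighbors that collapse to the same embedding are penalized fully.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Neighbors that stay distinct incur no penalty.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(local_similarity_penalty(collapsed, window=1))  # 1.0
print(local_similarity_penalty(spread, window=1))     # 0.0
```

Minimizing a term like this pushes temporally adjacent embeddings apart, which is the property the section describes; it says nothing about how the real loss balances that against long-range similarity.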

History Pool Size and the Retrieval Latency Tradeoff

At 120 seconds and 24 FPS, the history pool accumulates 2,880 generated latents. Retrieving the top K most relevant from 2,880 candidates requires computing cosine similarity across the full pool at each step.

The Window Temporal Delta Loss training ensures the encoder produces embeddings that are meaningfully different across time, which is what makes retrieval from a pool that size useful rather than returning only the most recent frames. Without that loss, the pool would be a flat similarity surface and retrieval would degrade to recency weighting, which is functionally the same as the sliding window the method is meant to replace.

Growing the pool also grows the retrieval compute cost at each step. LongLive-RAG's encoder is deliberately lightweight to keep the overhead manageable. The tradeoff the researchers made is a lower-capacity encoder that is fast enough to query at every generation step, versus a higher-capacity encoder that would require selective querying and introduce latency spikes.
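The pool-size arithmetic above is easy to check, and it also shows how the total retrieval work grows with sequence length. The sketch below assumes one stored latent per frame (consistent with the 2,880 figure) and a brute-force scan of the whole pool at every step, including the latent just stored; the function names are illustrative.

```python
def pool_size(seconds, fps=24):
    """Number of latents stored after `seconds` of generation, one per frame."""
    return seconds * fps


def total_comparisons(n_steps):
    """Similarity computations over a full run if every step scans the whole pool.

    Step t (1-indexed) scans the t latents stored so far, so the total is
    1 + 2 + ... + n_steps.
    """
    return n_steps * (n_steps + 1) // 2


for seconds in (30, 60, 120):
    n = pool_size(seconds)
    print(seconds, n, total_comparisons(n))
# 30 720 259560
# 60 1440 1037520
# 120 2880 4148640
```

Under these assumptions the per-step cost grows linearly with the pool, while the cumulative cost grows roughly quadratically: doubling the clip length about quadruples the total number of similarity computations.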

Three Backbone Options

LongLive-RAG works as a layer added on top of three existing autoregressive video generation backbones: Causal-Forcing, Self-Forcing, and LongLive. All three can run either with their native sliding window or with LongLive-RAG's retrieval system added on top, creating six total inference configurations.
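The six configurations are simply the cross product of three backbones and two memory modes. The snippet below enumerates them; the dictionary keys and the mode labels are illustrative names chosen for this sketch, not flags or identifiers from the repository.

```python
from itertools import product

BACKBONES = ("Causal-Forcing", "Self-Forcing", "LongLive")
# Illustrative labels: native sliding window vs. retrieval added on top.
MEMORY_MODES = ("sliding-window", "LongLive-RAG")

configurations = [
    {"backbone": backbone, "memory": memory}
    for backbone, memory in product(BACKBONES, MEMORY_MODES)
]

for config in configurations:
    print(f"{config['backbone']} + {config['memory']}")
```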

NVIDIA LongLive is one of those supported backbones: a frame-level autoregressive model built for real-time interactive generation at roughly 20.7 FPS on a single H100. LongLive-RAG adds the retrieval layer without modifying the backbone, so the real-time properties of the base model are preserved.

LongLive baseline vs. LongLive-RAG, clip 9

What Causal-Forcing and Self-Forcing Do Differently

Causal-Forcing conditions each new frame on all previous frames in the context window. Self-Forcing uses a learned conditioning approach where the model is trained to generate without teacher forcing, reducing the gap between training-time and inference-time inputs.

LongLive-RAG adds retrieval-based correction as a layer above all three, which means the gains it provides depend on how much drift the underlying backbone produces at long horizons. Backbones that already maintain strong local consistency benefit less from retrieval correction than backbones with higher baseline drift rates. The benchmark results confirm this: the improvement is largest for the configuration that starts with the most drift.

The Six Configuration Comparison

Running each of the three backbones with and without LongLive-RAG gives six inference configurations. VBench-Long results across all six show that LongLive-RAG improves every backbone it is applied to.

The best overall configuration is LongLive with LongLive-RAG, which combines a backbone already designed for long generation with the retrieval correction layer on top. That configuration achieves the best average rank on VBench-Long across all three evaluation horizons. The second-best configuration is LongLive alone, which suggests that backbone quality matters as much as the retrieval method.

Benchmark Results

LongLive-RAG achieves the best average rank on VBench-Long across 30-second, 60-second, and 120-second generation horizons, evaluated against baseline configurations for all three backbones. The gains appear across four quality dimensions: subject consistency, background consistency, motion smoothness, and overall imaging quality.
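For readers unfamiliar with the metric, an average rank can be computed as follows. This is a generic sketch: the scores are made-up placeholders, not results from the paper, the method names are anonymous on purpose, and it assumes a single score per horizon where higher is better. The paper's exact ranking procedure (for example, ranking per quality dimension) may differ.

```python
def average_ranks(scores_by_horizon):
    """Average each method's rank (1 = best score) across horizons.

    Ties are not handled specially; tied methods get ranks in sorted order.
    """
    totals = {}
    for scores in scores_by_horizon.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n_horizons = len(scores_by_horizon)
    return {method: total / n_horizons for method, total in totals.items()}


# Placeholder values for illustration only.
placeholder_scores = {
    "30s": {"method_a": 0.80, "method_b": 0.82, "method_c": 0.78},
    "60s": {"method_a": 0.79, "method_b": 0.77, "method_c": 0.74},
    "120s": {"method_a": 0.78, "method_b": 0.70, "method_c": 0.71},
}

ranks = average_ranks(placeholder_scores)
best = min(ranks, key=ranks.get)
print(best, ranks[best])
```

In this toy data, `method_a` is not the best at 30 seconds but holds up better at longer horizons, which is enough to give it the lowest average rank.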

Performance holds at longer horizons, which is where sliding window baselines degrade most sharply. The gap between LongLive-RAG and the baseline versions widens as generation length increases, which is the pattern that confirms the retrieval mechanism is doing useful work at the points where it is most needed.

The 120-second evaluation horizon is the most practically relevant test because it corresponds to full scene length in production footage. A method that holds consistency at 120 seconds can be applied to previs sequences, music video treatments, and short film rough cuts without editing the output into sub-segments to hide drift.

VBench-Long vs. Standard VBench

Standard VBench evaluates generation quality on short clips, typically under 10 seconds. VBench-Long evaluates at 30-second, 60-second, and 120-second horizons. The gap between a model's standard VBench score and its VBench-Long score measures how quickly it degrades at longer durations.

LongLive-RAG's benchmark claim is specifically that it reduces this degradation, not that it improves peak quality on short clips. A production that needs only 5-second clips does not benefit from LongLive-RAG. A production building sequences of 60 seconds or longer, whether for previsualization, scene rough cuts, or long-form generation, is the direct target.

Previsualization as the Primary Use Case

A 120-second AI-generated sequence covers a complete dramatic scene from opening to close. For previsualization, the goal is a consistent rough draft that a director can review and approve before committing crew to a location shoot.

LongLive-RAG's identity drift correction means the character in the first second of a previs sequence looks the same as the character in the last second, which is the minimum consistency requirement for the director to evaluate staging and performance before a live shoot. Without that consistency, the previs requires hand correction on every clip, which eliminates the time advantage that AI generation provides.

Previsualization does not need photorealistic output. It needs spatial and temporal consistency so the director can read blocking and performance from the generated material. LongLive-RAG addresses exactly the consistency failures that make previs unreadable, without requiring a change to the generation model itself.

What Filmmakers Can Do With It

The practical case for LongLive-RAG is sequence length without sacrificing consistency. A 120-second AI-generated sequence is roughly the length of a feature film scene. Maintaining consistent subject appearance, wardrobe details, and background layout across that duration is what separates usable footage from footage that requires continuous hand-correction.

The framework's modular design means it can be layered on top of whichever autoregressive backbone already fits the production workflow. The model runs via the standard inference scripts in the GitHub repository. Weights for all six configurations download from HuggingFace. Prerequisites are Python 3.10 or higher and compatible CUDA hardware.
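Before downloading weights, it can help to confirm the prerequisites. The check below is a small sketch rather than part of the project: it verifies the Python version and, on the assumption that the code base runs on PyTorch (the article does not say so explicitly), asks PyTorch whether a CUDA device is visible. The `check_environment` helper is hypothetical.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)


def check_environment():
    """Report whether the interpreter and, if detectable, CUDA look usable."""
    report = {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "cuda_available": None,  # None means "could not check"
    }
    # Assumption: the project uses PyTorch. Without torch installed, the CUDA
    # status stays None instead of guessing.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

The repository's own installation instructions remain the authority on exact versions and dependencies.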

The Apache 2.0 license permits commercial use, subject to its standard conditions. Productions that generate long sequences for client delivery, broadcast, or theatrical use can deploy LongLive-RAG without licensing fees. The main obligations when redistributing the code or weights are to include the license text, retain existing copyright notices, and mark modified files. That puts it in the same commercial deployment tier as other major open source generation tools released under permissive licenses in 2025 and 2026.

The open license also means the retrieval architecture can be studied and adapted. The Window Temporal Delta Loss training approach is the core contribution, and it is available in full in the GitHub repository for productions that want to apply the same principle to other generation architectures.

The Open Source Long Video Research Cluster

LongLive-RAG builds on NVIDIA LongLive and complements Helios. All three address the same problem of sustained coherence in autoregressive video generation from different angles. NVIDIA LongLive focuses on real-time interactive generation speed. Helios targets duration through architecture redesign. LongLive-RAG targets the drift correction problem by adding memory without changing the backbone.

The three represent a research cluster around long video that filmmakers working with open source tools can draw from simultaneously. A production workflow could use a Helios backbone for duration, LongLive-RAG for consistency correction, and NVIDIA LongLive for the interactive real-time generation stages where human review happens between generation steps. They are not competing solutions to the same problem but complementary tools addressing different parts of the same generation pipeline.

The pace of publication in this cluster is relevant to production planning. LongLive-RAG arrived four months after NVIDIA LongLive, which arrived months after Helios. Each release added a capability the previous one lacked. A production building a long video pipeline today on any of these tools should expect another release in the same research thread within six months that improves on what is currently available. Building on the modular architecture that LongLive-RAG demonstrates, adding new modules on top rather than replacing the base model, is the approach most likely to remain compatible with the next iteration.

For text-to-video and image-to-video generation in a production workspace, AI FILMS Studio provides access to the latest video generation models without local setup.

Sources

GitHub: qixinhu11/LongLive-RAG
HuggingFace: qixinhu11/LongLive-RAG
arXiv: LongLive-RAG: Retrieval-Augmented Long-form Video Generation
Project page: longlive-rag.github.io
License: Apache 2.0 (commercial use permitted)
