
LongCat-Video-Avatar 1.5: Open Source Lip Sync with Whisper-Large and 8 Step Inference

May 21, 2026

LongCat-Video-Avatar 1.5 ships two changes that matter: a full audio encoder swap from Wav2Vec2 to Whisper-Large, and step distillation that cuts inference from 20 steps to 8. Both are available under an MIT License, which permits commercial use without royalties, provided the copyright and license notice stay with the code.

LongCat-Video-Avatar 1.5 overview

What Changed From 1.0

The original LongCat-Video-Avatar used Wav2Vec2 as its speech encoder. Wav2Vec2 was trained primarily on English audio, which limited accuracy on other languages. Version 1.5 replaces it with Whisper-Large, which OpenAI trained on 680,000 hours of multilingual and multitask audio and which supports 99 languages.
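To make the encoder swap more concrete, here is a minimal sketch of how Whisper encoder output could be lined up with video frames. The 16 kHz sample rate, 10 ms hop, stride-2 convolution, and 30-second window are properties of Whisper itself. The 25 fps video rate used in the usage note, the `context` window size, and both helper functions are assumptions for illustration, not LongCat's documented pipeline.

```python
import numpy as np

# Whisper front-end constants, taken from the public Whisper implementation.
SAMPLE_RATE = 16_000   # Hz; Whisper expects 16 kHz mono audio
HOP_LENGTH = 160       # samples between log-mel frames (10 ms)
ENCODER_STRIDE = 2     # a stride-2 convolution in the encoder halves the time axis
CHUNK_SECONDS = 30     # audio is padded or trimmed to 30 s windows

MEL_FPS = SAMPLE_RATE // HOP_LENGTH                      # 100 mel frames per second
ENCODER_FPS = MEL_FPS // ENCODER_STRIDE                  # 50 encoder states per second
ENCODER_FRAMES_PER_CHUNK = ENCODER_FPS * CHUNK_SECONDS   # 1500 states per window


def audio_window_for_frame(frame_idx, video_fps, context=2):
    """Return [start, end) encoder indices centred on one video frame.

    Hypothetical helper: `context` (states kept on each side of the centre)
    is an illustrative choice, not a value from the LongCat release.
    """
    center = int(round(frame_idx / video_fps * ENCODER_FPS))
    start = max(0, center - context)
    end = min(ENCODER_FRAMES_PER_CHUNK, center + context + 1)
    return start, end


def gather_audio_features(encoder_states, num_frames, video_fps):
    """Slice one audio window per video frame from a (time, dim) array."""
    return [
        encoder_states[slice(*audio_window_for_frame(i, video_fps))]
        for i in range(num_frames)
    ]
```

At an assumed 25 fps, each video frame lines up with two encoder states, so a small window around the centre gives the video model a short stretch of phonetic context per frame.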

Step distillation reduces the number of diffusion sampling steps from 20 to 8. An INT8-quantized version is also available for lower-memory deployments. The generation modes carry over from 1.0: Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and video continuation from an existing clip.
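As a rough back-of-the-envelope check on what those two changes buy, the sketch below computes the reduction in denoiser passes and the memory needed for the weights alone. It assumes each sampling step costs the same before and after distillation, that the unquantized weights are stored in 16-bit precision, and that quantization covers all 13.6B parameters of the DiT mentioned later in this post. None of those assumptions is confirmed by the release, and activations, the VAE, and the text and audio encoders are not counted.

```python
def step_speedup(steps_before: int, steps_after: int) -> float:
    """Ratio of denoiser forward passes, assuming equal cost per step."""
    return steps_before / steps_after


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    DIT_PARAMS = 13.6e9  # parameter count quoted for the DiT backbone

    print(f"fewer denoiser passes: {step_speedup(20, 8):.1f}x")
    # Assumed 16-bit baseline (2 bytes per parameter).
    print(f"16-bit weights: {weight_memory_gib(DIT_PARAMS, 2):.1f} GiB")
    # INT8 stores 1 byte per parameter, ignoring scale factors.
    print(f"INT8 weights:   {weight_memory_gib(DIT_PARAMS, 1):.1f} GiB")
```

Under these assumptions the step count drops by a factor of 2.5, and the weights shrink from roughly 25 GiB to roughly 13 GiB. Real wall-clock and memory numbers depend on hardware, resolution, and clip length.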

Talking Avatar Samples

Talking avatar sample 1

Talking avatar sample 2

Comparison Against Commercial Models

LongCat-Video-Avatar 1.5 vs. HeyGen, Kling Avatar 2.0, OmniHuman-1.5

The project benchmarks 1.5 against HeyGen, Kling Avatar 2.0, and OmniHuman-1.5. All three are commercial or closed-weights models. LongCat-Video-Avatar 1.5 is the only one in that comparison group with an MIT License.

Why Whisper-Large Matters for Languages Other Than English

Wav2Vec2's primary training data was English. Phoneme alignment for other languages in version 1.0 was serviceable at best. The shift to Whisper-Large changes this.

Whisper-Large supports 99 languages. Its training data is still mostly English (roughly two thirds of the 680,000 hours reported for the original models), but it also includes about 117,000 hours of transcribed speech across 96 other languages, far broader coverage than an English-centric encoder. For filmmakers working on dubbing, localization, or multilingual productions, that is the practical difference between a tool that works on one language and a tool that can fit into a production pipeline. Studios currently pay dubbing houses significant sums for lip-synced localized versions of theatrical releases. A commercially licensed open source model that covers those languages is a potential alternative to that workflow, though quality will likely vary by language, since lower-resource languages have far less training data behind them.

None of the model's documentation makes this angle explicit. It follows from the encoder choice.

Multiple Speaker Generation

Multiple speaker talking avatar generation

Multi-stream audio input lets each speaker's track drive that speaker's lip sync simultaneously. The model resolves which audio stream maps to which face in the frame, enabling scenes with multiple speakers without separate post-processing per character.
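Before handing several tracks to a model like this, it can help to confirm that they are time-aligned and to see who is speaking in each video frame. The sketch below is a hypothetical pre-flight check, not how the model performs its own stream-to-face mapping. The function names, the 25 fps frame rate in the usage note, and the silence threshold are all assumptions.

```python
import numpy as np


def pad_to_common_length(tracks):
    """Zero-pad every mono track to the length of the longest one."""
    longest = max(len(t) for t in tracks.values())
    return {name: np.pad(t, (0, longest - len(t))) for name, t in tracks.items()}


def loudest_speaker_per_frame(tracks, sample_rate, video_fps, silence_rms=1e-3):
    """Name the loudest track in each video frame, or None if all are quiet.

    `tracks` maps a speaker label to a 1-D float array of audio samples.
    `silence_rms` is an illustrative threshold, not a tuned value.
    """
    padded = pad_to_common_length(tracks)
    samples_per_frame = int(round(sample_rate / video_fps))
    num_samples = len(next(iter(padded.values())))
    num_frames = num_samples // samples_per_frame

    result = []
    for f in range(num_frames):
        lo, hi = f * samples_per_frame, (f + 1) * samples_per_frame
        # Root-mean-square energy of each track within this frame's samples.
        rms = {n: float(np.sqrt(np.mean(t[lo:hi] ** 2))) for n, t in padded.items()}
        best = max(rms, key=rms.get)
        result.append(best if rms[best] > silence_rms else None)
    return result
```

With 16 kHz audio and an assumed 25 fps, each frame spans 640 samples. A timeline where the expected speaker never appears, or where two speakers overlap unexpectedly, usually points to misaligned or swapped tracks.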

Animation and Non-Human Subjects

Lip sync on animated characters and animals

LongCat-Video-Avatar 1.5 generalizes beyond photorealistic faces. The model applies lip sync to anime characters and animals, which expands its application range to animated productions and character work that does not involve human subjects.

LongCat Ecosystem Context

LongCat-Video-Avatar 1.5 builds on the same architecture documented in the LongCat Video Avatar guide, which covers the 13.6B DiT framework, Reference Skip Attention, and Cross-Chunk Latent Stitching. The 1.5 release focuses on the audio encoder upgrade and inference speed, not changes to the underlying generation architecture.

For extended duration generation without avatar constraints, LongCat Video's 15-minute coherent generation model addresses temporal consistency across long sequences.


Sources

Project Page: LongCat-Video-Avatar 1.5
GitHub: meigen-ai/LongCat-Video-Avatar
Hugging Face: meigen-ai/LongCat-Video-Avatar-1.5