MisoTTS: Open Weight 8B Voice Model with Voice Cloning

June 6, 2026

On June 3, 2026, Miso Labs released MisoTTS, an 8 billion parameter voice generation model, publishing the weights to HuggingFace and the inference code to GitHub on the day of release. The model ships under a modified MIT license that permits commercial use without attribution for all but the largest deployments.

What MisoTTS Does

MisoTTS generates speech from text, with an optional audio context input for voice cloning. Given a short reference clip, the model replicates that voice's register, pacing, and tonal character for subsequent output. It also supports voice continuation, producing audio that extends naturally from a provided recording.

[Figure: MisoTTS architecture overview showing the text to speech pipeline and audio codec layers. Image: Miso Labs]

The model is English only at launch. Miso Labs describes its design aim as "emotive, human sounding" output, with variation in tonal register and pacing rather than flat robotic speech. Audio domains demonstrated in the official blog include conversational dialogue, sports commentary, explanatory narration, and therapeutic register, each requiring a distinct vocal quality that the model produces without fine tuning.

Architecture and Hardware Requirements

MisoTTS uses a text to dialogue residual vector quantization Transformer architecture drawing on the Sesame CSM design. The backbone is an 8 billion parameter Llama 3.2 style language model paired with a 300 million parameter audio decoder. Output passes through the Mimi audio codec across 32 codebooks, which Miso Labs states enables a wider sonic range than lower codebook approaches.
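To make the codebook claim concrete, here is a minimal toy sketch of residual vector quantization in plain Python. It is not the Mimi codec or MisoTTS code: the codebooks are random rather than learned, and the vector size, codebook size, and all function names are assumptions chosen for illustration. It only shows the general idea that each extra codebook quantizes what the previous stages left over, so more codebooks can describe a frame more finely.

```python
import random


def make_codebooks(num_codebooks, codebook_size, dim, seed=0):
    """Build toy codebooks whose entries shrink at each stage (coarse to fine)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_codebooks):
        scale = 0.5 ** stage
        # Entry 0 is the zero vector, so no stage can increase the residual.
        entries = [[0.0] * dim]
        for _ in range(codebook_size - 1):
            entries.append([rng.uniform(-1.0, 1.0) * scale for _ in range(dim)])
        codebooks.append(entries)
    return codebooks


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(vector, codebooks):
    """Return one codeword index per codebook by quantizing successive residuals."""
    residual = list(vector)
    indices = []
    for entries in codebooks:
        best = min(range(len(entries)),
                   key=lambda i: squared_distance(residual, entries[i]))
        indices.append(best)
        residual = [r - e for r, e in zip(residual, entries[best])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codeword from each stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for index, entries in zip(indices, codebooks):
        out = [o + e for o, e in zip(out, entries[index])]
    return out


def reconstruction_error(vector, codebooks):
    decoded = rvq_decode(rvq_encode(vector, codebooks), codebooks)
    return squared_distance(vector, decoded)


# Hypothetical sizes: 32 stages of 16 entries each over 4-dimensional vectors.
all_codebooks = make_codebooks(num_codebooks=32, codebook_size=16, dim=4)
frame = [0.31, -0.72, 0.05, 0.54]  # stand-in for one latent audio frame

error_8 = reconstruction_error(frame, all_codebooks[:8])
error_32 = reconstruction_error(frame, all_codebooks)
print(f"8 codebooks:  {error_8:.3e}")
print(f"32 codebooks: {error_32:.3e}")
```

Because every toy codebook contains the zero vector, adding stages can never raise the reconstruction error in this sketch. A real codec learns its codebooks from audio, so the size of the gain depends on training rather than on this construction.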

[Figure: MisoTTS benchmark comparisons and voice sample output across different audio domains. Image: Miso Labs]

Miso Labs reports 110ms inference latency on an NVIDIA GPU, per the official blog published June 3, 2026. Running at full precision requires approximately 32GB of VRAM. The bfloat16 checkpoint halves that requirement to approximately 16GB, suited for RTX 3090 or RTX 4090 class hardware. An API is announced as coming soon but was not available at launch.
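The VRAM figures follow from simple arithmetic on the parameter count. The sketch below assumes "full precision" means 32-bit floats at 4 bytes per parameter and that bfloat16 uses 2 bytes per parameter; it counts weights only, so activations and caches would come on top. The function and constant names are illustrative.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


BACKBONE_PARAMS = 8_000_000_000  # 8B backbone, from the article
DECODER_PARAMS = 300_000_000     # 300M audio decoder, from the article

for dtype in ("float32", "bfloat16"):
    backbone = weight_memory_gb(BACKBONE_PARAMS, dtype)
    total = weight_memory_gb(BACKBONE_PARAMS + DECODER_PARAMS, dtype)
    print(f"{dtype}: backbone {backbone:.1f} GB, with decoder {total:.1f} GB")
# float32: backbone 32.0 GB, with decoder 33.2 GB
# bfloat16: backbone 16.0 GB, with decoder 16.6 GB
```

The backbone alone reproduces the quoted 32GB and 16GB figures. On a 24GB card such as the RTX 3090 or RTX 4090, the bfloat16 weights would leave several gigabytes for everything else.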

License and Commercial Use

MisoTTS is released under a modified MIT license. Commercial use is permitted without attribution for organizations below 50 million monthly active users and $10 million in monthly revenue. Above those thresholds, attribution is required. For independent filmmakers, studios, and developers under those thresholds, the model is free to use commercially and to integrate into production pipelines without licensing fees or agreements.
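The threshold rule can be written as a small check. This is a sketch of the rule as summarized here, not of the license text itself: the summary does not say whether reaching one threshold or both triggers the attribution requirement, so the function below assumes the conservative reading that either one is enough. The function name and the inclusive comparison are also assumptions.

```python
MAU_THRESHOLD = 50_000_000          # monthly active users, from the article
REVENUE_THRESHOLD_USD = 10_000_000  # monthly revenue in USD, from the article


def attribution_required(monthly_active_users, monthly_revenue_usd):
    """Conservative reading (an assumption): either threshold triggers attribution."""
    return (monthly_active_users >= MAU_THRESHOLD
            or monthly_revenue_usd >= REVENUE_THRESHOLD_USD)


# A small studio well under both limits.
print(attribution_required(20_000, 15_000))
# A large platform over the user limit.
print(attribution_required(60_000_000, 2_000_000))
```

Anyone close to either number should check the license file in the repository rather than rely on this reading.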

Weights are downloadable from HuggingFace alongside a HuggingFace Spaces demo from the multimodalart team. Inference code and setup instructions are on GitHub under the same modified MIT license terms as the weights.

The same week, the OpenMOSS team released MOSS-Audio, an open weight audio understanding model that reasons about existing audio content through tasks such as transcription, speaker identification, and acoustic event detection. MisoTTS generates speech; MOSS-Audio analyzes it. The two cover opposite ends of the audio production pipeline.

Filmmakers working on voice production can explore AI voice and sound generation in the AI FILMS Studio voice workspace and sound workspace.


Sources

Project page: misolabs.ai/blog/miso-tts-8b
GitHub: MisoLabsAI/MisoTTS
HuggingFace model: MisoLabs/MisoTTS
HuggingFace demo: multimodalart/MisoTTS
License: Modified MIT (commercial use without attribution permitted below the 50M MAU / $10M MRR thresholds)