MisoTTS: Open Weight 8B Voice Model with Voice Cloning
On June 3, 2026, Miso Labs released MisoTTS, an 8 billion parameter voice generation model, and published it to HuggingFace and GitHub the same day. The model ships under a modified MIT license that permits commercial use without attribution for the vast majority of deployments.
What MisoTTS Does
MisoTTS generates speech from text, with an optional audio context input for voice cloning. Given a short reference clip, the model replicates that voice's register, pacing, and tonal character for subsequent output. It also supports voice continuation, producing audio that extends naturally from a provided recording.
The model is English only at launch. Miso Labs describes its design aim as "emotive, human sounding" output, with variation in tonal register and pacing rather than flat robotic speech. Audio domains demonstrated in the official blog include conversational dialogue, sports commentary, explanatory narration, and therapeutic register, each requiring a distinct vocal quality that the model produces without fine tuning.
Architecture and Hardware Requirements
MisoTTS uses a text-to-dialogue Transformer architecture built on residual vector quantization (RVQ), drawing on the Sesame CSM design. The backbone is an 8 billion parameter Llama 3.2 style language model paired with a 300 million parameter audio decoder. Output passes through the Mimi audio codec across 32 codebooks, which Miso Labs states enables a wider sonic range than approaches that use fewer codebooks.
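To give a feel for why the number of codebooks matters in residual vector quantization, here is a toy sketch in plain Python. It is not the Mimi codec or MisoTTS code: the codebooks are random rather than learned, the vector is 4-dimensional rather than an audio frame, and all function names (`rvq_encode`, `rvq_decode`, `make_codebooks`) are hypothetical. Each stage quantizes whatever error the previous stages left behind, so adding stages can only keep or shrink the leftover residual.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(num_stages, entries=16, dim=4, seed=0):
    """Toy random codebooks whose scale halves at each stage (an assumption)."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        book = [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(entries)]
        # A zero entry lets a stage "pass", so it never increases the error.
        book[0] = [0.0] * dim
        books.append(book)
    return books

if __name__ == "__main__":
    books = make_codebooks(32)
    frame = [0.3, -0.7, 0.1, 0.5]
    for stages in (8, 32):
        _, residual = rvq_encode(frame, books[:stages])
        err = sum(r * r for r in residual)
        print(f"{stages} codebooks: squared error {err:.3e}")
```

In a real codec the codebooks are trained on audio, so the gain from extra stages depends on the data; the sketch only shows the mechanism of stacking quantizers on residuals.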
Miso Labs reports 110ms inference latency on an NVIDIA GPU, per the official blog published June 3, 2026. Running at full precision requires approximately 32GB of VRAM. The bfloat16 checkpoint halves that requirement to approximately 16GB, suited for RTX 3090 or RTX 4090 class hardware. An API is announced as coming soon but was not available at launch.
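The VRAM figures follow from simple arithmetic on the parameter count. The sketch below assumes that "full precision" means 32-bit floats (4 bytes per parameter) and that bfloat16 uses 2 bytes per parameter; it counts only the 8 billion backbone weights and ignores the 300 million parameter decoder, activations, and any caches, so real usage will be somewhat higher. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each format (assumed mapping).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

BACKBONE_PARAMS = 8_000_000_000  # backbone size reported by Miso Labs

if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(BACKBONE_PARAMS, dtype)
        print(f"{dtype}: ~{gb:.0f} GB")
    # float32: ~32 GB
    # bfloat16: ~16 GB
```

A 16GB weight footprint leaves some headroom on a 24GB card such as the RTX 3090 or RTX 4090, which is consistent with the hardware class named above.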
License and Commercial Use
MisoTTS is released under a modified MIT license. Commercial use is permitted without attribution for organizations below 50 million monthly active users and $10 million in monthly revenue. Above those thresholds, attribution is required. For independent filmmakers, studios, and developers, this means the model is commercially free to use and integrate into production pipelines without licensing fees or agreements.
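The threshold rule can be written as a small check. This is only a sketch of the terms as summarized here, not of the license text itself: it assumes attribution is triggered when either threshold is reached, which the summary does not state explicitly, and the function and constant names are hypothetical. The license text on the repository remains the authority.

```python
# Thresholds as summarized in this article (not quoted from the license text).
MAU_THRESHOLD = 50_000_000
MONTHLY_REVENUE_THRESHOLD_USD = 10_000_000

def attribution_required(monthly_active_users, monthly_revenue_usd):
    """Return True if attribution would be required under the assumed reading.

    Assumption: reaching EITHER threshold triggers the attribution requirement.
    """
    return (
        monthly_active_users >= MAU_THRESHOLD
        or monthly_revenue_usd >= MONTHLY_REVENUE_THRESHOLD_USD
    )

if __name__ == "__main__":
    # A small studio: far below both thresholds.
    print(attribution_required(20_000, 150_000))         # False
    # A large platform: above the user threshold.
    print(attribution_required(80_000_000, 2_000_000))   # True
```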
Weights are downloadable from HuggingFace alongside a HuggingFace Spaces demo from the multimodalart team. Inference code and setup instructions are on GitHub, with the code licensed under Apache 2.0.
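For readers who want the checkpoint locally, a minimal sketch using the `huggingface_hub` library's `snapshot_download` function might look like the following. The repository ID is taken from the sources listed in this article; the local directory name and the wrapper function `download_weights` are hypothetical, and the official GitHub setup instructions may recommend a different procedure.

```python
# Repository ID as listed in this article's sources.
REPO_ID = "MisoLabs/MisoTTS"

def download_weights(local_dir="misotts-weights"):
    """Download the full model repository and return the local path.

    Requires `pip install huggingface_hub` and network access.
    """
    # Imported here so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

if __name__ == "__main__":
    path = download_weights()
    print(f"Weights downloaded to {path}")
```

Expect a download on the order of the checkpoint sizes discussed earlier, so check free disk space first.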
The same week, the OpenMOSS team released MOSS-Audio, an open weight audio understanding model that reasons about existing audio content: transcription, speaker identification, and acoustic event detection. MisoTTS generates speech; MOSS-Audio analyzes it. The two cover opposite ends of the audio production pipeline.
Filmmakers working on voice production can explore AI voice and sound generation in the AI FILMS Studio voice workspace and sound workspace.
Sources
Project page: misolabs.ai/blog/miso-tts-8b
GitHub: MisoLabsAI/MisoTTS
HuggingFace model: MisoLabs/MisoTTS
HuggingFace demo: multimodalart/MisoTTS
License: Modified MIT — commercial use permitted below 50M MAU / $10M MRR threshold
