This guide walks you through generating speech with Qwen3-TTS using each of the three model types: CustomVoice, VoiceDesign, and Base (voice cloning).

Prerequisites

Before you begin, make sure you have:
  • Python 3.9 or higher (Python 3.12 recommended)
  • A CUDA-compatible GPU (optional but recommended)
  • Installed the qwen-tts package (see Installation)
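
The prerequisites above can be checked with a short script before loading any model. This is a minimal sketch: `check_environment` is a hypothetical helper name, not part of the qwen-tts package, and it only reports what it finds rather than enforcing anything.

```python
import sys


def check_environment(min_python=(3, 9)):
    """Report whether the prerequisites listed above appear to be met."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    try:
        # Imported lazily so the check also runs before PyTorch is installed.
        import torch

        report["torch_installed"] = True
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_installed"] = False
        report["cuda_available"] = False
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If `cuda_available` is False, the examples below would need a different `device_map` than "cuda:0".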

CustomVoice: Generate with Preset Speakers

The CustomVoice models provide 9 premium preset voices across multiple languages.
1. Import and load the model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
2. Generate speech with a preset speaker

wavs, sr = model.generate_custom_voice(
    text="Hello, welcome to Qwen3-TTS!",
    language="English",
    speaker="Ryan",
)

sf.write("output_custom_voice.wav", wavs[0], sr)
Available speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric (Chinese); Ryan, Aiden (English); Ono_Anna (Japanese); Sohee (Korean). See the Custom Voice guide for details.
3. Add instruction control (1.7B model only)

wavs, sr = model.generate_custom_voice(
    text="I'm so excited to announce this!",
    language="English",
    speaker="Ryan",
    instruct="Very happy and enthusiastic.",
)

sf.write("output_with_instruction.wav", wavs[0], sr)
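
To catch a misspelled speaker name before calling the model, the speaker list from step 2 can be kept in a small lookup table. The sketch below is illustrative only: `PRESET_SPEAKERS` and `language_of` are hypothetical names, and the grouping reflects the language each speaker is listed under in this guide, not a statement about which languages a speaker can be used with.

```python
# Preset speakers grouped by language, copied from the list in this guide.
PRESET_SPEAKERS = {
    "Chinese": ["Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric"],
    "English": ["Ryan", "Aiden"],
    "Japanese": ["Ono_Anna"],
    "Korean": ["Sohee"],
}


def language_of(speaker):
    """Return the language a preset speaker is listed under, or raise ValueError."""
    for language, speakers in PRESET_SPEAKERS.items():
        if speaker in speakers:
            return language
    known = sorted(name for names in PRESET_SPEAKERS.values() for name in names)
    raise ValueError(f"Unknown speaker {speaker!r}; expected one of {known}")


print(language_of("Ryan"))  # English
```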

VoiceDesign: Create Custom Voices

The VoiceDesign model lets you create voices from natural language descriptions.
1. Load the VoiceDesign model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
2. Generate with a voice description

wavs, sr = model.generate_voice_design(
    text="Welcome! I'm here to help you get started.",
    language="English",
    instruct="Male, 30s, professional and friendly tone, clear articulation",
)

sf.write("output_voice_design.wav", wavs[0], sr)
Be specific in your instructions. Include age, gender, tone, emotion, accent, and speaking style for best results.
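
One way to keep descriptions consistently specific is to assemble them from named attributes. This is a sketch under assumptions: `build_voice_description` is a hypothetical helper, and the comma-separated format simply mirrors the example instruction above; the model may accept other phrasings equally well.

```python
def build_voice_description(gender=None, age=None, tone=None, emotion=None,
                            accent=None, style=None):
    """Join the attributes that were provided into one comma-separated description."""
    parts = [gender, age, tone, emotion, accent, style]
    return ", ".join(part.strip() for part in parts if part and part.strip())


instruct = build_voice_description(
    gender="Male",
    age="30s",
    tone="professional and friendly tone",
    style="clear articulation",
)
print(instruct)  # Male, 30s, professional and friendly tone, clear articulation
```

The resulting string could be passed as the `instruct` argument of `generate_voice_design`.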

Base: Clone Any Voice

The Base models can clone a voice from just 3 seconds of reference audio.
1. Load the Base model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
2. Prepare reference audio

You need a reference audio file (3+ seconds) and its transcript.
ref_audio = "path/to/reference.wav"
ref_text = "This is the text spoken in the reference audio."
The reference audio can be a local file path, URL, base64 string, or numpy array with sample rate.
3. Clone the voice

wavs, sr = model.generate_voice_clone(
    text="Now I can speak any text in this voice!",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

sf.write("output_voice_clone.wav", wavs[0], sr)
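
Because the reference clip should be at least 3 seconds long, it can help to check its duration before cloning. The sketch below uses Python's standard `wave` module, so it assumes the reference is an uncompressed PCM WAV file on disk; `wav_duration_seconds` and `is_long_enough` are hypothetical helper names.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum reference length stated in this guide


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, minimum=MIN_REFERENCE_SECONDS):
    """Return True when the clip meets the minimum reference duration."""
    return wav_duration_seconds(path) >= minimum
```

For example, `is_long_enough("path/to/reference.wav")` could gate the call to `generate_voice_clone`. Other input forms (URL, base64, arrays) would need their own check.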

Batch Processing

All models support batch processing for efficiency:
# Batch generate with CustomVoice
wavs, sr = model.generate_custom_voice(
    text=[
        "First sentence.",
        "Second sentence.",
        "Third sentence."
    ],
    language=["English", "English", "English"],
    speaker=["Ryan", "Aiden", "Ryan"],
)

# Save all outputs
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
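
When a batch is too large for available memory, the parallel lists can be sliced into smaller batches. This is a minimal sketch: `iter_batches` is a hypothetical helper, and the equal-length check is an assumption based on the example above, where each list has one entry per sentence.

```python
def iter_batches(texts, languages, speakers, batch_size):
    """Yield (texts, languages, speakers) slices of at most batch_size items."""
    if not (len(texts) == len(languages) == len(speakers)):
        raise ValueError("texts, languages, and speakers must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        yield texts[start:end], languages[start:end], speakers[start:end]


texts = ["First sentence.", "Second sentence.", "Third sentence."]
languages = ["English"] * 3
speakers = ["Ryan", "Aiden", "Ryan"]

for batch_texts, batch_languages, batch_speakers in iter_batches(texts, languages, speakers, 2):
    # Each slice could be passed to model.generate_custom_voice(
    #     text=batch_texts, language=batch_languages, speaker=batch_speakers)
    print(len(batch_texts), batch_speakers)
```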

Common Parameters

All generation methods support these optional parameters:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| do_sample | bool | True | Enable sampling for more natural speech |
| temperature | float | 1.0 | Controls randomness (0.0-1.0) |
| top_k | int | 50 | Top-k sampling parameter |
| top_p | float | 1.0 | Nucleus sampling parameter |
| max_new_tokens | int | 2048 | Maximum tokens to generate |
| non_streaming_mode | bool | False | Force non-streaming generation |
Example with custom parameters:
wavs, sr = model.generate_custom_voice(
    text="Hello world!",
    language="English",
    speaker="Ryan",
    temperature=0.7,
    top_k=100,
    top_p=0.95,
)
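
To reuse the same settings across calls and catch misspelled parameter names early, the documented defaults can be kept in one place. This is a sketch, not part of the qwen-tts API: `DEFAULT_GENERATION_PARAMS` and `generation_kwargs` are hypothetical names, and the values are copied from the parameter table in this guide.

```python
# Defaults copied from the parameter table in this guide.
DEFAULT_GENERATION_PARAMS = {
    "do_sample": True,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 2048,
    "non_streaming_mode": False,
}


def generation_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_GENERATION_PARAMS)
    if unknown:
        raise TypeError(f"Unknown generation parameters: {sorted(unknown)}")
    return {**DEFAULT_GENERATION_PARAMS, **overrides}


params = generation_kwargs(temperature=0.7, top_k=100, top_p=0.95)
print(params["temperature"], params["top_k"], params["max_new_tokens"])  # 0.7 100 2048
```

The resulting dictionary could be unpacked into a generation call, for example `model.generate_custom_voice(..., **params)`.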

Model Selection Guide

CustomVoice

Use when:
  • You need consistent, high-quality preset voices
  • You want instruction-based control (1.7B)
  • You need multilingual support
Models:
  • 0.6B: 9 preset speakers
  • 1.7B: 9 speakers + instructions

VoiceDesign

Use when:
  • You need custom voice characteristics
  • You want to describe voices in natural language
  • You need creative voice variations
Model:
  • 1.7B only

Base

Use when:
  • You need to clone specific voices
  • You have reference audio samples
  • You want to fine-tune for your use case
Models:
  • 0.6B and 1.7B
  • Both support 3-second cloning
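
The selection guide above can be captured as a small lookup that rejects unavailable combinations. This is a sketch under a stated assumption: the 1.7B repository IDs appear earlier in this guide, but the 0.6B IDs are inferred from the same naming pattern and should be verified before use. `AVAILABLE_SIZES` and `model_id` are hypothetical names.

```python
# Sizes available per model type, as listed in the selection guide above.
AVAILABLE_SIZES = {
    "CustomVoice": ("0.6B", "1.7B"),
    "VoiceDesign": ("1.7B",),
    "Base": ("0.6B", "1.7B"),
}


def model_id(model_type, size="1.7B"):
    """Build a repository ID following the naming pattern used in this guide.

    Assumption: 0.6B checkpoints follow the same
    "Qwen/Qwen3-TTS-12Hz-<size>-<type>" pattern as the 1.7B IDs shown earlier.
    """
    if model_type not in AVAILABLE_SIZES:
        raise ValueError(f"Unknown model type {model_type!r}")
    if size not in AVAILABLE_SIZES[model_type]:
        raise ValueError(f"{model_type} is not listed in size {size}")
    return f"Qwen/Qwen3-TTS-12Hz-{size}-{model_type}"


print(model_id("VoiceDesign"))  # Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
```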

Next Steps

  • Explore Guides: Learn about advanced features and workflows
  • API Reference: Explore the complete API documentation
  • Performance Tips: Optimize for speed and quality
  • Fine-tuning: Customize models for your specific needs

Troubleshooting

Models are downloaded from Hugging Face on first use. For faster downloads in China, use ModelScope:
pip install modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local_dir ./models/Qwen3-TTS-12Hz-1.7B-CustomVoice
Then load from the local directory:
model = Qwen3TTSModel.from_pretrained("./models/Qwen3-TTS-12Hz-1.7B-CustomVoice", ...)
If you run out of GPU memory, try these solutions:
  1. Use the 0.6B model instead of 1.7B
  2. Reduce batch size
  3. Use dtype=torch.float16 instead of bfloat16
  4. Generate shorter text segments
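
Generating shorter text segments can be done by splitting long input at sentence boundaries first. This is a rough sketch: `split_into_segments` is a hypothetical helper, the 200-character limit is an arbitrary illustrative value, and the splitting rule only handles sentences ending in ".", "!", or "?" followed by whitespace.

```python
import re


def split_into_segments(text, max_chars=200):
    """Group sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_into_segments("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each segment could then be generated separately and the resulting audio concatenated.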
If audio quality is poor, check these factors:
  1. Use the 12Hz models (better quality than 25Hz)
  2. Set appropriate language explicitly
  3. For voice cloning, ensure reference audio is clear (3+ seconds, good quality)
  4. Adjust temperature (try 0.7-0.9 for more stable output)
FlashAttention 2 is optional but recommended:
  • Requires CUDA 11.8 or higher
  • On machines with limited RAM, use: MAX_JOBS=4 pip install flash-attn --no-build-isolation
  • If it fails, the model will work without it (just slower)
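
Since FlashAttention 2 is optional, the attention backend can be chosen at runtime depending on whether the package is installed. This is a sketch under an assumption: "sdpa" is a standard `attn_implementation` value in Hugging Face Transformers, but this guide does not confirm that `Qwen3TTSModel.from_pretrained` accepts it. `pick_attn_implementation` is a hypothetical helper.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumed fallback value; verify that the model loader accepts it.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result could be passed as `attn_implementation=attn_implementation` when loading a model.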

Getting Help

Join the Community

Ask questions, report issues, and get help from the community on GitHub