The VoiceDesign model (Qwen3-TTS-12Hz-1.7B-VoiceDesign) allows you to generate speech with custom voice characteristics by providing natural language descriptions. Describe the voice you want, and the model will synthesize speech matching your description.
```python
# Young male voice
wavs, sr = model.generate_voice_design(
    text="Hello, how can I help you today?",
    language="English",
    instruct="Male, 25 years old, friendly and professional tone",
)

# Elderly female voice
wavs, sr = model.generate_voice_design(
    text="Come here, dear, let me tell you a story.",
    language="English",
    instruct="Female, 70 years old, warm and grandmotherly voice with slight raspiness",
)

# Teen voice with nervousness
wavs, sr = model.generate_voice_design(
    text="H-hey! You dropped your notebook. I think it's yours?",
    language="English",
    instruct="Male, 17 years old, tenor range, gaining confidence - deeper breath support now, though vowels still tighten when nervous",
)
```
Control emotional characteristics through descriptions:
```python
# Excited and energetic
wavs, sr = model.generate_voice_design(
    text="We did it! The project is finally complete!",
    language="English",
    instruct="Young female voice, extremely excited and energetic, speaking quickly with rising intonation",
)

# Worried and anxious
wavs, sr = model.generate_voice_design(
    text="I can't find my keys anywhere. What am I going to do?",
    language="English",
    instruct="Middle-aged male voice, worried and anxious tone, slightly breathless and speaking faster than normal",
)

# Incredulous with panic
wavs, sr = model.generate_voice_design(
    text="It's in the top drawer... wait, it's empty? No way, that's impossible! I'm sure I put it there!",
    language="English",
    instruct="Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice.",
)
```
```python
# British accent
wavs, sr = model.generate_voice_design(
    text="Good afternoon. Shall we proceed with the meeting?",
    language="English",
    instruct="Male, 40s, British accent, formal and refined speaking style",
)

# Southern US accent
wavs, sr = model.generate_voice_design(
    text="Well, y'all come back now, you hear?",
    language="English",
    instruct="Female, Southern US accent, warm and hospitable tone",
)
```
```python
# News anchor
wavs, sr = model.generate_voice_design(
    text="Good evening. Here are tonight's top stories.",
    language="English",
    instruct="Male, professional news anchor voice, clear articulation, authoritative but approachable, medium paced",
)

# Customer service
wavs, sr = model.generate_voice_design(
    text="Thank you for calling. How may I assist you today?",
    language="English",
    instruct="Female, customer service representative, friendly and helpful tone, clear speech",
)
```
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Batch generation: text, language, and instruct are parallel lists
wavs, sr = model.generate_voice_design(
    text=[
        "哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
        "It's in the top drawer... wait, it's empty? No way, that's impossible! I'm sure I put it there!",
    ],
    language=["Chinese", "English"],
    instruct=[
        "体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
        "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice.",
    ],
)

# Write one WAV file per generated utterance
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
```
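The batch call takes three parallel lists, which are easy to misalign when edited by hand. Below is a minimal sketch of one way to keep them in step, assuming each request is stored as a plain dict. The helper name `to_batch_kwargs` and the record layout are hypothetical and not part of `qwen_tts`; the helper only reshapes Python data and never touches the model.

```python
from typing import Dict, List

# Keyword arguments used by the batch example
REQUIRED_KEYS = ("text", "language", "instruct")


def to_batch_kwargs(items: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn per-utterance records into equal-length parallel lists.

    Hypothetical helper: raises ValueError if a record lacks a required field.
    """
    batch: Dict[str, List[str]] = {key: [] for key in REQUIRED_KEYS}
    for index, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if not item.get(key)]
        if missing:
            raise ValueError(f"item {index} is missing: {', '.join(missing)}")
        for key in REQUIRED_KEYS:
            batch[key].append(item[key])
    return batch


requests = [
    {
        "text": "Good evening. Here are tonight's top stories.",
        "language": "English",
        "instruct": "Male, professional news anchor voice, clear articulation",
    },
    {
        "text": "Thank you for calling. How may I assist you today?",
        "language": "English",
        "instruct": "Female, customer service representative, friendly and helpful tone",
    },
]

batch = to_batch_kwargs(requests)

# With a loaded VoiceDesign model, the result could be passed straight through:
# wavs, sr = model.generate_voice_design(**batch)
```

Keeping one record per utterance means a text can never be paired with the wrong language or description, at the cost of one small reshaping step before the call.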
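To keep a character's voice consistent across many lines, generate a reference clip with VoiceDesign once, then reuse it as a clone prompt with the Base model: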
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Step 1: Create reference audio with VoiceDesign
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_text = "H-hey! You dropped your... uh... calculus notebook? I mean, I think it's yours? Maybe?"
ref_instruct = "Male, 17 years old, tenor range, gaining confidence - deeper breath support now, though vowels still tighten when nervous"

ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct=ref_instruct,
)
sf.write("character_reference.wav", ref_wavs[0], sr)

# Step 2: Build a reusable clone prompt
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)

# Step 3: Use the clone prompt for new content
sentences = [
    "No problem! I actually... kinda finished those already? If you want to compare answers or something...",
    "What? No! I mean yes but not like... I just think you're... your titration technique is really precise!",
]

for i, sentence in enumerate(sentences):
    wavs, sr = clone_model.generate_voice_clone(
        text=sentence,
        language="English",
        voice_clone_prompt=voice_clone_prompt,
    )
    sf.write(f"character_line_{i}.wav", wavs[0], sr)
```
More detailed descriptions produce better results. Instead of “young voice”, try “female, 22 years old, bright and energetic tone with clear articulation”.
Combine multiple attributes
You can combine age, gender, emotion, accent, speaking style, and other characteristics in one description.
Use emotional context
Describe not just how the voice sounds, but the emotional state: “anxious and speaking quickly” or “calm and reassuring”.
Specify prosody
Include details about rhythm, pace, intonation: “speaking slowly with rising intonation at the end of sentences”.
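The tips above can be sketched as a small helper that assembles a description from separate attributes. This is an illustration only: `build_instruct` and its parameter names are hypothetical, and the comma-separated format simply mirrors the examples on this page. The model takes free-form text, so no particular ordering or separator is known to be required.

```python
from typing import Optional


def build_instruct(
    gender: Optional[str] = None,
    age: Optional[str] = None,
    accent: Optional[str] = None,
    emotion: Optional[str] = None,
    prosody: Optional[str] = None,
) -> str:
    """Join whichever voice attributes are provided into one description.

    Hypothetical helper: empty or missing attributes are skipped.
    """
    parts = [gender, age, accent, emotion, prosody]
    cleaned = [part.strip() for part in parts if part and part.strip()]
    if not cleaned:
        raise ValueError("provide at least one voice attribute")
    return ", ".join(cleaned)


instruct = build_instruct(
    gender="Female",
    age="22 years old",
    emotion="calm and reassuring",
    prosody="speaking slowly with rising intonation at the end of sentences",
)
print(instruct)
```

The resulting string can be passed as the `instruct` argument of `generate_voice_design`. Building it from named parts makes it easier to vary one attribute, such as emotion, while holding the rest of the voice constant.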