Voice Clone Prompt

Overview

The voice cloning workflow involves extracting features from reference audio and reusing them for multiple generations. The create_voice_clone_prompt() method and VoiceClonePromptItem dataclass provide this functionality. Key benefits:

Extract voice features once, generate many times
Avoid redundant audio processing
Enable batch voice cloning with different voices

VoiceClonePromptItem

@dataclass
class VoiceClonePromptItem:
    ref_code: Optional[torch.Tensor]
    ref_spk_embedding: torch.Tensor
    x_vector_only_mode: bool
    icl_mode: bool
    ref_text: Optional[str] = None

Container for one sample’s voice-clone prompt information that can be fed to the model. Fields are aligned with Qwen3TTSForConditionalGeneration.generate(..., voice_clone_prompt=...).

Source: qwen_tts/inference/qwen3_tts_model.py:40-52

Fields

ref_code

Optional[torch.Tensor]

Reference audio codes extracted by the speech tokenizer.

Shape: (T, Q) or (T,) depending on tokenizer (25Hz/12Hz)
None when x_vector_only_mode=True

ref_spk_embedding

torch.Tensor

Speaker embedding vector extracted from reference audio.

Shape: (D,) where D is the embedding dimension.

x_vector_only_mode

bool

Whether to use speaker embedding only (ignores ref_code and ref_text).

True: X-vector only mode - only speaker embedding is used
False: ICL mode - uses both embedding and reference codes/text

icl_mode

bool

Whether ICL (In-Context Learning) mode is enabled. Always the inverse of x_vector_only_mode:

True when x_vector_only_mode=False
False when x_vector_only_mode=True

ref_text

Optional[str]

Transcription of the reference audio.

Required when icl_mode=True (x_vector_only_mode=False)
Ignored when x_vector_only_mode=True
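
The field rules above can be written down as a small consistency check. This is a minimal sketch, not part of qwen_tts: PromptItemLike is a hypothetical stand-in that mirrors the documented fields (typed as Any so it runs without torch), and check_prompt_item is a hypothetical helper. The rule that ref_code must be present in ICL mode is an assumption drawn from the field descriptions.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PromptItemLike:
    """Hypothetical stand-in mirroring the documented VoiceClonePromptItem fields."""
    ref_code: Optional[Any]
    ref_spk_embedding: Any
    x_vector_only_mode: bool
    icl_mode: bool
    ref_text: Optional[str] = None


def check_prompt_item(item: Any) -> List[str]:
    """Return the documented field rules that `item` violates (empty list if none)."""
    problems: List[str] = []
    if item.icl_mode == item.x_vector_only_mode:
        problems.append("icl_mode must be the inverse of x_vector_only_mode")
    if item.x_vector_only_mode and item.ref_code is not None:
        problems.append("ref_code should be None in x-vector only mode")
    if item.icl_mode and item.ref_code is None:
        # Assumption: ICL mode needs reference codes to condition on.
        problems.append("ref_code is needed in ICL mode")
    if item.icl_mode and not item.ref_text:
        problems.append("ref_text is required in ICL mode")
    return problems
```

Because the helper only reads attributes, it should also accept a real VoiceClonePromptItem, for example check_prompt_item(prompt_items[0]).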

Example

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen3-TTS-Base-2B",
    device_map="cuda:0"
)

# Create voice clone prompt
prompt_items = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="This is the reference audio."
)

# Inspect the prompt item
item = prompt_items[0]
print(f"X-vector only: {item.x_vector_only_mode}")  # False
print(f"ICL mode: {item.icl_mode}")  # True
print(f"Ref code shape: {item.ref_code.shape}")  # e.g., (125, 8)
print(f"Embedding shape: {item.ref_spk_embedding.shape}")  # e.g., (512,)

create_voice_clone_prompt

create_voice_clone_prompt(
    ref_audio: Union[AudioLike, List[AudioLike]],
    ref_text: Optional[Union[str, List[Optional[str]]]] = None,
    x_vector_only_mode: Union[bool, List[bool]] = False
) -> List[VoiceClonePromptItem]

Build voice-clone prompt items from reference audio (and optionally reference text) using the Base model.

Model Type: Base only
Source: qwen_tts/inference/qwen3_tts_model.py:355-458

Modes

X-vector Only Mode (`x_vector_only_mode=True`)

Only the speaker embedding is used to clone the voice
ref_text and ref_code are ignored
Mutually exclusive with ICL mode
Faster and simpler, but may be less accurate

ICL Mode (`x_vector_only_mode=False`)

ICL (In-Context Learning) mode is enabled automatically
Both speaker embedding and reference codes/text are used
ref_text is required in this mode
More accurate voice cloning

Parameters

ref_audio

Union[AudioLike, List[AudioLike]]

required

Reference audio(s) used to extract:

ref_code via model.speech_tokenizer.encode(...)
ref_spk_embedding via model.extract_speaker_embedding(...) (resampled to 24kHz)

Supported formats:

str: Local wav path, URL, or base64 audio string
(np.ndarray, sr): Tuple of waveform + sampling rate
List of the above

Example:

"reference.wav"
"https://example.com/audio.wav"
(audio_array, 24000)
["ref1.wav", "ref2.wav"]

ref_text

Optional[Union[str, List[Optional[str]]]]

default: None

Reference transcript(s): the transcription of the reference audio. Required when x_vector_only_mode=False (ICL mode). Can be:

Single string (applied to all audio)
List of strings matching the length of ref_audio
None or empty string when x_vector_only_mode=True

x_vector_only_mode

Union[bool, List[bool]]

default: False

Whether to use speaker embedding only.

True: X-vector only mode (no ref_text required)
False: ICL mode (requires ref_text)

Can be:

Single boolean (applied to all audio)
List of booleans matching the length of ref_audio

Batch Behavior

ref_audio can be a single item or a list
ref_text and x_vector_only_mode can be scalars or lists
If any of them are lists with length > 1, all lists must match in length
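
The batch rules above amount to broadcasting scalars against lists and rejecting mismatched lengths. The sketch below illustrates that logic under stated assumptions: broadcast_batch is a hypothetical helper, not the library's implementation, and it assumes that a length-1 list is treated like a scalar and that a (waveform, sample_rate) tuple counts as a single audio item.

```python
from typing import Any, List, Tuple


def _as_list(value: Any) -> List[Any]:
    # Only real lists are treated as batches; a (waveform, sr) tuple is one item.
    return list(value) if isinstance(value, list) else [value]


def broadcast_batch(
    ref_audio: Any,
    ref_text: Any = None,
    x_vector_only_mode: Any = False,
) -> List[Tuple[Any, Any, Any]]:
    """Expand scalar arguments to the batch size and pair them up per sample."""
    named = [
        ("ref_audio", _as_list(ref_audio)),
        ("ref_text", _as_list(ref_text)),
        ("x_vector_only_mode", _as_list(x_vector_only_mode)),
    ]
    n = max(len(values) for _, values in named)

    columns = []
    for name, values in named:
        if len(values) == 1:
            values = values * n  # broadcast a scalar (or length-1 list)
        elif len(values) != n:
            raise ValueError(f"{name} has length {len(values)}, expected 1 or {n}")
        columns.append(values)
    return list(zip(*columns))
```

For instance, broadcast_batch(["a.wav", "b.wav"], "Same text") pairs both audio items with the same transcript, while passing three audio items and two transcripts raises ValueError.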

Returns

prompt_items

List[VoiceClonePromptItem]

List of prompt items that can be passed to generate_voice_clone(voice_clone_prompt=...). Each item contains extracted voice features ready for synthesis.

Raises

ValueError - If x_vector_only_mode=False but ref_text is missing
ValueError - If batch lengths mismatch
ValueError - If the model is not a Base model

Example

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load Base model
model = Qwen3TTSModel.from_pretrained(
    "Qwen3-TTS-Base-2B",
    device_map="cuda:0"
)

# ===== ICL Mode (default) =====
# Create prompt with reference text
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="This is the reference audio transcription."
)

# Reuse prompt for multiple generations
for text in ["Hello!", "How are you?", "Goodbye!"]:
    wavs, sr = model.generate_voice_clone(
        text=text,
        voice_clone_prompt=prompt,
        language="English"
    )
    # Process wavs...

# ===== X-vector Only Mode =====
# Create prompt without reference text
prompt_xvec = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    x_vector_only_mode=True  # ref_text not required
)

wavs, sr = model.generate_voice_clone(
    text="Quick test.",
    voice_clone_prompt=prompt_xvec,
    language="English"
)

# ===== Batch Processing =====
# Create prompts for multiple voices at once
prompts = model.create_voice_clone_prompt(
    ref_audio=["voice1.wav", "voice2.wav", "voice3.wav"],
    ref_text=[
        "First voice reference.",
        "Second voice reference.",
        "Third voice reference."
    ]
)

# Generate with different voices in batch
wavs, sr = model.generate_voice_clone(
    text=["Hello from voice 1", "Hello from voice 2", "Hello from voice 3"],
    voice_clone_prompt=prompts,
    language="English"
)

# ===== Mixed Modes =====
# Use different modes for different samples
prompts_mixed = model.create_voice_clone_prompt(
    ref_audio=["ref1.wav", "ref2.wav"],
    ref_text=["Reference 1", None],  # Only needed for ICL mode
    x_vector_only_mode=[False, True]  # ICL for first, X-vec for second
)

Audio Input Formats

Both create_voice_clone_prompt() and generate_voice_clone() support flexible audio input:

Local File Path

prompt = model.create_voice_clone_prompt(
    ref_audio="/path/to/reference.wav",
    ref_text="Reference text"
)

URL

prompt = model.create_voice_clone_prompt(
    ref_audio="https://example.com/audio.wav",
    ref_text="Reference text"
)

Base64

import base64

# Read audio file
with open("reference.wav", "rb") as f:
    audio_bytes = f.read()

# Convert to base64
audio_b64 = base64.b64encode(audio_bytes).decode()

prompt = model.create_voice_clone_prompt(
    ref_audio=audio_b64,
    ref_text="Reference text"
)

# Or with data URI
audio_uri = f"data:audio/wav;base64,{audio_b64}"
prompt = model.create_voice_clone_prompt(
    ref_audio=audio_uri,
    ref_text="Reference text"
)
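
Before passing a base64 string to the model, it can help to confirm that it decodes back to the audio you expect. The following is a small sanity-check sketch using only the standard library; decode_audio_b64 and looks_like_wav are hypothetical helpers and say nothing about how qwen_tts itself parses base64 input.

```python
import base64


def decode_audio_b64(value: str) -> bytes:
    """Decode a raw base64 string or a `data:<mime>;base64,<payload>` URI into bytes."""
    if value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a data URI of the form data:<mime>;base64,<payload>")
        value = payload
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(value, validate=True)


def looks_like_wav(data: bytes) -> bool:
    """RIFF/WAVE files start with b'RIFF' and carry b'WAVE' at byte offset 8."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```

With the variables from the example above, looks_like_wav(decode_audio_b64(audio_uri)) should be True when reference.wav is a standard WAV file.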

NumPy Array

import numpy as np
import librosa

# Load audio as numpy array
audio, sr = librosa.load("reference.wav", sr=None)

# Pass as tuple (audio, sample_rate)
prompt = model.create_voice_clone_prompt(
    ref_audio=(audio, sr),
    ref_text="Reference text"
)

Best Practices

1. Reuse Prompts

Create prompts once and reuse them for multiple generations:

import soundfile as sf

texts = ["Hello!", "How are you?", "Goodbye!"]  # example inputs

# ✅ Efficient - extract once, generate many times
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference"
)

for i, text in enumerate(texts):
    wavs, sr = model.generate_voice_clone(
        text=text,
        voice_clone_prompt=prompt,  # Reuse
        language="English"
    )
    sf.write(f"output_{i}.wav", wavs[0], sr)

# ❌ Inefficient - extracts features every time
for i, text in enumerate(texts):
    wavs, sr = model.generate_voice_clone(
        text=text,
        ref_audio="reference.wav",  # Re-extracts every time
        ref_text="Reference",
        language="English"
    )
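
When several parts of an application share the same reference voices, the reuse pattern can be wrapped in a small cache. This is a sketch under assumptions: PromptCache is a hypothetical helper, not part of qwen_tts, and it only supports hashable inputs such as file paths or URLs (not NumPy tuples). It takes the prompt-building function as an argument, so model.create_voice_clone_prompt can be passed in.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class PromptCache:
    """Hypothetical helper: build each prompt once per (ref_audio, ref_text, mode) key."""

    def __init__(self, build: Callable[..., Any]) -> None:
        # `build` is expected to accept the same keyword arguments as
        # model.create_voice_clone_prompt.
        self._build = build
        self._cache: Dict[Tuple[str, Optional[str], bool], Any] = {}

    def get(
        self,
        ref_audio: str,
        ref_text: Optional[str] = None,
        x_vector_only_mode: bool = False,
    ) -> Any:
        key = (ref_audio, ref_text, x_vector_only_mode)
        if key not in self._cache:
            # Feature extraction happens only on the first request for this key.
            self._cache[key] = self._build(
                ref_audio=ref_audio,
                ref_text=ref_text,
                x_vector_only_mode=x_vector_only_mode,
            )
        return self._cache[key]
```

A typical use would be cache = PromptCache(model.create_voice_clone_prompt) followed by cache.get("reference.wav", "Reference"). The cache is keyed by the path string, so it will not notice if the file's contents change on disk.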

2. Choose the Right Mode

ICL Mode (x_vector_only_mode=False): Better quality, requires reference text
X-vector Only Mode (x_vector_only_mode=True): Faster, no reference text needed

# Use ICL mode for best quality
prompt_icl = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Accurate transcription here",
    x_vector_only_mode=False  # Default
)

# Use X-vector mode for speed or when text is unavailable
prompt_xvec = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    x_vector_only_mode=True
)

3. Reference Audio Quality

Use clean, high-quality reference audio (minimal background noise)
3-10 seconds of speech is usually sufficient
Ensure the reference text exactly matches the audio (for ICL mode)
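
The length guideline above can be checked before feature extraction. The sketch below uses Python's standard wave module, which reads uncompressed PCM WAV files only; the helper names are hypothetical, and the 3 and 10 second bounds are the guideline from this page rather than limits enforced by the library.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def within_suggested_length(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check a reference clip against the suggested 3-10 second range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A clip outside the range may still work; the check is only meant to flag references that are unusually short or long.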
