vLLM officially provides day-0 support for Qwen3-TTS through vLLM-Omni! This integration enables efficient deployment and inference for speech generation workloads.
Overview
vLLM-Omni extends vLLM with multimodal capabilities, including support for text-to-speech models like Qwen3-TTS. Key benefits:
- Optimized inference: Faster generation compared to standard PyTorch inference
- Efficient memory usage: Better GPU memory management for batch processing
- Production-ready: Battle-tested serving infrastructure
- Continuous optimization: Ongoing improvements for speed and streaming capabilities
Current Status: Only offline inference is supported. Online serving will be supported in future releases.
Installation
Install vLLM-Omni following the official installation guide.
Offline Inference
vLLM-Omni supports all three Qwen3-TTS task types: CustomVoice, VoiceDesign, and Base (voice cloning).
Setup
Navigate to the examples directory of the vLLM-Omni repository.
CustomVoice Task
Generate speech using predefined speaker voices with optional instruction control. The example runs a single sample.
VoiceDesign Task
Create custom voices based on natural language descriptions. The example runs a single sample.
Base Task (Voice Clone)
Clone a voice from a reference audio sample. The example runs a single sample in in-context learning (ICL) mode, which takes three inputs:
- Reference audio (`ref_audio`)
- Reference transcript (`ref_text`)
- Target text to synthesize
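The three task types above can be driven from a small Python wrapper. The sketch below only builds the command line for the example script and does not run it. The script name `end2end.py` and the `--query-type` and `--mode-tag` flags are assumptions that this page does not confirm, so check them against the script's `--help` output; `--use-batch-sample` is the batch flag this page mentions under Batch Processing. `build_command` itself is a hypothetical helper, not part of vLLM-Omni.

```python
import sys

# Task types named in this guide.
TASKS = ("CustomVoice", "VoiceDesign", "Base")


def build_command(task, batch=False, mode_tag=None, script="end2end.py"):
    """Return an argv list for the example script (nothing is executed).

    Assumptions: the script name and the --query-type / --mode-tag flags
    are not confirmed by this guide; verify them with `--help`.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if mode_tag is not None and task != "Base":
        raise ValueError("mode_tag only applies to the Base (voice clone) task")
    cmd = [sys.executable, script, "--query-type", task]
    if batch:
        cmd.append("--use-batch-sample")
    if mode_tag is not None:
        cmd += ["--mode-tag", mode_tag]
    return cmd


if __name__ == "__main__":
    for cmd in (
        build_command("CustomVoice"),
        build_command("VoiceDesign", batch=True),
        build_command("Base", mode_tag="icl"),
    ):
        # Print everything after the interpreter path for readability.
        print(" ".join(cmd[1:]))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)` from inside the examples directory.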
Supported Models
All Qwen3-TTS models are supported via vLLM-Omni:
| Model | Task Type | vLLM Support |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice | CustomVoice | ✅ |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | CustomVoice | ✅ |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | VoiceDesign | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Voice Clone | ✅ |
| Qwen3-TTS-12Hz-0.6B-Base | Voice Clone | ✅ |
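As a convenience, the table above can be encoded as a lookup so that a script fails early on an unsupported combination (the table lists no 0.6B VoiceDesign model). This is a minimal sketch; `SUPPORTED_MODELS` and `resolve_model` are hypothetical names, and the model names are copied from the table without any hub or organization prefix.

```python
# (task type, size) -> model name, copied from the table above.
SUPPORTED_MODELS = {
    ("CustomVoice", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("CustomVoice", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("VoiceDesign", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ("Base", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("Base", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
}


def resolve_model(task, size="1.7B"):
    """Return the model name for a task/size pair, or raise if unsupported."""
    try:
        return SUPPORTED_MODELS[(task, size)]
    except KeyError:
        # List the sizes that do exist for this task to make the error useful.
        sizes = sorted(s for (t, s) in SUPPORTED_MODELS if t == task)
        raise ValueError(
            f"no {size} model for task {task!r}; available sizes: {sizes}"
        ) from None
```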
Example Code
For the vLLM-Omni inference code, including complete working examples, refer to the vLLM-Omni repository.
Performance Considerations
Batch Processing
vLLM-Omni excels at batch inference. When processing multiple requests:
- Use the `--use-batch-sample` flag for batch processing
- Larger batches improve GPU utilization
- Balance batch size with GPU memory constraints
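One simple way to balance batch size against memory is to split the input texts into fixed-size chunks and submit one chunk at a time. The helper below is a generic sketch, not part of vLLM-Omni; `chunk_texts` is a hypothetical name, and the right `batch_size` depends on your GPU.

```python
def chunk_texts(texts, batch_size):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


if __name__ == "__main__":
    prompts = ["Hello.", "How are you?", "Good morning.", "See you soon.", "Bye."]
    for batch in chunk_texts(prompts, batch_size=2):
        # Each batch would be handed to the inference call in one go.
        print(len(batch), batch)
```

If a batch runs out of GPU memory, lowering `batch_size` is the first thing to try.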
Memory Management
vLLM automatically manages memory allocation:
- KV cache optimization: Efficient attention computation
- Paged attention: Better memory utilization
- Dynamic batching: Automatically groups requests
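To see why paged allocation improves memory utilization, compare the unused KV-cache slots when each sequence reserves a full maximum-length buffer with the unused slots when memory is handed out in small fixed-size blocks. The code below is toy accounting only, not vLLM's allocator; the block size of 16 and the maximum length of 512 are illustrative assumptions.

```python
import math


def blocks_needed(num_tokens, block_size=16):
    """Number of fixed-size blocks required to hold `num_tokens` entries."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(seq_lens, block_size=16):
    """Unused slots when each sequence only holds the blocks it needs."""
    return sum(blocks_needed(n, block_size) * block_size - n for n in seq_lens)


def wasted_slots_contiguous(seq_lens, max_len):
    """Unused slots when each sequence reserves `max_len` slots up front."""
    return sum(max_len - n for n in seq_lens)


if __name__ == "__main__":
    lengths = [10, 100, 37]
    print("paged waste:", wasted_slots_paged(lengths))
    print("contiguous waste:", wasted_slots_contiguous(lengths, max_len=512))
```

With paging, the waste per sequence is bounded by one block, regardless of how far the sequence falls short of the maximum length.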
Model Selection
Choose the right model for your use case:
| Model Size | Speed | Quality | Use Case |
|---|---|---|---|
| 0.6B | Faster | Good | Real-time, high-throughput |
| 1.7B | Moderate | Excellent | Production, best quality |
GPU Recommendations
| Model | Minimum VRAM | Recommended VRAM | Batch Size |
|---|---|---|---|
| 0.6B | 8GB | 16GB | 4-16 |
| 1.7B | 12GB | 24GB | 2-8 |
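The table above can be turned into a small selection helper. This is a sketch under stated assumptions: the VRAM figures and batch ranges are copied from the table, while the rule of starting at the low end of the batch range below the recommended VRAM and at the high end otherwise is an illustrative heuristic, not guidance from vLLM-Omni. `pick_model_size` and `starting_batch_size` are hypothetical names.

```python
# Values copied from the GPU Recommendations table.
GPU_GUIDE = {
    "0.6B": {"min_vram_gb": 8, "recommended_vram_gb": 16, "batch_range": (4, 16)},
    "1.7B": {"min_vram_gb": 12, "recommended_vram_gb": 24, "batch_range": (2, 8)},
}


def pick_model_size(vram_gb, prefer_quality=True):
    """Return the model size that fits in `vram_gb`, by stated preference."""
    fitting = [s for s, g in GPU_GUIDE.items() if vram_gb >= g["min_vram_gb"]]
    if not fitting:
        raise ValueError(f"{vram_gb} GB is below the minimum for every model")
    order = ["1.7B", "0.6B"] if prefer_quality else ["0.6B", "1.7B"]
    return next(s for s in order if s in fitting)


def starting_batch_size(size, vram_gb):
    """Heuristic (assumption): low end of the range unless VRAM is at the
    recommended level, in which case start at the high end."""
    guide = GPU_GUIDE[size]
    low, high = guide["batch_range"]
    return high if vram_gb >= guide["recommended_vram_gb"] else low
```

For example, `pick_model_size(10)` falls back to the 0.6B model because 10 GB is below the 12 GB minimum listed for 1.7B.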
Upcoming Features
vLLM-Omni is actively developing additional features:
- Online serving: HTTP API for real-time generation
- Streaming support: Stream audio as it’s generated
- Multi-GPU inference: Tensor parallelism for large-scale deployment
- Quantization: INT8/INT4 quantization for faster inference
Comparison: vLLM vs PyTorch
| Feature | PyTorch (qwen-tts) | vLLM-Omni |
|---|---|---|
| Inference mode | Offline + streaming | Offline (serving coming) |
| Batch optimization | Manual | Automatic |
| Memory management | Standard | Optimized (paged attention) |
| Deployment | Direct Python | Production-ready serving |
| Speed | Baseline | Optimized |
| Ease of use | Simple API | Requires setup |
Use PyTorch (qwen-tts) for:
- Quick prototyping and testing
- Local demos and experiments
- Streaming generation (currently)
- Simple integration needs
Use vLLM-Omni for:
- Production deployments
- High-throughput batch processing
- Optimized resource utilization
- Scalable serving infrastructure
Next Steps
- Optimize inference with performance tuning
- Deploy to production using DashScope API
- Fine-tune models with the fine-tuning guide