Alibaba Qwen3 TTS Flash — Fast Text-to-Speech
Qwen3 TTS Flash is Alibaba's low-latency text-to-speech (TTS) model. It produces natural-sounding speech in English and Chinese with multiple voice styles, and is designed for real-time conversations, product narration, and short-form video dubbing.
Highlights
- Low latency / high concurrency for real-time interaction
- Multi-language, multi-style voices, with English and Chinese as the primary languages
- Parameter control: speed, pitch, volume, speaker (voice_id), emotion
- Production-ready: stable output, easy integration, common audio formats
Input & Parameters
- text (string, required): The text to synthesize (recommended < 2000 characters per request)
- voice_id (string, optional): Voice style ID (e.g., qwen-female-1, qwen-male-1; see platform docs for the full list)
- language (string, optional): Language code (en, zh)
- speed (number, optional): Speaking rate, default 1.0 (range 0.5–2.0)
- pitch (number, optional): Pitch adjustment, default 0
- volume (number, optional): Output gain, default 0
- emotion (string, optional): Voice emotion/style, e.g., neutral, happy, sad
- sample_rate (int, optional): Sample rate in Hz, default 22050 (e.g., 16000, 22050, 24000, 44100)
- format (string, optional): Output format, default mp3 (supports mp3, wav, ogg)
Note: The available speakers and parameter ranges depend on the platform configuration.
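The parameter list above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper name `build_tts_input` is hypothetical, the limits are copied from the list above, and treating the recommended 2000-character limit as a hard error is an assumption made for illustration. The platform may enforce different ranges.

```python
# Hypothetical client-side helper; limits mirror the documented parameter list.
ALLOWED_LANGUAGES = {"en", "zh"}
ALLOWED_FORMATS = {"mp3", "wav", "ogg"}
MAX_RECOMMENDED_CHARS = 2000  # documented as a recommendation, enforced here by choice


def build_tts_input(text, voice_id=None, language=None, speed=1.0, fmt="mp3"):
    """Return an `input` dict, raising ValueError on out-of-range values."""
    if not text:
        raise ValueError("text is required")
    if len(text) >= MAX_RECOMMENDED_CHARS:
        raise ValueError("text should be shorter than 2000 characters per request")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if language is not None and language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    payload = {"text": text, "speed": speed, "format": fmt}
    # Optional fields are only included when the caller sets them.
    if voice_id is not None:
        payload["voice_id"] = voice_id
    if language is not None:
        payload["language"] = language
    return payload


example_input = build_tts_input(
    "Hello, welcome to WaveSpeedAI!", voice_id="qwen-female-1", language="en"
)
print(example_input)
```

Rejecting bad values locally avoids spending a request on input the service would likely refuse. Pitch, volume, emotion, and sample_rate are left out because their valid ranges are not stated above.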
Pricing
- Formula: total_price = base_price * text_length / 1000
- Current base_price: 1000 (unit depends on platform configuration)
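As a worked example of the formula, the sketch below estimates the price of a request. It assumes that text_length is the number of characters in the input text, which the formula does not state explicitly. The function name `estimate_price` is hypothetical, and the result is in whatever unit the platform uses for base_price.

```python
def estimate_price(text, base_price=1000):
    """Apply total_price = base_price * text_length / 1000."""
    # Assumption: text_length is the character count of the input text.
    return base_price * len(text) / 1000


print(estimate_price("Hello, welcome to WaveSpeedAI!"))
```

Under this assumption and the current base_price of 1000, the 30-character greeting costs 30 units and a 500-character text costs 500 units.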
Example
{
  "model": "alibaba/qwen3-tts-flash",
  "input": {
    "text": "Hello, welcome to WaveSpeedAI!",
    "voice_id": "qwen-female-1",
    "language": "en",
    "speed": 1.0,
    "format": "mp3"
  }
}
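For text longer than the recommended 2000 characters, one option is to split it and build one request body per piece, each shaped like the example above. This is a rough sketch under stated assumptions: `split_text` and `build_requests` are hypothetical helpers, the split prefers spaces and falls back to a hard cut (which is what happens for Chinese text without spaces), and no HTTP call is made because the endpoint and authentication are platform-specific.

```python
import json

MODEL_ID = "alibaba/qwen3-tts-flash"


def split_text(text, max_chars=1999):
    """Split text into pieces of at most max_chars, preferring to break at spaces."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Look for the last space inside the window; fall back to a hard cut.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


def build_requests(text, **params):
    """Return one JSON request body per chunk, with params copied into each input."""
    return [
        json.dumps(
            {"model": MODEL_ID, "input": {"text": chunk, **params}},
            ensure_ascii=False,  # keep Chinese text readable in the body
        )
        for chunk in split_text(text)
    ]


long_text = "Hello, welcome to WaveSpeedAI! " * 100
bodies = build_requests(long_text, voice_id="qwen-female-1", language="en")
print(len(bodies))
```

The resulting audio files would need to be concatenated by the caller. Splitting at sentence boundaries rather than at arbitrary spaces would likely give more natural prosody.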
Use Cases
- Real-time conversational agents / voice replies
- Short-form video, advertising, and e-commerce dubbing
- App/IoT voice prompts and announcements
- Education, customer service, and knowledge base narration