MiniMax Speech 2.6 HD
High-definition Text-to-Speech (TTS) with natural pronunciation and crisp articulation. Supports custom cloned voices and built-in voices, adjustable speed, volume, and pitch, and coverage of 40+ languages for professional audio creation.
Features
- Multilingual upgrade: Stronger English and overall multilingual similarity, accuracy, and rhythm vs. Speech 02; seamless switching across 40+ languages for meetings, podcasts, and daily dialog.
- Lifelike tone replication: Control over language, accent, style, and emotion; preserves cross-language and regional accents and “age” timbre with high fidelity.
- Global language set (40+): Expanded catalog including Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, and more—ideal for cross-border commerce, customer support, and localized marketing.
How to Use
1) Choose a Voice (voice_id)
Use either a custom voice you trained (voice cloning) or a built-in system voice (case-sensitive):
Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman,
Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl,
Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl
- See the full list and samples: Voice_ID list
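Because voice IDs are case-sensitive, it can help to check a voice_id locally before sending a request. The following is a minimal sketch, not part of any official SDK: the function name resolve_voice_id and the custom_voice_ids argument are hypothetical, and the built-in set only covers the voices named in this section, not the full list.

```python
# Built-in voice IDs copied from this section (the full catalog may contain more).
BUILT_IN_VOICES = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})


def resolve_voice_id(voice_id, custom_voice_ids=()):
    """Return voice_id if it exactly matches a known voice, else raise ValueError.

    custom_voice_ids is a hypothetical list of IDs for voices you have cloned.
    """
    known = BUILT_IN_VOICES | set(custom_voice_ids)
    if voice_id in known:
        return voice_id
    # Matching is case-sensitive, so offer a hint when only the case differs.
    by_lower = {v.lower(): v for v in known}
    hint = by_lower.get(voice_id.lower())
    if hint is not None:
        raise ValueError(f"Unknown voice_id {voice_id!r}; did you mean {hint!r}?")
    raise ValueError(f"Unknown voice_id {voice_id!r}")


if __name__ == "__main__":
    print(resolve_voice_id("Calm_Woman"))
```

Passing "calm_woman" raises a ValueError that suggests "Calm_Woman", which catches the most likely mistake early.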
2) Set Audio Parameters (mapped to the UI dropdowns)
- english_normalization (boolean)
Improves English text normalization, especially numbers/units (e.g., “$1,299” → “one thousand two hundred ninety-nine dollars”).
- sample_rate (Hz)
22050 / 24000 / 44100 / 48000. Tip: 44.1 kHz for music/podcasts; 48 kHz for video.
- bitrate (bps for MP3/OGG)
64k / 96k / 128k / 192k / 256k / 320k, i.e., 64000 to 320000 bps. Tip: ≥192k for distribution; 96–128k for previews.
- channel: mono or stereo
Mono is smaller/clearer for speech; stereo for spatial mixes.
- format: mp3, wav, ogg, flac
wav is lossless (larger); mp3 is compact and web-friendly.
- language_boost (IETF code: en, zh, ja, …)
Prioritizes a primary language in mixed-language inputs.
Prosody controls
- speed: speaking rate (e.g., 0.8–1.2)
- volume: gain (linear or dB, depending on API)
- pitch: pitch shift (semitones/cents or normalized)
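The parameters above can be collected into a single settings dictionary and validated before use. This is a sketch under stated assumptions: the key names follow the parameter names in this section, but the function build_audio_settings, its defaults, and the flat dictionary shape are hypothetical and may not match the actual request schema. The units and ranges for speed, volume, and pitch are not specified here, so those values are passed through unchecked.

```python
# Allowed values taken from the parameter list in this section.
SAMPLE_RATES = {22050, 24000, 44100, 48000}
BITRATES = {64000, 96000, 128000, 192000, 256000, 320000}
CHANNELS = {"mono", "stereo"}
FORMATS = {"mp3", "wav", "ogg", "flac"}
BITRATE_FORMATS = {"mp3", "ogg"}  # bitrate is described as applying to MP3/OGG


def build_audio_settings(voice_id, *, format="mp3", sample_rate=44100,
                         bitrate=128000, channel="mono",
                         english_normalization=False, language_boost=None,
                         speed=None, volume=None, pitch=None):
    """Validate the documented options and return them as a dict (hypothetical shape)."""
    if format not in FORMATS:
        raise ValueError(f"format must be one of {sorted(FORMATS)}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(SAMPLE_RATES)}")
    if channel not in CHANNELS:
        raise ValueError(f"channel must be one of {sorted(CHANNELS)}")

    settings = {
        "voice_id": voice_id,
        "format": format,
        "sample_rate": sample_rate,
        "channel": channel,
        "english_normalization": english_normalization,
    }
    # Only include bitrate for the formats it is documented for.
    if format in BITRATE_FORMATS:
        if bitrate not in BITRATES:
            raise ValueError(f"bitrate must be one of {sorted(BITRATES)}")
        settings["bitrate"] = bitrate
    # Optional values are added only when the caller sets them.
    optional = {"language_boost": language_boost, "speed": speed,
                "volume": volume, "pitch": pitch}
    settings.update({k: v for k, v in optional.items() if v is not None})
    return settings


if __name__ == "__main__":
    print(build_audio_settings("Wise_Woman", format="wav", sample_rate=48000,
                               english_normalization=True, language_boost="en"))
```

Leaving speed, volume, and pitch out of the dictionary unless they are set avoids guessing at defaults that this section does not state.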
Pricing
- Price: $0.10 / 1,000 characters
Quick examples
- 1,000 chars → $0.10
- 2,500 chars → $0.10 × 2.5 = $0.25
- 10,000 chars → $1.00
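The arithmetic above can be wrapped in a small helper for budgeting. This is a rough sketch that assumes strictly proportional billing at the listed rate, with no rounding or minimum charge, and that counts characters with Python's len(), which may differ from how the service counts them. The name estimate_cost is hypothetical.

```python
from decimal import Decimal

# USD per 1,000 characters, taken from the price listed above.
PRICE_PER_1000_CHARS = Decimal("0.10")


def estimate_cost(text_or_count):
    """Estimate the cost in USD for a string or for a character count."""
    if isinstance(text_or_count, int):
        chars = text_or_count
    else:
        chars = len(text_or_count)
    if chars < 0:
        raise ValueError("character count must be non-negative")
    # Decimal keeps the arithmetic exact for currency-style values.
    return PRICE_PER_1000_CHARS * chars / 1000


if __name__ == "__main__":
    for n in (1_000, 2_500, 10_000):
        print(n, "chars ->", estimate_cost(n), "USD")
```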
Typical Use Cases
Short-video and ad voiceovers, e-learning/courseware, AI assistants and IVR, podcasts/audiobooks, cross-border e-commerce localization.
Best-Practice Presets (optional)
- Video voiceover: format=wav, sample_rate=48000, channel=mono, english_normalization=true
- Web preview: format=mp3, sample_rate=44100, bitrate=128000, channel=mono
- Podcast: format=mp3, sample_rate=44100, bitrate=192000–320000, channel=stereo if mixing music
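One way to reuse these presets is to store them as dictionaries and apply per-job overrides. The sketch below is illustrative only: the preset names and the preset_settings helper are hypothetical, and the podcast preset defaults to the low end of the suggested bitrate range and to mono, leaving stereo and higher bitrates as overrides.

```python
# Preset values copied from the list above; names are illustrative.
PRESETS = {
    "video_voiceover": {
        "format": "wav", "sample_rate": 48000,
        "channel": "mono", "english_normalization": True,
    },
    "web_preview": {
        "format": "mp3", "sample_rate": 44100,
        "bitrate": 128000, "channel": "mono",
    },
    "podcast": {
        # Assumed defaults: lowest suggested bitrate, mono unless music is mixed in.
        "format": "mp3", "sample_rate": 44100,
        "bitrate": 192000, "channel": "mono",
    },
}


def preset_settings(name, **overrides):
    """Return a copy of the named preset with any overrides applied."""
    if name not in PRESETS:
        raise KeyError(f"unknown preset {name!r}; choose from {sorted(PRESETS)}")
    # Merging into a new dict leaves the shared preset untouched.
    return {**PRESETS[name], **overrides}


if __name__ == "__main__":
    # Podcast episode with background music: switch to stereo at a higher bitrate.
    print(preset_settings("podcast", channel="stereo", bitrate=320000))
```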