MiniMax Speech 2.6 HD

High-definition Text-to-Speech (TTS) with natural pronunciation and crisp articulation. Supports custom cloned voices and built-in voices, adjustable speed, volume, and pitch, and coverage of 40+ languages for professional audio creation.

Features

Multilingual upgrade: Stronger English and overall multilingual similarity, accuracy, and rhythm vs. Speech 02; seamless switching across 40+ languages for meetings, podcasts, and daily dialog.
Lifelike tone replication: Control over language, accent, style, and emotion; preserves cross-language and regional accents and “age” timbre with high fidelity.
Global language set (40+): Expanded catalog including Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, and more—ideal for cross-border commerce, customer support, and localized marketing.

How to Use

1) Choose a Voice (voice_id)

Use either a custom voice you trained (voice cloning) or a built-in system voice (case-sensitive):

Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman,
Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl,
Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl

See the full list and samples: Voice_ID list
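
Because built-in voice names are case-sensitive, a capitalization slip is an easy mistake to make. The sketch below checks a voice_id against the built-in names listed above and returns the exact spelling when only the case differs. The helper name canonical_builtin is hypothetical and not part of any MiniMax SDK; an ID that matches nothing is assumed to be a custom cloned voice and is left for the service to validate.

```python
from typing import Optional

# Built-in system voices copied from the list above; IDs are case-sensitive.
BUILT_IN_VOICES = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})

# Lookup table from lowercased name to the exact spelling.
_BY_LOWER = {name.lower(): name for name in BUILT_IN_VOICES}


def canonical_builtin(voice_id: str) -> Optional[str]:
    """Hypothetical helper: return the exact built-in spelling, or None.

    None means the ID is not a built-in voice (it may be a custom clone).
    """
    return _BY_LOWER.get(voice_id.lower())
```

For example, canonical_builtin("wise_woman") returns "Wise_Woman", which is the spelling to send as voice_id.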

2) Set Audio Parameters (mapped to the UI dropdowns)

english_normalization (boolean): Improves English text normalization, especially numbers and units (e.g., “$1,299” → “one thousand two hundred ninety-nine dollars”).
sample_rate (Hz): 22050 / 24000 / 44100 / 48000. Tip: 44.1 kHz for music/podcasts; 48 kHz for video.
bitrate (bps, for MP3/OGG): 64k / 96k / 128k / 192k / 256k / 320k. Tip: ≥192k for distribution; 96–128k for previews.
channel: mono or stereo. Mono is smaller and clearer for speech; stereo suits spatial mixes.
format: mp3, wav, ogg, or flac. wav is lossless (larger); mp3 is compact and web-friendly.
language_boost (IETF code: en, zh, ja, …): Prioritizes a primary language in mixed-language inputs.
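
One way to catch invalid combinations before sending a request is to validate the settings locally. The sketch below is an illustration, not the official client: the AudioSettings class, its default values, and the integer bps encoding of bitrate (64k written as 64000) are assumptions. The allowed values themselves come from the list above.

```python
from dataclasses import dataclass, asdict

# Allowed values taken from the parameter list above.
SAMPLE_RATES = (22050, 24000, 44100, 48000)
BITRATES = (64000, 96000, 128000, 192000, 256000, 320000)
CHANNELS = ("mono", "stereo")
FORMATS = ("mp3", "wav", "ogg", "flac")


@dataclass
class AudioSettings:
    """Hypothetical container; the defaults here are assumptions."""
    format: str = "mp3"
    sample_rate: int = 44100
    bitrate: int = 128000
    channel: str = "mono"
    english_normalization: bool = False
    language_boost: str = "en"

    def __post_init__(self) -> None:
        # Reject any value that is not in the documented option lists.
        if self.format not in FORMATS:
            raise ValueError(f"format must be one of {FORMATS}")
        if self.sample_rate not in SAMPLE_RATES:
            raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")
        if self.bitrate not in BITRATES:
            raise ValueError(f"bitrate must be one of {BITRATES}")
        if self.channel not in CHANNELS:
            raise ValueError(f"channel must be one of {CHANNELS}")


# Build a settings dict for a 48 kHz WAV render.
settings = asdict(AudioSettings(format="wav", sample_rate=48000))
```

An unsupported value such as sample_rate=16000 raises ValueError at construction time rather than failing later in the request.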

Prosody controls

speed: speaking rate (e.g., 0.8–1.2)
volume: gain (linear or dB, depending on API)
pitch: pitch shift (semitones/cents or normalized)
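
When speed comes from user input, it can help to keep it inside a working range before passing it on. The sketch below treats the 0.8–1.2 example above as that range. This is an assumption: the source gives 0.8–1.2 only as an example, not as the service's actual limits, and clamp_speed is a hypothetical helper.

```python
def clamp_speed(speed: float, low: float = 0.8, high: float = 1.2) -> float:
    """Hypothetical helper: keep speed inside a chosen working range.

    The default bounds mirror the example range in the text and are not
    confirmed service limits.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(high, speed))
```

Volume and pitch are left out because their units depend on the API, as noted above.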

Pricing

Price: $0.10 / 1,000 characters

Quick examples

1,000 chars → $0.10
2,500 chars → $0.10 × 2.5 = $0.25
10,000 chars → $1.00
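
The same arithmetic can be scripted for budgeting. This is a minimal sketch that assumes billing scales linearly with character count at the listed rate. How the service counts characters or rounds the final charge is not stated here, so treat the result as an estimate. The function name estimate_cost is illustrative.

```python
from decimal import Decimal

# Listed price: $0.10 per 1,000 characters.
PRICE_PER_1000_CHARS = Decimal("0.10")


def estimate_cost(num_chars: int) -> Decimal:
    """Estimate the cost in USD, assuming linear per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_1000_CHARS * num_chars / 1000
```

Calling estimate_cost(2500) reproduces the $0.25 figure above. Decimal is used so the result is not affected by binary floating-point rounding.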

Typical Use Cases

Short-video and ad voiceovers, e-learning/courseware, AI assistants and IVR, podcasts/audiobooks, cross-border e-commerce localization.

Best-Practice Presets (optional)

Video voiceover: format=wav, sample_rate=48000, channel=mono, english_normalization=true
Web preview: format=mp3, sample_rate=44100, bitrate=128000, channel=mono
Podcast: format=mp3, sample_rate=44100, bitrate=192000–320000, channel=stereo if mixing music
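
The presets above can be kept as plain dictionaries and adjusted per job. In this sketch the names PRESETS and build_settings are hypothetical. Two podcast starting points are assumptions as well: bitrate 192000, the low end of the suggested 192000–320000 range, and channel mono, since the preset recommends stereo only when mixing music.

```python
# Presets transcribed from the list above.
PRESETS = {
    "video_voiceover": {
        "format": "wav",
        "sample_rate": 48000,
        "channel": "mono",
        "english_normalization": True,
    },
    "web_preview": {
        "format": "mp3",
        "sample_rate": 44100,
        "bitrate": 128000,
        "channel": "mono",
    },
    "podcast": {
        "format": "mp3",
        "sample_rate": 44100,
        "bitrate": 192000,  # assumed starting point; up to 320000
        "channel": "mono",  # assumed; switch to stereo if mixing music
    },
}


def build_settings(preset: str, **overrides) -> dict:
    """Return a copy of a preset with any overrides applied."""
    settings = dict(PRESETS[preset])  # copy so the preset is not mutated
    settings.update(overrides)
    return settings


# Podcast episode with a music bed: stereo at the highest bitrate.
podcast_with_music = build_settings("podcast", channel="stereo", bitrate=320000)
```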

minimax/speech-2.6-hd
