MiniMax Voice Clone

MiniMax Voice Clone is a state-of-the-art voice synthesis and cloning pipeline from MiniMax. It turns a short reference clip into a reusable voice ID, then uses MiniMax Speech models to generate speech that closely matches the speaker’s timbre, accent, and style. The system is built on the MiniMax Speech-02 and Speech-2.6 families, which deliver high-fidelity, multilingual, low-latency TTS for production use.

We now also support MiniMax’s latest-generation models: Speech 2.6 HD and Speech 2.6 Turbo.

Key Features

High-Fidelity Voice Cloning: Generates speech that is perceptually very close to the reference speaker, with natural prosody, clear pronunciation, and stable timbre across long passages.
Few-Second Voice Adaptation: Uses a learnable speaker encoder to extract timbre from just a few seconds of audio, enabling fast zero-shot or one-shot voice cloning without a transcript.
Emotion and Style Control: Exposes parameters for speaking rate, pitch, loudness, and emotion, making it suitable for storytelling, dialogue, gaming characters, and branded voices.
Multilingual & Cross-Lingual Output: Supports dozens of languages (30+ in Speech-02 and 40+ in Speech-2.6 on WaveSpeedAI), with robust accent control and smooth code-switching between languages.
Low-Latency Inference: Speech-02-Turbo and Speech-2.6-Turbo are optimized for real-time scenarios, with sub-second end-to-end latency; under 250 ms is reported for Speech-2.6 in typical interactive settings.

Use Cases

AI voiceovers for YouTube, TikTok, and other content platforms
Personalized digital assistants and customer-service bots
Audiobook and podcast narration in a specific, consistent voice
In-game characters, VTubers, and interactive story experiences
Assistive speech applications for users who have lost or cannot safely use their natural voice

Model Overview

MiniMax Voice Clone is built around a neural TTS pipeline with:

A speaker encoder that extracts a compact voice embedding from a short reference clip
A text-to-audio generator (Speech-02 / Speech-2.6 HD or Turbo) that conditions on both text and the voice embedding
Optional controls for language, pace, pitch, and emotion

This design combines the clarity of studio-grade TTS with flexible voice cloning, making it suitable for both offline content production and real-time agents.

How to Use

Upload or paste your reference audio
- In the audio field, upload a short, clean voice clip (or paste a direct URL). Around 5–20 seconds of speech without background music works best.
Set custom_voice_id
- Choose a new, descriptive ID (for example: Alice-001).
- This ID must be unique across your account.
- If you reuse an existing ID when creating a new clone, the request will fail with a “voice clone voice id duplicate” error.
Select the speech model
- For example, speech-02-hd.
Enter the output text
- In the text field, type what you want the cloned voice to say.
- Example: “Hello! Welcome to WaveSpeedAI. This is a preview of your cloned voice.”
Run the job
- After it finishes, you can replay and download the audio.
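
The steps above can be sketched as a request payload. This is a minimal illustration, not the official client: the keys audio, custom_voice_id, and text mirror the field names mentioned in the steps, while the "model" key, the helper names, and the example URL are assumptions made for this sketch. The duplicate-ID check simply looks for the error text quoted above.

```python
import json

# Error text quoted in the steps above for a reused custom_voice_id.
DUPLICATE_ID_ERROR = "voice clone voice id duplicate"


def build_clone_request(audio_url: str, custom_voice_id: str, text: str,
                        model: str = "speech-02-hd") -> dict:
    """Assemble the form fields into a JSON-ready dict.

    The "model" key name is an assumption; the other keys mirror the
    field names shown in the steps.
    """
    if not audio_url.strip():
        raise ValueError("audio must be an uploaded clip or a direct URL")
    if not custom_voice_id.strip():
        raise ValueError("custom_voice_id must not be empty")
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "audio": audio_url,
        "custom_voice_id": custom_voice_id,
        "model": model,
        "text": text,
    }


def is_duplicate_voice_id_error(message: str) -> bool:
    """Return True if an error message reports an already-used voice ID."""
    return DUPLICATE_ID_ERROR in message.lower()


payload = build_clone_request(
    audio_url="https://example.com/reference.wav",  # placeholder URL
    custom_voice_id="Alice-001",
    text="Hello! Welcome to WaveSpeedAI. This is a preview of your cloned voice.",
)
print(json.dumps(payload, indent=2))
```

If a run fails and is_duplicate_voice_id_error returns True for the error message, pick a new custom_voice_id and resubmit.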

Optional: Enable enhancements

Turn on need_noise_reduction if your reference audio has background noise.
Turn on need_volume_normalization to even out volume differences.
Adjust the accuracy slider if available: higher values make cloning closer to the reference, lower values make it more forgiving to noisy audio.
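
As a rough sketch, the optional enhancements can be layered onto a request like this. The need_noise_reduction and need_volume_normalization keys come from the options above; the "accuracy" key name and its numeric scale are assumptions, since this page only mentions a slider, so the value is passed through without range checks.

```python
from typing import Optional


def with_enhancements(payload: dict,
                      noise_reduction: bool = False,
                      volume_normalization: bool = False,
                      accuracy: Optional[float] = None) -> dict:
    """Return a copy of payload with the optional enhancement fields set."""
    enhanced = dict(payload)  # leave the caller's dict untouched
    enhanced["need_noise_reduction"] = noise_reduction
    enhanced["need_volume_normalization"] = volume_normalization
    if accuracy is not None:
        # Assumed key name and scale; only sent when explicitly provided.
        enhanced["accuracy"] = accuracy
    return enhanced


base = {"custom_voice_id": "Alice-001", "text": "Hello!"}
noisy_clip = with_enhancements(base, noise_reduction=True,
                               volume_normalization=True)
print(noisy_clip)
```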

The custom_voice_id you used is now available for reuse in the supported MiniMax speech models.

Price

Just $0.50 per run!

Supported Speech Models on WaveSpeedAI

Your cloned voice IDs can be used directly with the following MiniMax speech models on WaveSpeedAI:

minimax/speech-02-hd – high-definition, studio-quality TTS
minimax/speech-02-turbo – low-latency version for real-time use
minimax/speech-2.6-hd – next-gen HD model with improved realism and 40+ languages
minimax/speech-2.6-turbo – ultra-low-latency model for interactive agents
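
A small guard like the following can catch a mistyped model ID before a request is sent. The model IDs are the four listed above; the build_tts_request helper and the "voice_id" and "text" key names are hypothetical and only illustrate reusing a cloned voice ID.

```python
SUPPORTED_MODELS = (
    "minimax/speech-02-hd",
    "minimax/speech-02-turbo",
    "minimax/speech-2.6-hd",
    "minimax/speech-2.6-turbo",
)


def build_tts_request(model: str, voice_id: str, text: str) -> dict:
    """Build a speech request for a cloned voice, rejecting unknown models."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    # Key names below are assumptions for illustration.
    return {"model": model, "voice_id": voice_id, "text": text}


request = build_tts_request("minimax/speech-2.6-turbo", "Alice-001",
                            "Thanks for calling. How can I help?")
print(request)
```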

Voice ID Persistence (Important)

To keep your cloned voice reusable in the long term:

Any new voice ID must be used at least once with one of the MiniMax speech models above (02 HD/Turbo or 2.6 HD/Turbo).
If a voice ID is created but never used in a speech generation request, WaveSpeedAI retains it for only 7 days. After those 7 days without use, the ID and its associated embedding are deleted and can no longer be called from our API.
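
The retention rule can be expressed as a small date calculation. This sketch assumes the 7-day window starts at the moment the voice ID is created, which the text implies but does not state exactly; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_WINDOW = timedelta(days=7)


def deletion_deadline(created_at: datetime,
                      used_once: bool) -> Optional[datetime]:
    """Return when an unused voice ID would be deleted, or None if it
    has already been used in a speech generation request."""
    if used_once:
        return None
    # Assumption: the window is counted from the creation time.
    return created_at + RETENTION_WINDOW


created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(deletion_deadline(created, used_once=False))
print(deletion_deadline(created, used_once=True))
```

In practice, running one short generation with a new voice ID right after cloning avoids the deadline altogether.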