VibeVoice
Playground
Try it on WaveSpeedAI!
wavespeed-ai/vibevoice is an advanced voice generation model for producing high-fidelity, natural, and expressive speech from text, with optional speaker/region-style control for more precise results and easy integration into real-world applications.
Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.
Features
WaveSpeedAI VibeVoice Text-to-Audio
VibeVoice is a long-form text-to-speech (TTS) model designed to generate natural, podcast-like speech from transcripts, including multi-speaker conversations. It’s built to stay coherent over long scripts while keeping each speaker’s voice and speaking style consistent.
VibeVoice is most useful when you need dialogue, narration, or episode-length scripts rendered as speech. For background on the underlying model family, see the VibeVoice Technical Report and the Microsoft VibeVoice project page.
Key capabilities
- Long-form speech generation. Handles extended transcripts (up to ~90 minutes in the long-form variant), useful for podcasts, audiobooks, and lecture-style narration.
- Multi-speaker dialogue in one request. Supports up to 4 speakers in a single generation, making it well-suited for interviews, panel discussions, and scripted conversations.
- Consistent speaker identity across long scripts. Designed to preserve each speaker's “voice” and conversational flow over long context windows.
- Natural pacing and conversational delivery. Optimized for dialogue-like speech (turn-taking, pauses, and rhythm) rather than robotic, sentence-by-sentence readouts.
- Low-latency streaming support in the model family (variant-dependent). Some VibeVoice releases include a real-time streaming model optimized for fast first audio output; availability depends on the specific deployment/variant.
Parameters and how to use
- text: (required) The transcript you want VibeVoice to speak.
- speaker: Select a built-in voice (if exposed by this wrapper).
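For example, a minimal request body pairs these two parameters (values are illustrative; the available voice names are listed in the parameters table in the API section below):

```json
{
  "text": "Welcome back. Today we are talking about shipping fast.",
  "speaker": "Emma"
}
```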
Prompt
VibeVoice works best when your text looks like a real script:
- Write it like a transcript, not a paragraph. Use short utterances, turn-taking, and punctuation that reflects how you want it spoken.
- For multi-speaker dialogue, tag speakers clearly. Common patterns include speaker tags like S1:, S2:, etc. If your wrapper expects a specific tag format (for example [S1]/[S2]), follow what the Playground examples show.
- Keep overlap out of the script. If two speakers talk over each other in the transcript, the model may flatten it into a single line or produce unstable timing.
- Use lightweight direction cues sparingly. Short cues like (pause) or (laughs) may help with delivery, but results vary by model variant and deployment.
Example (single request, multi-speaker style):
S1: Welcome back. Today we're talking about shipping fast without breaking trust.
S2: The trick is to be explicit about trade-offs—especially in the UI.
S1: Let's start with a real example.
Other parameters
- speaker: If the wrapper exposes voice_id, pick one of the available built-in voices. For multi-speaker scripts, some deployments may apply fixed voices automatically; others may expose multiple voice selectors. Prefer what the Playground/UI schema provides.
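A sketch of submitting a tagged multi-speaker transcript through the API, assuming this deployment reads the S1:/S2: tags from text and assigns additional voices per its own rules (variant-dependent, as noted above); the transcript and voice name are illustrative:

```bash
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "text": "S1: Welcome back. Today we are talking about shipping fast.\nS2: The trick is to be explicit about trade-offs.",
    "speaker": "Frank"
}'
```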
After you finish configuring the parameters, click Run, preview the result, and iterate if needed.
Pricing
Minimum Pricing: $0.015 per run
Pricing is defined in this model’s WaveSpeedAI configuration and is shown in the Playground cost preview before you run.
Notes
- Best-effort language support varies by release. Many VibeVoice releases focus on English and Chinese; other languages may work inconsistently depending on the deployed speaker set.
- Plan for long scripts. If you're generating a full episode, structure the transcript with clear segments (intro → sections → outro). If you hit instability, split into multiple runs and stitch the audio in post (see the sketch after this list).
- Use responsibly. High-quality speech synthesis can be misused for impersonation or deceptive content. Only generate voices you have the rights and consent to use, and disclose AI-generated audio where appropriate.
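If you do split a long episode into multiple runs, one common way to stitch the segments is ffmpeg's concat demuxer. A minimal post-production sketch (file names are hypothetical; -c copy assumes all segments share the same codec and encoding parameters, otherwise re-encode):

```bash
# List the per-run output files in playback order
cat > segments.txt <<'EOF'
file 'intro.mp3'
file 'section1.mp3'
file 'outro.mp3'
EOF

# Concatenate without re-encoding
ffmpeg -f concat -safe 0 -i segments.txt -c copy episode.mp3
```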
Related Models
- Alibaba Qwen3 TTS Flash – Low-latency TTS for short-form, real-time dialogue experiences.
- MiniMax Speech-02-HD – High-definition narration with controllable delivery (speed/volume/pitch).
- ElevenLabs Turbo V2.5 – Fast, production-friendly TTS with a broad voice library.
- MiniMax Voice Clone – Generate speech in a specific voice using a short reference clip (voice cloning).
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
# Submit the task
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "text": "Welcome back. Today we are talking about shipping fast.",
    "speaker": "Frank"
}'
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
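Since tasks run asynchronously, a typical client submits the task, extracts data.id, and polls the result endpoint until data.status reaches completed or failed. A minimal sketch assuming the response shapes documented below (requires jq; the sample text is illustrative):

```bash
# Submit and capture the task ID
requestId=$(curl --silent --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{"text": "Hello from VibeVoice.", "speaker": "Frank"}' | jq -r '.data.id')

# Poll until the task finishes
while true; do
    result=$(curl --silent --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
    --header "Authorization: Bearer ${WAVESPED_API_KEY:-$WAVESPEED_API_KEY}")
    status=$(echo "$result" | jq -r '.data.status')
    if [ "$status" = "completed" ]; then
        echo "$result" | jq -r '.data.outputs[0]'   # URL of the generated audio
        break
    elif [ "$status" = "failed" ]; then
        echo "$result" | jq -r '.data.error' >&2
        break
    fi
    sleep 2
done
```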
Parameters
Task Submission Parameters
Request Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| text | string | Yes | - | - | Text to convert to speech |
| speaker | string | No | Frank | Frank, Wayne, Carter, Emma, Grace, Mike | Voice to use for speaking. |
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
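For reference, a freshly submitted task might return a response like the following (all values are illustrative placeholders; outputs and has_nsfw_contents are empty until the task completes, per the table above):

```json
{
  "code": 200,
  "message": "success",
  "data": {
    "id": "example-task-id",
    "model": "wavespeed-ai/vibevoice",
    "outputs": [],
    "urls": {
      "get": "https://api.wavespeed.ai/api/v3/predictions/example-task-id/result"
    },
    "has_nsfw_contents": [],
    "status": "created",
    "created_at": "2023-04-01T12:34:56.789Z",
    "error": "",
    "timings": {
      "inference": 0
    }
  }
}
```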
Result Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | Yes | - | Task ID |
Result Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier for the prediction (the task ID used to fetch the result) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
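For reference, a completed result might look like the following (all values are illustrative placeholders; the field shapes follow the table above):

```json
{
  "code": 200,
  "message": "success",
  "data": {
    "id": "example-task-id",
    "model": "wavespeed-ai/vibevoice",
    "outputs": ["https://example.com/generated-audio"],
    "urls": {
      "get": "https://api.wavespeed.ai/api/v3/predictions/example-task-id/result"
    },
    "status": "completed",
    "created_at": "2023-04-01T12:34:56.789Z",
    "error": "",
    "timings": {
      "inference": 1234
    }
  }
}
```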