WaveSpeed AI Vibevoice | Realistic Voice & TTS API

WaveSpeedAI VibeVoice Text-to-Audio

VibeVoice is a long-form text-to-speech (TTS) model designed to generate natural, podcast-like speech from transcripts, including multi-speaker conversations. It's built to stay coherent over long scripts while keeping each speaker's voice and speaking style consistent.

VibeVoice is most useful when you need dialogue, narration, or episode-length scripts rendered as speech. For background on the underlying model family, see the VibeVoice Technical Report and the Microsoft VibeVoice project page.

Key capabilities

Long-form speech generation: Handles extended transcripts (up to ~90 minutes in the long-form variant), useful for podcasts, audiobooks, and lecture-style narration.
Multi-speaker dialogue in one request: Supports up to 4 speakers in a single generation, making it well-suited for interviews, panel discussions, and scripted conversations.
Consistent speaker identity across long scripts: Designed to preserve each speaker's "voice" and conversational flow over long context windows.
Natural pacing and conversational delivery: Optimized for dialogue-like speech (turn-taking, pauses, and rhythm) rather than robotic, sentence-by-sentence readouts.
Model-family support for low-latency streaming (variant-dependent): Some VibeVoice releases include a real-time streaming model optimized for fast first audio output; availability depends on the specific deployment/variant.

Parameters and how to use

text: (required) The transcript you want VibeVoice to speak.
speaker: Select a built-in voice (if exposed by this wrapper).

Prompt

VibeVoice works best when your text looks like a real script:

Write it like a transcript, not a paragraph. Use short utterances, turn-taking, and punctuation that reflects how you want it spoken.
For multi-speaker dialogue, tag speakers clearly. Common patterns include speaker tags like S1:, S2:, etc. If your wrapper expects a specific tag format (for example [S1] / [S2]), follow what the Playground examples show.
Keep overlap out of the script. If two speakers talk over each other in the transcript, the model may flatten it into a single line or produce unstable timing.
Use lightweight direction cues sparingly. Short cues like (pause) or (laughs) may help with delivery, but results vary by model variant and deployment.

Example (single request, multi-speaker style):

S1: Welcome back. Today we're talking about shipping fast without breaking trust.
S2: The trick is to be explicit about trade-offs—especially in the UI.
S1: Let's start with a real example.
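
A script in this shape can also be assembled programmatically. The sketch below is illustrative only: build_script is a hypothetical helper, not part of any WaveSpeedAI SDK, and it assumes the S1: / S2: tag style shown above (adjust the tag if your deployment expects something like [S1]). It also enforces the 4-speaker limit from the capability list.

```python
from typing import Iterable, Tuple

MAX_SPEAKERS = 4  # limit stated in the capability list; verify for your deployment


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Render (speaker_number, utterance) pairs as 'S<n>: text' lines.

    Hypothetical helper: the 'S<n>:' tag format is an assumption taken from
    the example transcript, not a confirmed API requirement.
    """
    lines = []
    speakers = set()
    for speaker, utterance in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        text = " ".join(utterance.split())  # collapse newlines and repeated spaces
        if not text:
            continue  # skip empty turns instead of emitting a bare tag
        speakers.add(speaker)
        if len(speakers) > MAX_SPEAKERS:
            raise ValueError(f"more than {MAX_SPEAKERS} speakers in one script")
        lines.append(f"S{speaker}: {text}")
    return "\n".join(lines)


script = build_script([
    (1, "Welcome back."),
    (2, "Glad to be here."),
    (1, "Let's start with a real example."),
])
print(script)
```

Keeping one utterance per line also keeps overlapping speech out of the transcript, since each turn belongs to exactly one speaker.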

Other parameters

speaker: Pick one of the available built-in voices (the API examples on this page use "Frank"). For multi-speaker scripts, some deployments may apply fixed voices automatically; others may expose multiple voice selectors. Prefer what the Playground/UI schema provides.

After you finish configuring the parameters, click Run, preview the result, and iterate if needed.

Pricing

Minimum Pricing: $0.015 per run

Pricing is defined in this model's WaveSpeedAI configuration and is shown in the Playground cost preview before you run.

Notes

Best-effort language support varies by release. Many VibeVoice releases focus on English and Chinese; other languages may work inconsistently depending on the deployed speaker set.
Plan for long scripts. If you're generating a full episode, structure the transcript with clear segments (intro → sections → outro). If you hit instability, split into multiple runs and stitch audio in post.
Use responsibly. High-quality speech synthesis can be misused for impersonation or deceptive content. Only generate voices you have the rights and consent to use, and disclose AI-generated audio where appropriate.
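
For the split-and-stitch approach mentioned in the notes above, one possible way to cut a long transcript is on turn boundaries, so that no utterance is divided between runs. This is a minimal sketch: split_script and the max_chars budget are hypothetical, and this page does not document an actual length limit, so choose the budget by experiment.

```python
from typing import List


def split_script(script: str, max_chars: int) -> List[str]:
    """Group whole lines (speaker turns) into chunks of at most max_chars.

    A single line longer than max_chars becomes its own oversized chunk
    rather than being cut mid-utterance.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between turns
        added = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + added > max_chars:
            chunks.append("\n".join(current))  # close the current chunk
            current, size = [], 0
            added = len(line)
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks


demo = "S1: aaaa\nS2: bbbb\nS1: cccc"
print(split_script(demo, max_chars=17))
```

Each chunk can then be submitted as its own run and the resulting audio files concatenated in an audio editor or similar tool.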

Related Models

Qwen3 TTS Flash – Low-latency TTS for short-form, real-time dialogue experiences.
MiniMax Speech-02-HD – High-definition narration with controllable delivery (speed/volume/pitch).
ElevenLabs Turbo V2.5 – Fast, production-friendly TTS with a broad voice library.
MiniMax Voice Clone – Generate speech in a specific voice using a short reference clip (voice cloning).

Vibevoice API — Quick start

Grab a WaveSpeedAI API key, then call POST https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice with your input as JSON. The endpoint returns a prediction ID; poll the prediction endpoint until status is "completed", then read the output URL from data.outputs[0]. The examples below show this flow for Vibevoice.

HTTP example

# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "text": "Welcome back. Today we are talking about shipping fast without breaking trust.",
    "speaker": "Frank"
  }'

# Response includes a prediction id. Poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].

Node.js example

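// npm install wavespeed
const WaveSpeed = require('wavespeed');

const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

// CommonJS has no top-level await, so run the call inside an async function.
async function main() {
  const result = await client.run("wavespeed-ai/vibevoice", {
    "text": "Welcome back. Today we are talking about shipping fast without breaking trust.", // required
    "speaker": "Frank"
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch(console.error);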

Python example

# pip install wavespeed
import wavespeed

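output = wavespeed.run(
    "wavespeed-ai/vibevoice",
    {
        # "text" is the required transcript; "speaker" selects a built-in voice
        "text": "Welcome back. Today we are talking about shipping fast without breaking trust.",
        "speaker": "Frank",
    },
)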

print(output["outputs"][0])  # → URL of the generated output
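
If you prefer not to depend on the SDK, the submit-then-poll flow from the HTTP example can be sketched with the Python standard library. Several details here are assumptions rather than documented behavior: that the prediction ID is returned at data.id, that a failed run reports status "failed", and the 2-second interval and 300-second timeout defaults. Check the API reference for the real response shape before relying on it. The call parameter exists only so the HTTP layer can be swapped out in tests.

```python
import json
import os
import time
import urllib.request
from typing import Callable, Optional

API_BASE = "https://api.wavespeed.ai/api/v3"


def http_json(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send one authenticated JSON request and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Bearer {os.environ['WAVESPEED_API_KEY']}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def generate(
    text: str,
    speaker: str,
    call: Callable[..., dict] = http_json,
    interval: float = 2.0,    # assumed polling interval, not a documented value
    timeout: float = 300.0,   # assumed upper bound, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a prediction, poll until it completes, and return the output URL."""
    submitted = call(
        "POST",
        f"{API_BASE}/wavespeed-ai/vibevoice",
        {"text": text, "speaker": speaker},
    )
    request_id = submitted["data"]["id"]  # ASSUMPTION: ID location in the response
    deadline = time.monotonic() + timeout
    while True:
        result = call("GET", f"{API_BASE}/predictions/{request_id}/result")
        data = result["data"]
        if data["status"] == "completed":
            return data["outputs"][0]
        if data["status"] == "failed":  # ASSUMPTION: name of the failure status
            raise RuntimeError(f"prediction {request_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {request_id} not finished after {timeout}s")
        sleep(interval)
```

With WAVESPEED_API_KEY set in the environment, generate("Welcome back.", "Frank") would return the URL of the generated audio under these assumptions.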

Vibevoice API — Frequently asked questions

What is the Vibevoice API?

Vibevoice is a WaveSpeedAI text-to-speech model, exposed as a REST API on WaveSpeedAI. wavespeed-ai/vibevoice generates natural, expressive speech from text, with an optional speaker parameter for selecting a voice. WaveSpeedAI describes it as a ready-to-use REST inference API with no cold starts and affordable pricing. You can call it programmatically or try it from the playground above.

How do I call the Vibevoice API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/wavespeed-ai/vibevoice.

How much does Vibevoice cost per run?

Vibevoice starts at $0.015 per run. That figure is the base price — the final charge scales with the parameters you set in the form (output size, length, count, references, or whatever knobs this model exposes), so a higher-quality or larger output costs more than a minimal one. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Vibevoice accept?

Key inputs: `speaker`, `text`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/wavespeed-ai/vibevoice.

How long does Vibevoice take to generate?

Average end-to-end generation time on WaveSpeedAI is around 50 seconds per request — measured across recent runs. Queue time scales with global demand; live status is visible in the prediction record.

Can I use Vibevoice outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (WaveSpeedAI). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.
