
Each run of this model costs $0.015, so $1 covers approximately 66 runs.
VibeVoice is a long-form text-to-speech (TTS) model designed to generate natural, podcast-like speech from transcripts, including multi-speaker conversations. It’s built to stay coherent over long scripts while keeping each speaker’s voice and speaking style consistent.
VibeVoice is most useful when you need dialogue, narration, or episode-length scripts rendered as speech. For background on the underlying model family, see the VibeVoice Technical Report and the Microsoft VibeVoice project page.
Long-form speech generation: handles extended transcripts (up to ~90 minutes in the long-form variant), useful for podcasts, audiobooks, and lecture-style narration.
Multi-speaker dialogue in one request: supports up to 4 speakers in a single generation, making it well suited for interviews, panel discussions, and scripted conversations.
Consistent speaker identity across long scripts: designed to preserve each speaker’s “voice” and conversational flow over long context windows.
Natural pacing and conversational delivery: optimized for dialogue-like speech (turn-taking, pauses, and rhythm) rather than robotic, sentence-by-sentence readouts.
Model-family support for low-latency streaming (variant-dependent): some VibeVoice releases include a real-time streaming model optimized for fast first audio output; availability depends on the specific deployment and variant.
VibeVoice works best when your text looks like a real script:
Write it like a transcript, not a paragraph. Use short utterances, turn-taking, and punctuation that reflects how you want it spoken.
For multi-speaker dialogue, tag speakers clearly.
Common patterns include speaker tags like S1:, S2:, etc. If your wrapper expects a specific tag format (for example [S1] / [S2]), follow what the Playground examples show.
Keep overlap out of the script. If two speakers talk over each other in the transcript, the model may flatten it into a single line or produce unstable timing.
Use lightweight direction cues sparingly.
Short cues like (pause) or (laughs) may help with delivery, but results vary by model variant and deployment.
Example (single request, multi-speaker style):
S1: Welcome back. Today we’re talking about shipping fast without breaking trust.
S2: The trick is to be explicit about trade-offs—especially in the UI.
S1: Let’s start with a real example.
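If you assemble scripts programmatically, keeping speaker turns as structured data and rendering the tags at the last step makes the format easy to change. A minimal Python sketch, assuming the S1:/S2: tag style shown above (verify the exact tag format against your deployment’s Playground examples):

turns = [
    ("S1", "Welcome back. Today we're talking about shipping fast without breaking trust."),
    ("S2", "The trick is to be explicit about trade-offs, especially in the UI."),
    ("S1", "Let's start with a real example."),
]

# Render one tagged utterance per line; swap the f-string if your
# deployment expects a different tag format such as [S1].
transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
print(transcript)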
voice_id
If the wrapper exposes voice_id, pick one of the available built-in voices. For multi-speaker scripts, some deployments may apply fixed voices automatically; others may expose multiple voice selectors. Prefer what the Playground/UI schema provides.
format
Choose the audio format your pipeline needs.
sample_rate
Higher sample rates generally preserve more detail. If you’re unsure, use the wrapper’s default.
seed
Set a fixed seed when you want repeatable output across runs (useful for iteration and QA).
enable_sync_mode
When set to true, the API waits for the output file to be fully generated and uploaded before returning. Turn this on when you want a single request to block until the final audio is ready; leave it off if you prefer async workflows.
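To see how these parameters fit together, here is a minimal Python sketch of a synchronous request. The endpoint URL and exact field names below are placeholders, not the confirmed WaveSpeedAI API; follow the request format shown in this model’s API docs and Playground:

import requests

# Placeholder endpoint: substitute this model's real WaveSpeedAI route.
url = "https://api.example.com/v1/vibevoice"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    # Transcript with speaker tags, written like a real script.
    "text": "S1: Welcome back.\nS2: Glad to be here.",
    "voice_id": "default",      # pick from the voices your deployment exposes
    "format": "wav",            # choose the format your pipeline needs
    "sample_rate": 24000,       # if unsure, use the wrapper's default
    "seed": 42,                 # fixed seed for repeatable output across runs
    "enable_sync_mode": True,   # block until the final audio file is uploaded
}

resp = requests.post(url, json=payload, headers=headers, timeout=600)
resp.raise_for_status()
print(resp.json())  # with sync mode on, this should reference the finished audio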
After you finish configuring the parameters, click Run, preview the result, and iterate if needed.
Pricing is defined in this model’s WaveSpeedAI configuration and is shown in the Playground cost preview before you run.
Language support is best-effort and varies by release. Many VibeVoice releases focus on English and Chinese; other languages may work inconsistently depending on the deployed speaker set.
Plan for long scripts. If you’re generating a full episode, structure the transcript with clear segments (intro → sections → outro). If you hit instability, split into multiple runs and stitch audio in post.
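If you split an episode across runs, the WAV segments can be stitched in post. A minimal sketch using Python’s standard wave module, assuming every segment was generated with the same format, sample rate, and channel count (the filenames are placeholders):

import wave

segments = ["intro.wav", "section1.wav", "outro.wav"]  # your per-run outputs

# Copy the audio parameters (sample rate, channels, sample width)
# from the first segment.
with wave.open(segments[0], "rb") as first:
    params = first.getparams()

with wave.open("episode.wav", "wb") as out:
    out.setparams(params)
    for path in segments:
        with wave.open(path, "rb") as seg:
            # Assumes every segment shares the same format; resample first if not.
            out.writeframes(seg.readframes(seg.getnframes()))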
Use responsibly. High-quality speech synthesis can be misused for impersonation or deceptive content. Only generate voices you have the rights and consent to use, and disclose AI-generated audio where appropriate.