WaveSpeedAI VibeVoice

wavespeed-ai/vibevoice

wavespeed-ai/vibevoice is an advanced voice generation model that produces high-fidelity, natural, and expressive speech from text, with optional speaker/region-style control for more precise results and easy integration into real-world applications. It is exposed as a ready-to-use REST inference API with no cold starts and affordable per-run pricing.


Your request will cost $0.015 per run.

With $1 you can run this model approximately 66 times.



WaveSpeedAI VibeVoice Text-to-Audio

VibeVoice is a long-form text-to-speech (TTS) model designed to generate natural, podcast-like speech from transcripts, including multi-speaker conversations. It's built to stay coherent over long scripts while keeping each speaker's voice and speaking style consistent.

VibeVoice is most useful when you need dialogue, narration, or episode-length scripts rendered as speech. For background on the underlying model family, see the VibeVoice Technical Report and the Microsoft VibeVoice project page.

Key capabilities

  • Long-form speech generation: Handles extended transcripts (up to ~90 minutes in the long-form variant), useful for podcasts, audiobooks, and lecture-style narration.

  • Multi-speaker dialogue in one request: Supports up to 4 speakers in a single generation, making it well-suited for interviews, panel discussions, and scripted conversations.

  • Consistent speaker identity across long scripts: Designed to preserve each speaker's "voice" and conversational flow over long context windows.

  • Natural pacing and conversational delivery: Optimized for dialogue-like speech (turn-taking, pauses, and rhythm) rather than robotic, sentence-by-sentence readouts.

  • Model-family support for low-latency streaming (variant-dependent): Some VibeVoice releases include a real-time streaming model optimized for fast first audio output; availability depends on the specific deployment/variant.

Parameters and how to use

  • text: (required) The transcript you want VibeVoice to speak.
  • speaker: Select a built-in voice (if exposed by this wrapper).
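These parameters map onto a JSON request body submitted to the REST API. The sketch below only builds the request; the endpoint path, auth header, and the optional `speaker` field are assumptions — confirm them against the Playground schema and API reference before sending:

```python
import json
import urllib.request

# Assumed endpoint and placeholder key; check the WaveSpeedAI API reference.
API_URL = "https://api.wavespeed.ai/api/v3/wavespeed-ai/vibevoice"
API_KEY = "YOUR_API_KEY"

payload = {
    "text": "S1: Welcome back. Today we're talking about shipping fast.",
    # "speaker": "some-voice-id",  # optional, only if the wrapper exposes it
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(request)  # uncomment to actually submit
```

The commented-out `urlopen` call is left disabled so you can inspect the payload first; the Playground shows the authoritative field names for your deployment.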

Prompt

VibeVoice works best when your text looks like a real script:

  • Write it like a transcript, not a paragraph. Use short utterances, turn-taking, and punctuation that reflects how you want it spoken.

  • For multi-speaker dialogue, tag speakers clearly. Common patterns include speaker tags like S1:, S2:, etc. If your wrapper expects a specific tag format (for example [S1] / [S2]), follow what the Playground examples show.

  • Keep overlap out of the script. If two speakers talk over each other in the transcript, the model may flatten it into a single line or produce unstable timing.

  • Use lightweight direction cues sparingly. Short cues like (pause) or (laughs) may help with delivery, but results vary by model variant and deployment.

Example (single request, multi-speaker style):

S1: Welcome back. Today we're talking about shipping fast without breaking trust.
S2: The trick is to be explicit about trade-offs—especially in the UI.
S1: Let's start with a real example.
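A script in this shape can also be assembled programmatically. A minimal helper, assuming the `S1:`/`S2:` tag convention shown above (your deployment's expected format may differ):

```python
def build_transcript(turns):
    """Render (speaker_number, line) pairs as an S1:/S2:-tagged script."""
    return "\n".join(f"S{speaker}: {line}" for speaker, line in turns)

turns = [
    (1, "Welcome back. Today we're talking about shipping fast without breaking trust."),
    (2, "The trick is to be explicit about trade-offs, especially in the UI."),
    (1, "Let's start with a real example."),
]
script = build_transcript(turns)
print(script)
```

Keeping turns as structured data makes it easy to validate speaker counts (at most 4) before submitting.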

Other parameters

  • speaker: If the wrapper exposes voice_id, pick one of the available built-in voices. For multi-speaker scripts, some deployments may apply fixed voices automatically; others may expose multiple voice selectors. Prefer what the Playground/UI schema provides.

After you finish configuring the parameters, click Run, preview the result, and iterate if needed.

Pricing

Minimum Pricing: $0.015 per run

Pricing is defined in this model's WaveSpeedAI configuration and is shown in the Playground cost preview before you run.
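The cost preview arithmetic is straightforward: divide your budget by the per-run price and round down. A sketch:

```python
PRICE_PER_RUN = 0.015  # USD per run, from the pricing above

def runs_for_budget(budget_usd, price_per_run=PRICE_PER_RUN):
    """Whole runs affordable within a given budget."""
    return int(budget_usd // price_per_run)

print(runs_for_budget(1.00))  # -> 66, matching the "about 66 runs per $1" estimate
```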

Notes

  • Best-effort language support varies by release. Many VibeVoice releases focus on English and Chinese; other languages may work inconsistently depending on the deployed speaker set.

  • Plan for long scripts. If you're generating a full episode, structure the transcript with clear segments (intro → sections → outro). If you hit instability, split into multiple runs and stitch audio in post.

  • Use responsibly. High-quality speech synthesis can be misused for impersonation or deceptive content. Only generate voices you have the rights and consent to use, and disclose AI-generated audio where appropriate.
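For the "split into multiple runs" approach, one simple strategy is to author segment boundaries as blank lines in the transcript, then pack segments into chunks under a length cap and submit each chunk as its own run. A sketch (the 4000-character cap is an arbitrary assumption, not a documented limit):

```python
def split_transcript(transcript, max_chars=4000):
    """Split a blank-line-delimited transcript into chunks under max_chars."""
    segments = [s.strip() for s in transcript.split("\n\n") if s.strip()]
    chunks, current = [], ""
    for seg in segments:
        candidate = (current + "\n\n" + seg) if current else seg
        if len(candidate) > max_chars and current:
            chunks.append(current)  # close the current chunk, start a new one
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

episode = "S1: Intro line.\n\nS1: Section one.\nS2: Reply.\n\nS1: Outro."
print(split_transcript(episode, max_chars=30))
```

Splitting on authored segment boundaries (rather than mid-dialogue) keeps each run coherent, which makes the stitched audio sound more natural.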
