text-to-audio

Microsoft VibeVoice

microsoft/vibevoice

The Microsoft VibeVoice text-to-speech model generates long-form speech from text, with support for multi-speaker dialogue. Choose from 9 voice presets across English, Chinese, and Hindi. It is available through a ready-to-use REST inference API with no cold starts.

Your request will cost $0.12 per run.

With $10 you can run this model approximately 83 times.

README

Microsoft VibeVoice

Microsoft VibeVoice is an advanced multi-speaker text-to-speech model that generates natural conversations between up to 4 speakers. Assign different voices to speakers in your script and the model produces realistic dialogue with natural turn-taking and expression.

Why Choose This?

  • Multi-speaker conversations: Support up to 4 distinct speakers in a single generation.
  • Natural dialogue: Realistic turn-taking and conversational flow between speakers.
  • Multilingual voices: 9 preset voices across English, Chinese, and Indian languages.
  • Expression control: Adjust voice expressiveness with the scale parameter.
  • Prompt Enhancer: Built-in tool to automatically improve your scripts.

Parameters

Parameter | Required | Description
prompt    | Yes      | Conversation script with speaker labels
speaker_1 | No       | Voice for Speaker 0 (default: en-Alice_woman)
speaker_2 | No       | Voice for Speaker 1
speaker_3 | No       | Voice for Speaker 2
speaker_4 | No       | Voice for Speaker 3
scale     | No       | Voice expressiveness (default: 1.3)
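
As an illustration of how these parameters fit together, here is a minimal sketch that assembles a request body as a plain dictionary. The helper name `build_payload` is hypothetical, and the choice to omit unset optional fields (so that the documented defaults apply) is an assumption, not something this page specifies.

```python
from typing import Dict, Optional, Union


def build_payload(
    prompt: str,
    speaker_1: Optional[str] = None,
    speaker_2: Optional[str] = None,
    speaker_3: Optional[str] = None,
    speaker_4: Optional[str] = None,
    scale: Optional[float] = None,
) -> Dict[str, Union[str, float]]:
    """Assemble a request body; only `prompt` is required.

    Optional fields left as None are omitted, on the assumption that the
    service then applies its defaults (en-Alice_woman for speaker_1,
    1.3 for scale, per the parameter table).
    """
    if not prompt.strip():
        raise ValueError("prompt is required")

    payload: Dict[str, Union[str, float]] = {"prompt": prompt}
    optional = {
        "speaker_1": speaker_1,
        "speaker_2": speaker_2,
        "speaker_3": speaker_3,
        "speaker_4": speaker_4,
        "scale": scale,
    }
    for key, value in optional.items():
        if value is not None:
            payload[key] = value
    return payload
```

For example, `build_payload("Speaker 1: Hi", speaker_2="en-Carter_man", scale=1.5)` yields a dictionary with exactly the keys `prompt`, `speaker_2`, and `scale`.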

Available Voices

Voice             | Language | Gender
en-Alice_woman    | English  | Female
en-Carter_man     | English  | Male
en-Frank_man      | English  | Male
en-Mary_woman_bgm | English  | Female
en-Maya_woman     | English  | Female
in-Samuel_man     | Indian   | Male
zh-Anchen_man_bgm | Chinese  | Male
zh-Bowen_man      | Chinese  | Male
zh-Xinran_woman   | Chinese  | Female
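
When choosing voices programmatically, the table above can be kept as a small lookup. The sketch below copies the nine voices from the table; the `Voice` type and the `find_voices` helper are hypothetical names introduced for illustration. The `include_bgm` filter relies on the "_bgm" suffix marking voices with background music.

```python
from typing import Dict, List, NamedTuple, Optional


class Voice(NamedTuple):
    language: str
    gender: str


# Copied from the Available Voices table.
VOICES: Dict[str, Voice] = {
    "en-Alice_woman": Voice("English", "Female"),
    "en-Carter_man": Voice("English", "Male"),
    "en-Frank_man": Voice("English", "Male"),
    "en-Mary_woman_bgm": Voice("English", "Female"),
    "en-Maya_woman": Voice("English", "Female"),
    "in-Samuel_man": Voice("Indian", "Male"),
    "zh-Anchen_man_bgm": Voice("Chinese", "Male"),
    "zh-Bowen_man": Voice("Chinese", "Male"),
    "zh-Xinran_woman": Voice("Chinese", "Female"),
}


def find_voices(
    language: Optional[str] = None,
    gender: Optional[str] = None,
    include_bgm: bool = True,
) -> List[str]:
    """Return voice names matching the filters, in table order."""
    matches = []
    for name, voice in VOICES.items():
        if language is not None and voice.language != language:
            continue
        if gender is not None and voice.gender != gender:
            continue
        if not include_bgm and name.endswith("_bgm"):
            continue
        matches.append(name)
    return matches
```

Calling `find_voices(language="English", gender="Female", include_bgm=False)` returns `["en-Alice_woman", "en-Maya_woman"]`.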

Prompt Format

Write conversations using speaker labels. Each line starts with "Speaker N:" followed by the dialogue:

Speaker 1: Hey, have you tried the new VibeVoice model on WaveSpeedAI yet?
Speaker 2: Not yet! What's so special about it?
Speaker 1: It can generate really natural multi-speaker conversations like this one.
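
A script in this format can be generated from structured data. The following is a minimal sketch, assuming one turn per line separated by newlines; `format_script` is a hypothetical helper, not part of any published SDK.

```python
from typing import Iterable, Tuple


def format_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Render (speaker_number, text) pairs as 'Speaker N: text' lines."""
    lines = []
    for speaker, text in turns:
        # Collapse internal whitespace and newlines so each turn stays on one line.
        text = " ".join(text.split())
        if not text:
            raise ValueError("empty dialogue turn")
        lines.append(f"Speaker {speaker}: {text}")
    return "\n".join(lines)


script = format_script([
    (1, "Hey, have you tried the new VibeVoice model yet?"),
    (2, "Not yet! What's so special about it?"),
])
print(script)
```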

How to Use

  1. Write your script — create dialogue with Speaker 1, 2, 3, 4 labels.
  2. Assign voices — select a voice for each speaker.
  3. Adjust scale (optional) — increase for more expressive delivery, decrease for calmer tone.
  4. Run — submit and download your generated conversation.
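
The steps above can also be carried out over the REST API. This page does not document the endpoint URL, the authentication scheme, or the response format, so the sketch below uses a placeholder URL and assumes bearer-token authentication with a JSON body; both are assumptions to be checked against the provider's API documentation. It only constructs the request and does not send it.

```python
import json
import urllib.request

# Placeholder endpoint (assumption): replace with the URL from the API docs.
API_URL = "https://api.example.com/microsoft/vibevoice"


def make_request(payload: dict, api_key: str, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one generation."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; confirm against the provider's docs.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = make_request(
    {
        "prompt": "Speaker 1: Hello!\nSpeaker 2: Hi there.",
        "speaker_1": "en-Alice_woman",
        "speaker_2": "en-Carter_man",
        "scale": 1.3,
    },
    api_key="YOUR_API_KEY",
)
```

Passing `request` to `urllib.request.urlopen` would submit it. How the generated audio is returned is not described on this page, so response handling is left out.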

Pricing

Output         | Cost
Per generation | $0.12
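
Because pricing is flat per generation, budgeting is simple arithmetic. The sketch below uses `Decimal` to avoid floating-point rounding on currency; the function names are hypothetical.

```python
from decimal import Decimal

COST_PER_RUN = Decimal("0.12")  # flat price per generation, from the table above


def runs_for_budget(budget: str) -> int:
    """Number of whole generations a budget (in dollars) covers."""
    return int(Decimal(budget) // COST_PER_RUN)


def cost_of_runs(runs: int) -> Decimal:
    """Total cost in dollars for a given number of generations."""
    return COST_PER_RUN * runs


print(runs_for_budget("10"))  # 83
print(cost_of_runs(83))       # 9.96
```

A $10 budget therefore covers 83 runs, at a total of $9.96.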

Best Use Cases

  • Podcast Production — Generate multi-speaker podcast episodes.
  • Dialogue Prototyping — Preview conversational scripts before recording.
  • Audiobook Narration — Create multi-character dialogue scenes.
  • Language Learning — Produce natural conversation samples in multiple languages.
  • Video Voiceover — Generate dialogue tracks for video content.

Pro Tips

  • Use Speaker 1, 2, 3, 4 to label up to 4 different characters.
  • Mix male and female voices for more natural conversations.
  • Voices with the "_bgm" suffix include background music.
  • Increase scale above 1.3 for more dramatic delivery, or lower it for a more neutral tone.
  • Combine English and Chinese speakers for bilingual conversations.

Notes

  • Only prompt is required; speaker voices default if not specified.
  • Speaker labels map in order to speaker_1 through speaker_4. The parameter table numbers the speakers 0-3, while the example prompt and usage steps label them Speaker 1-4.
  • Maximum 4 speakers per generation.
  • Use the Prompt Enhancer to improve script quality.
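
The four-speaker limit can be checked locally before submitting a script. This is a minimal sketch that assumes every turn begins a new line with "Speaker N:"; `speaker_labels` and `check_script` are hypothetical helpers, and the check does not enforce any particular numbering scheme.

```python
import re
from typing import List

MAX_SPEAKERS = 4  # maximum speakers per generation, per the notes above

# Matches a "Speaker N:" label at the start of a line.
LABEL = re.compile(r"^Speaker (\d+):", re.MULTILINE)


def speaker_labels(script: str) -> List[int]:
    """Return distinct speaker numbers in order of first appearance."""
    seen: List[int] = []
    for match in LABEL.finditer(script):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def check_script(script: str) -> List[int]:
    """Raise ValueError if the script has no labels or too many speakers."""
    labels = speaker_labels(script)
    if not labels:
        raise ValueError("no 'Speaker N:' labels found")
    if len(labels) > MAX_SPEAKERS:
        raise ValueError(f"{len(labels)} speakers found; the maximum is {MAX_SPEAKERS}")
    return labels
```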