
Microsoft VibeVoice

The Microsoft VibeVoice text-to-speech model generates long-form speech from text with multi-speaker dialogue support. Choose from 9 voice presets across English, Chinese, and Hindi. Ready-to-use REST inference API, strong performance, no cold starts, affordable pricing.

Features

Microsoft VibeVoice

Microsoft VibeVoice is an advanced multi-speaker text-to-speech model that generates natural conversations between up to 4 speakers. Assign a different voice to each speaker in your script, and the model produces realistic dialogue with natural turn-taking and expression.


Why Choose This?

  • Multi-speaker conversations: support for up to 4 distinct speakers in a single generation.

  • Natural dialogue: realistic turn-taking and conversational flow between speakers.

  • Multilingual voices: 9 preset voices across English, Chinese, and Hindi.

  • Expression control: adjust voice expressiveness with the scale parameter.

  • Prompt Enhancer: a built-in tool that automatically improves your scripts.


Parameters

| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | Conversation script with speaker labels |
| speaker_1 | No | Voice for Speaker 0 (default: en-Alice_woman) |
| speaker_2 | No | Voice for Speaker 1 |
| speaker_3 | No | Voice for Speaker 2 |
| speaker_4 | No | Voice for Speaker 3 |
| scale | No | Voice expressiveness (default: 1.3) |

Available Voices

| Voice | Language | Gender |
|---|---|---|
| en-Alice_woman | English | Female |
| en-Carter_man | English | Male |
| en-Frank_man | English | Male |
| en-Mary_woman_bgm | English | Female |
| en-Maya_woman | English | Female |
| in-Samuel_man | Hindi | Male |
| zh-Anchen_man_bgm | Chinese | Male |
| zh-Bowen_man | Chinese | Male |
| zh-Xinran_woman | Chinese | Female |

Prompt Format

Write conversations using speaker labels. Each line starts with “Speaker N:” followed by the dialogue:

Speaker 0: Hey, have you tried the new VibeVoice model on WaveSpeedAI yet?
Speaker 1: Not yet! What’s so special about it?
Speaker 0: It can generate really natural multi-speaker conversations like this one.

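As a sketch, the script format above can be assembled programmatically. The `build_script` helper below is hypothetical (not part of any WaveSpeedAI SDK); it simply joins `(speaker_index, line)` pairs into the `Speaker N:` format:

```python
# Hypothetical helper: build a VibeVoice-style script from
# (speaker_index, line) pairs. Speaker indices are 0-based.
def build_script(turns):
    """Join dialogue turns into 'Speaker N: text' lines."""
    return "\n".join(f"Speaker {n}: {text}" for n, text in turns)

script = build_script([
    (0, "Hey, have you tried the new VibeVoice model on WaveSpeedAI yet?"),
    (1, "Not yet! What's so special about it?"),
])
```

The resulting string can be passed directly as the `prompt` parameter.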

How to Use

  1. Write your script — create dialogue with Speaker 0, 1, 2, 3 labels.
  2. Assign voices — select a voice for each speaker.
  3. Adjust scale (optional) — increase for more expressive delivery, decrease for calmer tone.
  4. Run — submit and download your generated conversation.
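Steps 1–3 above amount to building a request body. The sketch below shows one way to do it; the field names (`prompt`, `speaker_1`…`speaker_4`, `scale`) come from the parameter table, while `make_payload` itself is a hypothetical helper:

```python
# Hypothetical helper: map a prompt, an ordered list of voices, and a
# scale value into the request-body fields. Voices are assigned in
# order: the first voice becomes speaker_1 (Speaker 0), and so on.
def make_payload(prompt, voices=None, scale=1.3):
    """Build a request body, clamping scale to the documented 1-2 range."""
    payload = {"prompt": prompt, "scale": min(max(scale, 1.0), 2.0)}
    for i, voice in enumerate(voices or [], start=1):
        if i > 4:
            raise ValueError("at most 4 speakers per generation")
        payload[f"speaker_{i}"] = voice
    return payload

payload = make_payload(
    "Speaker 0: Hello!\nSpeaker 1: Hi there!",
    voices=["en-Alice_woman", "en-Carter_man"],
)
```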

Pricing

| Output | Cost |
|---|---|
| Per generation | $0.12 |

Best Use Cases

  • Podcast Production — Generate multi-speaker podcast episodes.
  • Dialogue Prototyping — Preview conversational scripts before recording.
  • Audiobook Narration — Create multi-character dialogue scenes.
  • Language Learning — Produce natural conversation samples in multiple languages.
  • Video Voiceover — Generate dialogue tracks for video content.

Pro Tips

  • Use Speaker 0, 1, 2, 3 to label up to 4 different characters.
  • Mix male and female voices for more natural conversations.
  • Voices with “_bgm” suffix include background music.
  • Increase scale above 1.3 for more dramatic delivery, lower for neutral tone.
  • Combine English and Chinese speakers for bilingual conversations.

Notes

  • Only prompt is required; speaker voices default if not specified.
  • Speaker labels must use numbers 0-3: Speaker 0 maps to speaker_1, Speaker 1 to speaker_2, and so on.
  • Maximum 4 speakers per generation.
  • Use the Prompt Enhancer to improve script quality.
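A quick pre-flight check can catch the label constraints in the notes above before submitting. This is an illustrative sketch, not an official validator; `validate_script` is a hypothetical name:

```python
import re

# Hypothetical pre-flight check: speaker labels must be in 0-3,
# which also bounds a script to at most 4 speakers.
def validate_script(script):
    """Return the sorted speaker indices used, or raise ValueError."""
    labels = {int(m) for m in re.findall(r"^Speaker (\d+):", script, re.MULTILINE)}
    if not labels:
        raise ValueError("no 'Speaker N:' labels found")
    if any(n > 3 for n in labels):
        raise ValueError("speaker labels must be in the range 0-3")
    return sorted(labels)
```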

Authentication

For authentication details, please refer to the Authentication Guide.

API Endpoints

Submit Task & Query Result


# Submit the task
curl --location --request POST "https://api.wavespeed.ai/api/v3/microsoft/vibevoice" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "prompt": "Speaker 0: Hello from VibeVoice! Speaker 1: Hi there!",
    "speaker_1": "en-Alice_woman",
    "scale": 1.3
}'

# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
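The two curl calls above can be combined into a submit-and-poll loop. The sketch below uses only the Python standard library; the endpoint paths mirror the curl examples, `WAVESPEED_API_KEY` is read from the environment, and the polling interval is an arbitrary choice:

```python
import json
import os
import time
import urllib.request

API = "https://api.wavespeed.ai/api/v3"

def submit_url():
    return f"{API}/microsoft/vibevoice"

def result_url(request_id):
    return f"{API}/predictions/{request_id}/result"

def _call(url, payload=None):
    """POST payload as JSON if given, otherwise GET; return parsed JSON."""
    headers = {"Authorization": f"Bearer {os.environ['WAVESPEED_API_KEY']}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode()
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def generate(prompt, **params):
    """Submit a task, then poll once per second until it finishes."""
    task = _call(submit_url(), {"prompt": prompt, **params})["data"]
    while True:
        data = _call(result_url(task["id"]))["data"]
        if data["status"] in ("completed", "failed"):
            return data
        time.sleep(1)
```

A call like `generate("Speaker 0: Hello!", speaker_1="en-Alice_woman")` would return the final prediction object, with audio URLs in its `outputs` field on success.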

Parameters

Task Submission Parameters

Request Parameters

| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| prompt | string | Yes | - | - | Text to convert to speech. For multi-speaker dialogue, use 'Speaker 0:', 'Speaker 1:' prefixes. |
| speaker_1 | string | No | en-Alice_woman | Any preset voice (see Available Voices) | Voice for Speaker 0. |
| speaker_2 | string | No | - | Any preset voice (see Available Voices) | Voice for Speaker 1 (optional). |
| speaker_3 | string | No | - | Any preset voice (see Available Voices) | Voice for Speaker 2 (optional). |
| speaker_4 | string | No | - | Any preset voice (see Available Voices) | Voice for Speaker 3 (optional). |
| scale | number | No | 1.3 | 1 ~ 2 | CFG scale (guidance strength). |

Response Parameters

| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., "success") |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., "2023-04-01T12:34:56.789Z") |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |

Result Request Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | Yes | - | Task ID |

Result Response Parameters

| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., "success") |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated audio |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., "2023-04-01T12:34:56.789Z") |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
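As a sketch of consuming a result shaped like the fields above, the snippet below pulls out the values a client typically needs. The sample response is illustrative only, not captured from the API:

```python
# Illustrative sample response matching the documented field layout.
response = {
    "code": 200,
    "message": "success",
    "data": {
        "id": "task-123",
        "status": "completed",
        "outputs": ["https://example.com/audio.wav"],
        "error": "",
        "timings": {"inference": 4200},
    },
}

data = response["data"]
if response["code"] == 200 and data["status"] == "completed":
    audio_urls = data["outputs"]              # URLs to the generated audio
    inference_ms = data["timings"]["inference"]  # inference time in ms
elif data["status"] == "failed":
    raise RuntimeError(data["error"])
```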
© 2025 WaveSpeedAI. All rights reserved.