Google Gemini 2.5 Pro Text To Speech
Google Gemini 2.5 Pro Text-to-Speech delivers natural multi-speaker voice synthesis with 30+ voices across 24 languages. It is well suited to dialogues, conversations, and multilingual content, and is available on WaveSpeedAI through a ready-to-use REST inference API with no cold starts and per-character pricing.
Features
Gemini 2.5 Pro Text-to-Speech
Gemini 2.5 Pro Text-to-Speech is Google’s advanced multi-speaker speech synthesis model that turns written dialogue into natural, expressive audio. It supports multiple speakers with distinct voices in a single generation, making it ideal for podcasts, conversations, audiobooks, and any content that needs realistic multi-voice narration.
Why Choose This?
- Multi-speaker dialogue: Assign different voices to different speakers and generate a natural-sounding conversation in one pass, with no need to stitch separate audio clips together.
- Expressive, natural voices: Powered by Gemini 2.5 Pro, the voices carry natural intonation, pacing, and emotional range.
- Multi-language support: Supports 24 language locales, including Arabic (Egypt), Bangla (Bangladesh), Dutch (Netherlands), English (India), English (United States), French (France), German (Germany), Hindi (India), and Indonesian (Indonesia).
- Flexible speaker setup: Give each speaker in your script a name and a voice (the API accepts 1 to 2 speaker entries per request). Write the dialogue with speaker labels and the model handles the rest.
Parameters
| Parameter | Required | Description |
|---|---|---|
| text | Yes | The script or dialogue text. Use “Speaker: line” format for multi-speaker content. |
| language | Yes | Language and locale for synthesis (e.g., English (United States), French (France)). |
| speakers | Yes | A list of speaker entries, each with a speaker name and a voice selection. |
How to Use
- Write your script in the text field using the “Speaker: dialogue” format (e.g., “Rose: Welcome back to Tech Talk!”).
- Select the language from the dropdown.
- Add speakers — for each speaker in your script, add an entry with the speaker name and choose a voice.
- Run — the model generates a single audio file with all speakers voiced naturally.
- Download the output audio.
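The steps above correspond to a JSON request body with the three fields from the Parameters table. The following is a minimal Python sketch of how such a body could be assembled. `build_payload` is a hypothetical helper, the speaker "Sam" is made up for illustration, and the voice name "Kore" is an assumption taken from Google's published Gemini TTS voice list, so check the voice selector for what is actually offered. Separating turns with newlines is also an assumption.

```python
import json

def build_payload(lines, voices, language="English (United States)"):
    """Build a request body from (speaker, line) pairs and a speaker-to-voice map."""
    # Write each turn in the "Speaker: dialogue" format (newline-separated: an assumption).
    text = "\n".join(f"{speaker}: {line}" for speaker, line in lines)
    # Collect speaker names in order of first appearance, without duplicates.
    names = []
    for speaker, _ in lines:
        if speaker not in names:
            names.append(speaker)
    speakers = [{"speaker": name, "voice": voices[name]} for name in names]
    return {"text": text, "language": language, "speakers": speakers}

payload = build_payload(
    [("Rose", "Welcome back to Tech Talk!"), ("Sam", "Glad to be here.")],
    {"Rose": "Achernar", "Sam": "Kore"},  # "Kore" is an assumed voice name
)
print(json.dumps(payload, indent=2))
```

Deriving the speakers list from the script itself keeps the names in the text and in the list identical by construction.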
Pricing
$0.08 per 1,000 characters of input text.
Billing Rules
- Billed by text length, rounded up to the next full block of 1,000 characters
- Minimum charge is $0.08 (for texts up to 1,000 characters)
Examples
| Text Length | Cost |
|---|---|
| 500 characters | $0.08 |
| 1,000 characters | $0.08 |
| 2,500 characters | $0.24 |
| 5,000 characters | $0.40 |
| 10,000 characters | $0.80 |
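The billing rules and the table above can be reproduced with a few lines of arithmetic. This is a rough sketch for estimating cost before submitting a script; `estimate_cost_cents` is a hypothetical name, and how the service counts characters (for example, whether whitespace or speaker labels count) is not stated here, so treat the character count as an approximation.

```python
PRICE_CENTS_PER_BLOCK = 8  # $0.08 per 1,000 characters
BLOCK_SIZE = 1000

def estimate_cost_cents(char_count):
    """Estimate the charge in cents for a script with char_count characters."""
    # Round up to whole 1,000-character blocks using integer arithmetic.
    blocks = (char_count + BLOCK_SIZE - 1) // BLOCK_SIZE
    # The minimum charge covers one block.
    return max(1, blocks) * PRICE_CENTS_PER_BLOCK

for n in (500, 1000, 2500, 5000, 10000):
    print(f"{n:>6} characters -> ${estimate_cost_cents(n) / 100:.2f}")
```

Working in integer cents avoids floating-point rounding surprises when totals are summed across many requests.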
Best Use Cases
- Podcasts & Talk Shows — Generate multi-host audio content with distinct voices for each speaker.
- Audiobooks & Narration — Bring stories to life with different character voices in a single generation.
- E-learning & Training — Create engaging instructional audio with conversational dialogue.
- Content Localization — Produce voiceovers in multiple languages for global audiences.
- Prototyping & Pre-production — Quickly audition dialogue and voice pairings before recording with real talent.
Pro Tips
- Use the “Speaker: dialogue” format consistently throughout your script to ensure correct voice assignment.
- Make sure each speaker name in the text exactly matches the speaker name in the speakers list.
- Keep dialogue natural — the model handles pacing and intonation best with conversational writing.
- For long scripts, break content into logical segments to review quality before generating the full piece.
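The name-matching tip above is easy to check before submitting. A minimal sketch, assuming a speaker label is whatever precedes the first colon on a line; `check_speaker_labels` is a hypothetical helper, and the heuristic will also flag ordinary lines that happen to contain a colon.

```python
import re

# Heuristic: a speaker label is the text before the first colon on a line.
LABEL_PATTERN = re.compile(r"^[ \t]*([^:\n]+?)[ \t]*:", re.MULTILINE)

def check_speaker_labels(text, speakers):
    """Compare labels used in the script with names declared in the speakers list.

    Returns (undeclared, unused): labels found in the text but missing from
    the list, and declared names that never appear in the text.
    """
    used = {match.group(1) for match in LABEL_PATTERN.finditer(text)}
    declared = {entry["speaker"] for entry in speakers}
    return sorted(used - declared), sorted(declared - used)

script = "Rose: Welcome back to Tech Talk!\nrose: Today we cover speech synthesis."
speakers = [{"speaker": "Rose", "voice": "Achernar"}]
undeclared, unused = check_speaker_labels(script, speakers)
print(undeclared, unused)  # the lowercase "rose" label is reported as undeclared
```

The comparison is deliberately case-sensitive, since the tip asks for an exact match.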
Notes
- The number of available voices may vary by language. Experiment with different voice options to find the best fit for your content.
- Please ensure your content complies with Google’s usage policies.
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
# Submit the task
curl --location --request POST "https://api.wavespeed.ai/api/v3/google/gemini-2.5-pro/text-to-speech" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "text": "Rose: Welcome back to Tech Talk!",
    "language": "English (United States)",
    "speakers": [
        {
            "speaker": "Rose",
            "voice": "Achernar"
        }
    ]
}'
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
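The same submit-then-query flow can be sketched in Python with only the standard library. The endpoint URLs and the `WAVESPEED_API_KEY` variable come from the curl commands above, and the `data.id`, `data.status`, `data.outputs`, and `data.error` fields come from the response tables in this document. The function names, the one-second polling interval, and the attempt limit are assumptions, not documented behavior.

```python
import json
import os
import time
import urllib.request

API_BASE = "https://api.wavespeed.ai/api/v3"
SUBMIT_URL = f"{API_BASE}/google/gemini-2.5-pro/text-to-speech"

def call_api(url, payload=None):
    """Send a POST (when payload is given) or a GET and return the decoded JSON body."""
    headers = {"Authorization": f"Bearer {os.environ['WAVESPEED_API_KEY']}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def wait_for_outputs(request_id, fetch=call_api, interval=1.0, max_attempts=60):
    """Poll the result endpoint until the task completes, fails, or times out."""
    url = f"{API_BASE}/predictions/{request_id}/result"
    for _ in range(max_attempts):
        data = fetch(url)["data"]
        if data["status"] == "completed":
            return data["outputs"]
        if data["status"] == "failed":
            raise RuntimeError(data.get("error") or "task failed")
        time.sleep(interval)  # interval is an assumed value, tune as needed
    raise TimeoutError(f"task {request_id} did not finish after {max_attempts} polls")

def synthesize(payload):
    """Submit a task and return the list of output URLs once it completes."""
    request_id = call_api(SUBMIT_URL, payload)["data"]["id"]
    return wait_for_outputs(request_id)
```

Passing the HTTP function in as `fetch` keeps the polling loop testable without network access; calling `synthesize(payload)` with a valid key in the environment would perform the real requests.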
Parameters
Task Submission Parameters
Request Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| text | string | Yes | - | - | The script or dialogue text to synthesize. Use “Speaker: line” format for multi-speaker content. Must be less than or equal to 8,000 bytes. |
| language | string | Yes | English (United States) | Arabic (Egypt), Bangla (Bangladesh), Dutch (Netherlands), English (India), English (United States), French (France), German (Germany), Hindi (India), Indonesian (Indonesia), Italian (Italy), Japanese (Japan), Korean (South Korea), Marathi (India), Polish (Poland), Portuguese (Brazil), Romanian (Romania), Russian (Russia), Spanish (Spain), Tamil (India), Telugu (India), Thai (Thailand), Turkish (Turkey), Ukrainian (Ukraine), Vietnamese (Vietnam) | Language spoken in the audio. |
| speakers | array | Yes | [{"speaker":"","voice":"Achernar"}] | 1 to 2 items | Array of speaker entries, each with a speaker name and the voice to use for that speaker. |
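The constraints in this table can be checked on the client before a request is sent. The sketch below is a hypothetical pre-flight check, not part of the API: `validate_request` is an invented name, and measuring the 8,000-byte limit as UTF-8 is an assumption, since the encoding is not specified.

```python
SUPPORTED_LANGUAGES = {
    "Arabic (Egypt)", "Bangla (Bangladesh)", "Dutch (Netherlands)",
    "English (India)", "English (United States)", "French (France)",
    "German (Germany)", "Hindi (India)", "Indonesian (Indonesia)",
    "Italian (Italy)", "Japanese (Japan)", "Korean (South Korea)",
    "Marathi (India)", "Polish (Poland)", "Portuguese (Brazil)",
    "Romanian (Romania)", "Russian (Russia)", "Spanish (Spain)",
    "Tamil (India)", "Telugu (India)", "Thai (Thailand)",
    "Turkish (Turkey)", "Ukrainian (Ukraine)", "Vietnamese (Vietnam)",
}
MAX_TEXT_BYTES = 8000

def validate_request(payload):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    text = payload.get("text", "")
    if not text:
        problems.append("text is required")
    elif len(text.encode("utf-8")) > MAX_TEXT_BYTES:  # UTF-8 is an assumption
        problems.append("text exceeds 8,000 bytes")
    if payload.get("language") not in SUPPORTED_LANGUAGES:
        problems.append("language is not in the supported list")
    speakers = payload.get("speakers") or []
    if not 1 <= len(speakers) <= 2:
        problems.append("speakers must contain 1 to 2 entries")
    return problems

print(validate_request({
    "text": "Rose: Welcome back to Tech Talk!",
    "language": "English (United States)",
    "speakers": [{"speaker": "Rose", "voice": "Achernar"}],
}))
```

Note that the limit is in bytes, not characters, so text in non-Latin scripts reaches it with far fewer characters than English text does.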
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data.id | string | Unique identifier for the prediction (the task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
Result Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | Yes | - | Task ID |
Result Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier of the prediction being retrieved (the task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
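As an illustration of how the result fields above fit together, the following sketch maps a response body onto a small typed record. `TaskResult` and `parse_result` are hypothetical names, and the sample body uses made-up values (the ID, URL, and timing are placeholders, not real API output).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskResult:
    id: str
    status: str
    outputs: List[str] = field(default_factory=list)
    error: str = ""
    inference_ms: Optional[int] = None

def parse_result(body):
    """Extract the fields documented above from a result response body."""
    data = body["data"]
    timings = data.get("timings") or {}
    return TaskResult(
        id=data["id"],
        status=data["status"],
        outputs=list(data.get("outputs") or []),  # empty until completed
        error=data.get("error") or "",
        inference_ms=timings.get("inference"),
    )

# Placeholder response body for illustration only.
sample = {
    "code": 200,
    "message": "success",
    "data": {
        "id": "abc123",
        "status": "completed",
        "outputs": ["https://example.com/audio.wav"],
        "error": "",
        "timings": {"inference": 1234},
    },
}
result = parse_result(sample)
print(result.status, result.outputs)
```

Treating a missing `outputs` or `timings` value as empty keeps the parser usable for responses where the task is still in the created or processing state.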