Speech Generation

Convert text into expressive spoken audio

Our pick

video-dubbing

wavespeed-ai/mmaudio-v2

MMAudio v2 produces synchronized audio from video or text inputs, ideal for adding soundtracks to videos when paired with video models. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

All models

39 models

video-dubbing

wavespeed-ai/mmaudio-v2

text-to-audio

kwaivgi/kling-text-to-audio

Kling Text-to-Audio turns text prompts into custom sound effects for videos, games, and multimedia using KlingAI's audio model. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/turbo-v2

ElevenLabs Turbo V2 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1000 characters for API requests. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.
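As a rough illustration of per-character billing, the sketch below estimates the cost of a request at the listed rate of $0.05 per 1000 characters. The function name `estimate_tts_cost` is hypothetical, and the sketch assumes billing is strictly linear in character count, with no rounding, minimum charge, or volume discount; the listing itself does not specify any of this.

```python
from decimal import Decimal


def estimate_tts_cost(text: str, usd_per_1000_chars: str) -> Decimal:
    """Estimate the cost of synthesizing `text`.

    ASSUMPTION: cost scales linearly with len(text); the real billing
    rules (rounding, minimums, how characters are counted) may differ.
    """
    return Decimal(len(text)) * Decimal(usd_per_1000_chars) / Decimal(1000)


# Example: a 3500-character script at the listed $0.05 per 1000 characters.
script = "Hello, world! " * 250
print(len(script), "characters ->", estimate_tts_cost(script, "0.05"), "USD")
```

`Decimal` is used instead of `float` so that small per-character amounts add up without binary rounding noise.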

audio-to-audio

wavespeed-ai/ace-step/audio-outpaint

ACE-Step Audio Outpaint extends a track at its start or end with seamless audio that matches the original, ideal for intros, outros, and longer tracks. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/voice-design

MiniMax Voice Design generates natural voices from textual descriptions, with no cloning required, and lets you set tone, accent, and personality. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.5-hd-preview

MiniMax Speech 2.5 HD Preview offers HD TTS with enhanced multilingual expressiveness, accurate voice cloning, and 40-language support. Ready-to-use REST API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.5-turbo-preview

MiniMax Speech 2.5 Turbo Preview offers HD TTS with multilingual support and accurate voice replication across 40 languages, billed at $0.04 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/ace-step/audio-inpaint

ACE-Step Audio Inpaint edits a specific audio segment to change lyrics or style while preserving the surrounding audio. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/multilingual-v2

ElevenLabs Multilingual V2 is a multilingual text-to-speech model, billed at $0.10 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-02-turbo

MiniMax Speech-02 Turbo is a high-definition text-to-speech model delivering natural voice output, billed at $0.03 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/flash-v2

ElevenLabs Flash V2 is a text-to-speech model that converts text into spoken audio using the ElevenLabs Flash V2 engine. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/flash-v2.5

ElevenLabs Flash V2.5 is a text-to-speech model on WaveSpeedAI, billed at $0.05 per 1000 characters for generated speech. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/multilingual-v1

ElevenLabs Multilingual V1 provides natural-sounding text-to-speech across many languages. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

digital-human

wavespeed-ai/wan-2.2/speech-to-video

Wan-2.2-S2V turns images and speech into high-fidelity videos with realistic face and body motion. It supports clips of up to 10 minutes in 480p, priced from $0.15 per 5 seconds. Ready-to-use REST API, no cold starts, affordable pricing.
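To make the duration-based pricing concrete, here is a minimal sketch that turns the listed "from $0.15 per 5 seconds" rate into a lower-bound estimate for a clip. The function name `estimate_s2v_cost` is hypothetical, and two things are assumptions rather than facts from the listing: that billing rounds up to whole 5-second blocks, and that the "from" rate applies to the whole clip.

```python
import math
from decimal import Decimal

MAX_CLIP_SECONDS = 600  # "up to 10-minute clips", per the listing


def estimate_s2v_cost(duration_s: float,
                      usd_per_block: str = "0.15",
                      block_s: int = 5) -> Decimal:
    """Lower-bound cost estimate for a speech-to-video clip.

    ASSUMPTION: usage is billed in whole 5-second blocks, rounded up.
    The listed price is a starting ("from") price, so real cost may be higher.
    """
    if not 0 < duration_s <= MAX_CLIP_SECONDS:
        raise ValueError("duration must be in (0, 600] seconds")
    blocks = math.ceil(duration_s / block_s)
    return blocks * Decimal(usd_per_block)


# A full 10-minute clip is 120 blocks of 5 seconds.
print(estimate_s2v_cost(600), "USD minimum for a 10-minute clip")
```

Under these assumptions a maximum-length clip costs at least 120 × $0.15 = $18.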

text-to-audio

kwaivgi/kling-v1-tts

Kling V1 TTS creates natural-sounding audio and supports KlingAI image, video, sound effect, virtual model, and custom AI workflows. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-v1.5

MiniMax Music v1.5 is a text-to-audio model that turns text prompts into high-quality, diverse music. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

alibaba/qwen3-tts-flash

Qwen3 TTS Flash is a low-latency text-to-speech model for English and Chinese with multiple voices, ideal for real-time dialogue. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/eleven-v3

ElevenLabs Eleven v3 is a text-to-speech model available as a hosted endpoint; requests cost $0.10 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/ace-step

ACE-Step generates music with lyrics from text, up to 4 minutes long and with high acoustic fidelity; it supports voice cloning, lyric edits, and remixing. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/ace-step/audio-to-audio

ACE-Step Audio-to-Audio turns existing tracks into remixes or vocal edits using remix and lyrics modes while preserving audio character. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

minimax/voice-clone

MiniMax Voice Clone creates high-quality voice clones from short reference clips, closely matching tone, accent, and speaking style. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/ace-step/prompt-to-audio

ACE-Step Prompt-to-Audio creates music from simple prompts, auto-generating genre tags and lyrics for quick song creation. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

elevenlabs/turbo-v2.5

ElevenLabs Turbo V2.5 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1000 characters for TTS requests. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-01

MiniMax Music-01 synthesizes accompaniment and vocals simultaneously to produce complete songs across diverse styles. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.6-hd

MiniMax Speech 2.6 HD is an ultra-human, low-latency (< 250 ms) TTS model with voice cloning, text normalization, and support for 40+ languages. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.6-turbo

MiniMax Speech 2.6 Turbo is a text-to-speech model offering ultra-human voice cloning, industry-leading text normalization, sub-250 ms latency, and 40+ language support, billed at $0.06 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/music-02

MiniMax Music-02 is a compact, fast, cost-effective MoE music generator (230B params, 10B active) for high-quality music production. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-02-hd

MiniMax Speech 02 HD is MiniMax's high-definition text-to-speech model delivering clear HD voices, billed at $0.05 per 1000 characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.
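Several text-to-speech entries in this listing state a per-character rate, so a quick side-by-side can help when picking a model on cost alone. The sketch below ranks only the models whose descriptions quote a price; the rates are copied from this listing and may be out of date, the helper name `rank_by_cost` is hypothetical, and billing is assumed to be linear in character count.

```python
from decimal import Decimal

# USD per 1000 characters, copied from the descriptions in this listing.
# Models without a quoted price are deliberately left out.
LISTED_RATES = {
    "elevenlabs/turbo-v2": "0.05",
    "minimax/speech-2.5-turbo-preview": "0.04",
    "elevenlabs/multilingual-v2": "0.1",
    "minimax/speech-02-turbo": "0.03",
    "elevenlabs/flash-v2.5": "0.05",
    "elevenlabs/eleven-v3": "0.1",
    "elevenlabs/turbo-v2.5": "0.05",
    "minimax/speech-2.6-turbo": "0.06",
    "minimax/speech-02-hd": "0.05",
}


def rank_by_cost(num_chars: int) -> list[tuple[str, Decimal]]:
    """Return (model, estimated USD) pairs, cheapest first.

    ASSUMPTION: cost is linear in character count with no rounding
    or minimum charge. Ties are broken alphabetically by model id.
    """
    costs = {
        model: Decimal(num_chars) * Decimal(rate) / Decimal(1000)
        for model, rate in LISTED_RATES.items()
    }
    return sorted(costs.items(), key=lambda item: (item[1], item[0]))


# Estimated cost of synthesizing 100,000 characters with each priced model.
for model, cost in rank_by_cost(100_000):
    print(f"{model:34s} ${cost:.2f}")
```

Price is only one axis: the listing also distinguishes models by latency, language coverage, and voice-cloning support, none of which this ranking captures.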

text-to-audio

wavespeed-ai/vibevoice

wavespeed-ai/vibevoice is a voice generation model that produces high-fidelity, natural, and expressive speech from text, with optional speaker and region-style control for more precise results. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.8-turbo

MiniMax Speech 2.8 Turbo is a high-definition text-to-speech model with natural and expressive voice synthesis. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

minimax/speech-2.8-hd

MiniMax Speech 2.8 HD is a high-definition text-to-speech model with natural and expressive voice synthesis for premium audio quality. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/qwen3-tts/text-to-speech

Qwen3 TTS: Multi-language, multi-voice text-to-speech synthesis with style control. Supports 11 languages and 9 voice characters. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

audio-to-audio

wavespeed-ai/qwen3-tts/voice-clone

Qwen3 TTS Voice Clone: Clone any voice from a reference audio and generate speech in that voice. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

wavespeed-ai/qwen3-tts/voice-design

Qwen3 TTS Voice Design: Generate speech with custom voice characteristics described in natural language. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

microsoft/vibevoice

Microsoft VibeVoice is a text-to-speech model that generates long-form speech from text with multi-speaker dialogue support. Choose from 9 voice presets across English, Chinese, and Hindi. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

inworld/inworld-1.5-max/text-to-speech

Inworld 1.5 Max delivers premium text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and high-fidelity natural-sounding audio output. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

inworld/inworld-1.5-mini/text-to-speech

Inworld 1.5 Mini delivers high-quality text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and natural-sounding audio output. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

google/gemini-2.5-pro/text-to-speech

Google Gemini 2.5 Pro Text-to-Speech delivers natural multi-speaker voice synthesis with 30+ voices across 24 languages. Suited to dialogues, conversations, and multilingual content. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

text-to-audio

google/gemini-2.5-flash/text-to-speech

Google Gemini 2.5 Flash Text-to-Speech delivers fast, natural multi-speaker voice synthesis with 30+ voices across 24 languages at lower cost. Suited to dialogues, conversations, and multilingual content. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.