Speech Generation

Convert text into expressive spoken audio

Our Pick

wavespeed-ai/mmaudio-v2
video-dubbing

MMaudio v2 produces synchronized audio from video or text inputs, ideal for adding soundtracks to videos when paired with video models. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

All Models

39 models
wavespeed-ai/mmaudio-v2
video-dubbing

MMaudio v2 produces synchronized audio from video or text inputs, ideal for adding soundtracks to videos when paired with video models.

kwaivgi/kling-text-to-audio
text-to-audio

Kling Text-to-Audio turns text prompts into custom sound effects for videos, games, and multimedia using KlingAI's audio model.

elevenlabs/turbo-v2
text-to-audio

ElevenLabs Turbo V2 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters for API requests.

wavespeed-ai/ace-step/audio-outpaint
audio-to-audio

ACE-Step Audio Outpaint generates seamless start or end extensions that match the original audio, ideal for intros, outros, and longer tracks.

minimax/voice-design
text-to-audio

MiniMax Voice Design generates natural voices from text descriptions, with no cloning required, and lets you set tone, accent, and personality.

minimax/speech-2.5-hd-preview
text-to-audio

MiniMax Speech 2.5 HD Preview offers HD TTS with enhanced multilingual expressiveness, accurate voice cloning, and 40-language support.

minimax/speech-2.5-turbo-preview
text-to-audio

MiniMax Speech 2.5 Turbo Preview is an HD TTS model with multilingual support and accurate voice replication across 40 languages, priced at $0.04 per 1,000 characters.

wavespeed-ai/ace-step/audio-inpaint
audio-to-audio

ACE-Step Audio Inpaint edits a specific audio segment to change lyrics or style while preserving the surrounding audio.

elevenlabs/multilingual-v2
text-to-audio

ElevenLabs Multilingual V2 is a multilingual text-to-speech model, priced at $0.10 per 1,000 characters.

minimax/speech-02-turbo
text-to-audio

MiniMax Speech-02 Turbo is a high-definition text-to-speech model delivering natural voice output, priced at $0.03 per 1,000 characters.

elevenlabs/flash-v2
text-to-audio

ElevenLabs Flash V2 is a text-to-speech model that converts text into spoken audio.

elevenlabs/flash-v2.5
text-to-audio

ElevenLabs Flash V2.5 is a text-to-speech model on WaveSpeedAI, billed at $0.05 per 1,000 characters of generated speech.

elevenlabs/multilingual-v1
text-to-audio

ElevenLabs Multilingual V1 provides natural-sounding text-to-speech across many languages.

wavespeed-ai/wan-2.2/speech-to-video
digital-human

Wan-2.2-S2V turns images and speech into high-fidelity videos with realistic face and body motion. It supports clips of up to 10 minutes at 480p, priced from $0.15 per 5 seconds.

kwaivgi/kling-v1-tts
text-to-audio

Kling V1 TTS creates natural-sounding audio and supports KlingAI image, video, sound effect, virtual model, and custom AI workflows.

minimax/music-v1.5
text-to-audio

MiniMax Music v1.5 turns text prompts into high-quality music across diverse styles.

alibaba/qwen3-tts-flash
text-to-audio

Qwen3 TTS Flash is a low-latency text-to-speech model for English and Chinese with multiple voices, ideal for real-time dialogue.

elevenlabs/eleven-v3
text-to-audio

ElevenLabs eleven-v3 is a text-to-speech model available as a hosted endpoint; requests cost $0.10 per 1,000 characters.

wavespeed-ai/ace-step
text-to-audio

ACE-Step generates music with lyrics from text, up to 4 minutes long and with high acoustic fidelity. It supports voice cloning, lyric edits, and remixing.

wavespeed-ai/ace-step/audio-to-audio
audio-to-audio

ACE-Step Audio-to-Audio turns existing tracks into remixes or vocal edits using remix and lyrics modes while preserving the character of the original audio.

minimax/voice-clone
audio-to-audio

MiniMax Voice Clone creates high-quality voice clones from short reference clips, closely matching tone, accent, and speaking style.

wavespeed-ai/ace-step/prompt-to-audio
text-to-audio

ACE-Step Prompt-to-Audio creates music from simple prompts, auto-generating genre tags and lyrics for quick song creation.

elevenlabs/turbo-v2.5
text-to-audio

ElevenLabs Turbo V2.5 is a text-to-speech model available via WaveSpeedAI, billed at $0.05 per 1,000 characters for TTS requests.

minimax/music-01
text-to-audio

MiniMax Music-01 synthesizes accompaniment and vocals simultaneously to produce complete songs across diverse styles.

minimax/speech-2.6-hd
text-to-audio

MiniMax Speech 2.6 HD is an ultra-human, low-latency (under 250 ms) TTS model with voice cloning, text normalization, and support for 40+ languages.

minimax/speech-2.6-turbo
text-to-audio

MiniMax Speech 2.6 Turbo is a text-to-speech model offering ultra-human voice cloning, industry-leading text normalization, sub-250 ms latency, and 40+ language support, priced at $0.06 per 1,000 characters.

minimax/music-02
text-to-audio

MiniMax Music-02 is a compact, fast, cost-effective MoE music generator (230B parameters, 10B active) for high-quality music production.

minimax/speech-02-hd
text-to-audio

MiniMax Speech 02 HD is MiniMax's high-definition text-to-speech model delivering clear HD voices, priced at $0.05 per 1,000 characters.

wavespeed-ai/vibevoice
text-to-audio

wavespeed-ai/vibevoice is an advanced voice generation model for producing high-fidelity, natural, and expressive speech from text, with optional speaker/region-style control for more precise results and easy integration into real-world applications.

minimax/speech-2.8-turbo
text-to-audio

MiniMax Speech 2.8 Turbo is a high-definition text-to-speech model with natural and expressive voice synthesis.

minimax/speech-2.8-hd
text-to-audio

MiniMax Speech 2.8 HD is a high-definition text-to-speech model with natural and expressive voice synthesis for premium audio quality.

wavespeed-ai/qwen3-tts/text-to-speech
text-to-audio

Qwen3 TTS provides multi-language, multi-voice text-to-speech synthesis with style control. It supports 11 languages and 9 voice characters.

wavespeed-ai/qwen3-tts/voice-clone
audio-to-audio

Qwen3 TTS Voice Clone clones a voice from a reference audio clip and generates speech in that voice.

wavespeed-ai/qwen3-tts/voice-design
text-to-audio

Qwen3 TTS Voice Design generates speech with custom voice characteristics described in natural language.

microsoft/vibevoice
text-to-audio

Microsoft VibeVoice is a text-to-speech model that generates long-form speech from text with multi-speaker dialogue support. Choose from 9 voice presets across English, Chinese, and Hindi.

inworld/inworld-1.5-max/text-to-speech
text-to-audio

Inworld 1.5 Max delivers premium text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and high-fidelity natural-sounding audio output.

inworld/inworld-1.5-mini/text-to-speech
text-to-audio

Inworld 1.5 Mini delivers high-quality text-to-speech synthesis with 56+ multilingual voices, adjustable speaking rate, and natural-sounding audio output.

google/gemini-2.5-pro/text-to-speech
text-to-audio

Google Gemini 2.5 Pro Text-to-Speech delivers natural multi-speaker voice synthesis with 30+ voices across 24 languages. Perfect for dialogues, conversations, and multilingual content.

google/gemini-2.5-flash/text-to-speech
text-to-audio

Google Gemini 2.5 Flash Text-to-Speech delivers fast, natural multi-speaker voice synthesis with 30+ voices across 24 languages at lower cost. Perfect for dialogues, conversations, and multilingual content.

Speech Generation API: Pricing and Performance

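Run any model in the Speech Generation collection through a single REST API. Billing is per generation, with no subscription and no minimum spend, on infrastructure that offers 99.9% availability and industry-leading latency.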

Why Run Speech Generation on WaveSpeedAI

Transparent Pricing

Every Speech Generation model is priced per call. Prices are listed on each model's page, and there are no additional platform fees.

Optimized for Low Latency

Most image models in the Speech Generation collection complete in under 2 seconds. Video and 3D models run several times faster than self-hosted setups.

99.9% Availability

Multi-region failover and automatic retries keep your production traffic online, even during provider outages.

Frequently Asked Questions

How much does the Speech Generation API cost?

Each model lists its own per-call price on its model page. We bill per successful generation, with no subscription fee or minimum spend.
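Several model descriptions in the catalog above quote a price per 1,000 characters, which can be turned into a rough estimate. The following is a minimal sketch, assuming cost is linear in character count with no rounding or minimum charge (this page does not say how partial thousands are billed). The `estimate_cost` helper and the `PRICE_PER_1K_CHARS` table are illustrative names, not part of any WaveSpeedAI SDK, and the prices are copied from this page, so check each model's page for current values.

```python
from decimal import Decimal

# USD per 1,000 characters, copied from the model descriptions on this page.
PRICE_PER_1K_CHARS = {
    "elevenlabs/turbo-v2": Decimal("0.05"),
    "elevenlabs/multilingual-v2": Decimal("0.10"),
    "minimax/speech-02-turbo": Decimal("0.03"),
    "minimax/speech-2.6-turbo": Decimal("0.06"),
}


def estimate_cost(model: str, text: str) -> Decimal:
    """Estimate the USD cost of synthesizing `text` with `model`.

    Assumption: cost scales linearly with character count, with no rounding.
    """
    price = PRICE_PER_1K_CHARS[model]  # raises KeyError for models not in the table
    return price * len(text) / Decimal(1000)


if __name__ == "__main__":
    script = "Hello, world! " * 100  # 1,400 characters
    for model in PRICE_PER_1K_CHARS:
        print(f"{model}: ${estimate_cost(model, script):.4f}")
```

Using `Decimal` rather than `float` keeps small currency amounts exact, which matters when many short requests are summed.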

How fast are Speech Generation models on WaveSpeedAI?

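Image models in this collection typically complete in under 2 seconds. Video and 3D models depend on duration and resolution, but they usually run several times faster than self-hosted deployments.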

Can I try the API without a credit card?

Yes. Every account receives $1 in free credit at sign-up, enough to try most Speech Generation models without a credit card.

Are there rate limits?

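Standard accounts have generous limits on concurrent tasks. Enterprise plans offer custom RPM, higher concurrency, and dedicated capacity; contact sales for details.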
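On the client side, one way to stay within a concurrent-task limit is to cap the number of requests in flight. The following is a minimal sketch, assuming a limit of 4 (this page does not state the actual number) and a hypothetical `synthesize` function that stands in for a real text-to-speech request. No WaveSpeedAI endpoint or SDK call is shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TASKS = 4  # assumed value; the real limit depends on your account


def synthesize(text: str) -> str:
    """Hypothetical placeholder for a text-to-speech request."""
    time.sleep(0.01)  # simulate network latency
    return f"audio for: {text}"


def synthesize_all(texts: list[str]) -> list[str]:
    """Run requests with at most MAX_CONCURRENT_TASKS in flight, preserving input order."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TASKS) as pool:
        return list(pool.map(synthesize, texts))


if __name__ == "__main__":
    lines = [f"line {i}" for i in range(10)]
    for result in synthesize_all(lines):
        print(result)
```

The thread pool size acts as the concurrency cap, and `pool.map` returns results in the same order as the inputs, so outputs can be matched back to their source text.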