Kling Lipsync Text to Video | AI Digital Human API

Kling Lipsync Text-to-Video

Make any face speak your words with AI-powered lip synchronization. Upload a video, enter your text, choose a voice, and Kling Lipsync will generate realistic lip movements perfectly matched to the synthesized speech — ideal for dubbing, content localization, and creative projects.

Why It Looks Great

Realistic lip sync: AI-generated mouth movements accurately match the spoken audio for natural-looking results.
Multiple voice options: Choose from a variety of voice characters to match your content style.
Bilingual support: Generate speech in English (en) or Chinese (zh).
Adjustable speed: Control the speaking pace with the voice speed parameter.
Text-driven workflow: Simply type what you want the character to say — no audio recording needed.

Parameters

Parameter	Required	Description
video	Yes	Source video with a visible face (upload or public URL).
text	Yes	The text you want the character to speak.
voice_id	Yes	Voice character selection (e.g., genshin_klee2).
voice_language	No	Language for speech synthesis: en (English) or zh (Chinese). Default: en.
voice_speed	No	Speaking speed multiplier. Default: 1.
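
The table above maps directly onto a JSON request body. As a minimal sketch, the helper below checks the three required fields and fills in the documented defaults. `build_payload` is a hypothetical name used for illustration and is not part of any SDK. The positive-speed check is an assumption, since the allowed range of `voice_speed` is not stated here.

```python
def build_payload(video, text, voice_id, voice_language="en", voice_speed=1):
    """Assemble a request body from the documented parameters.

    Hypothetical helper for illustration only; not an SDK function.
    """
    required = {"video": video, "text": text, "voice_id": voice_id}
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    if voice_language not in ("en", "zh"):
        raise ValueError("voice_language must be 'en' or 'zh'")
    # Assumption: the speed multiplier must be positive; the real range may be narrower.
    if voice_speed <= 0:
        raise ValueError("voice_speed must be positive")
    return {**required, "voice_language": voice_language, "voice_speed": voice_speed}


payload = build_payload(
    video="https://example.com/your-input.mp4",
    text="Hello, and welcome to the demo.",
    voice_id="genshin_klee2",
)
print(payload)
```

Validating locally like this catches a missing `text` or a misspelled language code before a request is submitted.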

How to Use

Upload your video — drag and drop or paste a public URL. Ensure the face is clearly visible.
Enter your text — type the words you want the character to speak.
Select voice_id — choose a voice character that fits your content.
Choose language — select en for English or zh for Chinese.
Adjust speed (optional) — modify voice_speed to speak faster or slower.
Run — click the button to generate.
Download — preview and save your lip-synced video.

Pricing

Flat rate per generation.

Output	Cost
Per video	$0.14
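
For budgeting a batch, the flat rate makes the arithmetic simple. The sketch below assumes the $0.14 figure applies unchanged to every generation. `batch_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding on currency.

```python
from decimal import Decimal

PRICE_PER_VIDEO = Decimal("0.14")  # flat rate taken from the pricing table


def batch_cost(num_videos):
    """Return the estimated cost in USD for num_videos generations."""
    if num_videos < 0:
        raise ValueError("num_videos must not be negative")
    return PRICE_PER_VIDEO * num_videos


print(batch_cost(25))  # 3.50
```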

Best Use Cases

Content Localization — Dub videos into different languages while maintaining natural lip movements.
Social Media & Entertainment — Create fun talking videos, memes, and viral content.
E-learning & Training — Generate instructional videos with consistent narration.
Marketing & Advertising — Produce multilingual ad variants from a single video shoot.
Character Animation — Bring static or animated characters to life with synchronized speech.

Pro Tips for Best Results

Use videos with clear, front-facing shots of the face for the most accurate lip sync.
Keep text length appropriate for the video duration — shorter clips work best with concise messages.
Match the voice character to the visual appearance for more believable results.
Test different voice_speed values to find the natural pacing for your content.
For multilingual projects, generate separate versions with appropriate voice_language settings.
Ensure good lighting on the face in the source video for cleaner lip tracking.
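
To check whether a script plausibly fits a clip before submitting, a rough estimate can help. This is a sketch under stated assumptions. The 150 words-per-minute rate is a common rule of thumb for English speech, not a figure from this model. It also assumes a higher `voice_speed` shortens the audio proportionally. Counting words by whitespace does not work for Chinese text.

```python
def estimated_speech_seconds(text, voice_speed=1.0, words_per_minute=150):
    """Estimate spoken duration; 150 wpm is an assumed English speaking rate."""
    words = len(text.split())
    return words / (words_per_minute * voice_speed) * 60


def fits_clip(text, clip_seconds, voice_speed=1.0):
    """Return True if the estimated speech is no longer than the clip."""
    return estimated_speech_seconds(text, voice_speed) <= clip_seconds


script = "Welcome back. Today we are looking at three quick tips for better lighting."
print(round(estimated_speech_seconds(script), 1))
print(fits_clip(script, clip_seconds=5))
```

If the estimate runs over, shortening the text or raising `voice_speed` slightly are the two levers the parameters offer.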

Notes

If using a URL for the video, ensure it is publicly accessible. A preview thumbnail confirms successful loading.
The face must be clearly visible throughout the video for accurate lip synchronization.
Processing time may vary based on video length and current queue load.
Best results are achieved with videos where the subject is speaking or has a neutral expression.

Kling Lipsync Text To Video API — Quick start

Grab a WaveSpeedAI API key, then call POST https://api.wavespeed.ai/api/v3/kwaivgi/kling-lipsync/text-to-video with your input as JSON. The endpoint returns a prediction id; poll the prediction endpoint until status flips to completed, then read the output URL from data.outputs[0]. Examples for Kling Lipsync Text To Video below.

HTTP example

# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/kwaivgi/kling-lipsync/text-to-video" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "video": "https://example.com/your-input.mp4",
    "text": "Hello, and welcome to the demo.",
    "voice_id": "genshin_klee2",
    "voice_language": "en",
    "voice_speed": 1
  }'

# The response includes a prediction id. Replace {request_id} with that id and poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].
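
The submit-then-poll flow can also be sketched in Python with only the standard library. The result URL comes from the example above. The response shape (`data.status`, `data.outputs`) follows the quick-start description, while the `"failed"` status value, the `error` field, and the default interval and timeout are assumptions. The demo at the bottom uses canned responses so it runs without network access.

```python
import json
import time
import urllib.request

RESULT_URL = "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result"


def fetch_result(request_id, api_key):
    """GET the prediction result and return the parsed JSON body."""
    req = urllib.request.Request(
        RESULT_URL.format(request_id=request_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_output(fetch, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch() until the prediction completes; return the first output URL.

    Assumes a body like {"data": {"status": ..., "outputs": [...]}} and that
    a failed run reports status "failed" (both are assumptions).
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch().get("data", {})
        status = data.get("status")
        if status == "completed":
            return data["outputs"][0]
        if status == "failed":
            raise RuntimeError(f"prediction failed: {data.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("prediction did not complete in time")
        sleep(interval)


# Offline demo with canned responses; the "processing" status name is illustrative.
canned = iter([
    {"data": {"status": "processing", "outputs": []}},
    {"data": {"status": "completed", "outputs": ["https://example.com/out.mp4"]}},
])
print(wait_for_output(lambda: next(canned), interval=0))

# Real usage would look like:
# wait_for_output(lambda: fetch_result(request_id, api_key))
```

Passing the fetch step in as a callable keeps the polling loop testable and lets the timeout be tuned to the generation times observed for this model.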

Node.js example

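// npm install wavespeed
const WaveSpeed = require('wavespeed');

const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

// CommonJS modules do not allow top-level await, so wrap the call in an async function.
async function main() {
  const result = await client.run("kwaivgi/kling-lipsync/text-to-video", {
    "video": "https://example.com/your-input.mp4",
    "text": "Hello, and welcome to the demo.", // required: the words to speak
    "voice_id": "genshin_klee2",
    "voice_language": "en",
    "voice_speed": 1
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});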

Python example

# pip install wavespeed
import wavespeed

output = wavespeed.run(
    "kwaivgi/kling-lipsync/text-to-video",
    {
        "video": "https://example.com/your-input.mp4",
        "text": "Hello, and welcome to the demo.",  # required: the words to speak
        "voice_id": "genshin_klee2",
        "voice_language": "en",
        "voice_speed": 1,
    },
)

print(output["outputs"][0])  # → URL of the generated output

Kling Lipsync Text To Video API — Frequently asked questions

What is the Kling Lipsync Text To Video API?

Kling Lipsync Text To Video is a Kuaishou (Kwaivgi) model for talking-avatar generation, exposed as a REST API on WaveSpeedAI. It takes a source video and input text, synthesizes the speech with the selected voice, and produces a video whose lip movements are synced to that speech. WaveSpeedAI offers it as a ready-to-use REST inference API with no cold starts and affordable pricing. You can call it programmatically or try it from the playground above.

How do I call the Kling Lipsync Text To Video API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/kwaivgi/kwaivgi-kling-lipsync-text-to-video.

How much does Kling Lipsync Text To Video cost per run?

Kling Lipsync Text To Video costs $0.14 per run, the flat per-video rate listed in the Pricing section. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Kling Lipsync Text To Video accept?

Key inputs: `video`, `text`, `voice_id`, `voice_language`, `voice_speed`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/kwaivgi/kwaivgi-kling-lipsync-text-to-video.

How long does Kling Lipsync Text To Video take to generate?

Average end-to-end generation time on WaveSpeedAI is around 166 seconds per request — measured across recent runs. Queue time scales with global demand; live status is visible in the prediction record.

Can I use Kling Lipsync Text To Video outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (Kuaishou). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.
