LatentSync
Playground
Try it on WavespeedAI!

LatentSync synchronizes video and audio inputs to generate seamless, synchronized content. Perfect for lip-syncing, audio dubbing, and video-audio alignment tasks.
Features
LatentSync — Audio-to-Video Lip Sync
LatentSync is a state-of-the-art end-to-end lip-sync framework built on audio-conditioned latent diffusion. It turns your talking-head videos into perfectly synchronized performances while preserving high-resolution details and natural expressions.
🌟 Key Capabilities
End-to-End Lip Synchronization
Transform any talking-head clip into a lip-synced video:
- Takes a source video plus target audio as input
- Generates frame-accurate mouth movements without 3D meshes or 2D landmarks
- Preserves identity, pose, background and global scene structure
High-Resolution Talking Heads
Built on latent diffusion to deliver:
- Sharp, detailed faces at high resolution
- Natural facial expressions and subtle mouth shapes
- Works for both real and stylized (e.g., anime) characters from the reference video
Temporal Consistency
LatentSync introduces Temporal REPresentation Alignment (TREPA) to:
- Reduce flicker, jitter and frame-to-frame artifacts
- Keep head pose, lips and jaw motion stable over long sequences
- Maintain smooth, coherent motion at video frame rates
Multilingual & Robust
Designed for real-world content:
- Supports multiple languages and accents
- Robust to different speakers and recording conditions
- Handles a variety of video styles and camera setups
🎬 Core Features
- Audio-Conditioned Latent Diffusion — Directly models audio–visual correlations in the latent space for efficient, high-quality generations.
- TREPA Temporal Alignment — Uses temporal representations to enforce consistency across frames.
- Improved Lip-Sync Supervision — Refined training strategies for better lip–audio alignment on standard benchmarks.
- Resolution Flexibility — Supports HD talking-head synthesis with controllable output resolution and frame rate.
- Open-Source Ecosystem — Public code, checkpoints and simple CLI/GUI tools for quick integration into your pipeline.
🚀 How to Use
1. Prepare Source Video
Provide a clear talking-head clip (.mp4) of the identity you want to animate. Upload a video with a resolution higher than 480p; higher resolutions (720p, 1080p, 4K) are recommended.
- Face should be visible and mostly unobstructed
- Stable framing (minimal extreme motion) works best
2. Provide Target Audio
Upload the speech you want the subject to say (e.g., .wav, .mp3).
- Use clean audio with minimal background noise
- Trim leading/trailing silence if possible
3. Run Inference
The system will generate a lip-synced talking-head video aligned with your audio.
💰 Pricing
Minimum price: $0.15.
- If the input audio is shorter than 5 seconds, the minimum price of $0.15 applies.
- For longer audio, the price scales with the duration of the input audio.
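Since the price scales with audio duration, it can help to check a clip's length before submitting. A minimal sketch using ffprobe (an assumption here, not part of this API; requires FFmpeg, and speech.mp3 is a placeholder file name):
# Print the duration of the input audio in seconds
ffprobe -v error -show_entries format=duration \
-of default=noprint_wrappers=1:nokey=1 speech.mp3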
💡 Pro Tips
- Use high-quality, well-lit source videos with a clear view of the mouth.
- Keep audio clean and dry — avoid heavy music, echo, and strong background noise.
- For long speeches, consider segmenting audio into shorter chunks to improve stability and resource usage (see the FFmpeg sketch after this list).
- Match the frame rate of the output video to your target platform (e.g., 24/25/30 FPS).
- If you encounter artifacts, try:
  - Slightly lowering the resolution
  - Increasing the sampling steps
  - Choosing a video segment where the head is more stable
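A minimal FFmpeg sketch for the audio-segmentation and frame-rate tips above (FFmpeg is an assumption, not part of this API; file names are placeholders):
# Split a long speech into 30-second chunks without re-encoding
ffmpeg -i speech.wav -f segment -segment_time 30 -c copy chunk_%03d.wav
# Re-time an output video to 25 FPS for your target platform
ffmpeg -i result.mp4 -r 25 result_25fps.mp4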
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
# Submit the task (the audio/video URLs below are placeholders; replace them with your own)
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/latentsync" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "audio": "https://example.com/speech.mp3",
    "video": "https://example.com/talking-head.mp4"
}'
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
Parameters
Task Submission Parameters
Request Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| audio | string | Yes | - | - | The URL of the audio to be synchronized. |
| video | string | Yes | - | - | The URL of the video to be synchronized. |
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
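An illustrative response shape assembled from the fields above (all values are placeholders, not real output):
{
  "code": 200,
  "message": "success",
  "data": {
    "id": "abc123",
    "model": "wavespeed-ai/latentsync",
    "outputs": ["https://example.com/output.mp4"],
    "urls": { "get": "https://api.wavespeed.ai/api/v3/predictions/abc123/result" },
    "has_nsfw_contents": [false],
    "status": "completed",
    "created_at": "2023-04-01T12:34:56.789Z",
    "error": "",
    "timings": { "inference": 12345 }
  }
}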