LatentSync

Playground

Try it on WaveSpeedAI!

LatentSync synchronizes video and audio inputs to generate seamlessly synchronized content. Perfect for lip-syncing, audio dubbing, and video-audio alignment tasks.

Features

LatentSync — Audio-to-Video Lip Sync

LatentSync is a state-of-the-art end-to-end lip-sync framework built on audio-conditioned latent diffusion. It turns your talking-head videos into perfectly synchronized performances while preserving high-resolution details and natural expressions.


🌟 Key Capabilities

End-to-End Lip Synchronization

Transform any talking-head clip into a lip-synced video:

  • Takes a source video plus target audio as input
  • Generates frame-accurate mouth movements without 3D meshes or 2D landmarks
  • Preserves identity, pose, background and global scene structure

High-Resolution Talking Heads

Built on latent diffusion to deliver:

  • Sharp, detailed faces at high resolution
  • Natural facial expressions and subtle mouth shapes
  • Works for both real and stylized (e.g., anime) characters from the reference video

Temporal Consistency

LatentSync introduces Temporal REPresentation Alignment (TREPA) to:

  • Reduce flicker, jitter and frame-to-frame artifacts
  • Keep head pose, lips and jaw motion stable over long sequences
  • Maintain smooth, coherent motion at video frame rates

Multilingual & Robust

Designed for real-world content:

  • Supports multiple languages and accents
  • Robust to different speakers and recording conditions
  • Handles a variety of video styles and camera setups

🎬 Core Features

  • Audio-Conditioned Latent Diffusion — Directly models audio–visual correlations in the latent space for efficient, high-quality generations.
  • TREPA Temporal Alignment — Uses temporal representations to enforce consistency across frames.
  • Improved Lip-Sync Supervision — Refined training strategies for better lip–audio alignment on standard benchmarks.
  • Resolution Flexibility — Supports HD talking-head synthesis with controllable output resolution and frame rate.
  • Open-Source Ecosystem — Public code, checkpoints and simple CLI/GUI tools for quick integration into your pipeline.

🚀 How to Use

  1. Prepare Source Video
Provide a clear talking-head clip (.mp4) of the identity you want to animate. Upload a video with a resolution of at least 480p; higher resolutions (720p, 1080p, or 4K) are recommended.

    • Face should be visible and mostly unobstructed
    • Stable framing (minimal extreme motion) works best
  2. Provide Target Audio
    Upload the speech you want the subject to say (e.g., .wav, .mp3).

    • Use clean audio with minimal background noise
    • Trim leading/trailing silence if possible (see the ffmpeg sketch after these steps)
  3. Run Inference
    The system will generate a lip-synced talking-head video aligned with your audio.
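
If the audio needs tidying first, ffmpeg can trim silence and report duration. A minimal sketch, assuming ffmpeg/ffprobe are installed locally; the file names and the -50 dB threshold are placeholder choices, not values required by LatentSync:

# Trim leading and trailing silence below -50 dB (threshold is an example; tune per clip)
ffmpeg -i speech.mp3 \
  -af "silenceremove=start_periods=1:start_threshold=-50dB:stop_periods=1:stop_threshold=-50dB" \
  speech_trimmed.wav

# Check the resulting duration in seconds
ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 speech_trimmed.wav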


💰 Pricing

Minimum price: $0.15

  • If the input audio is shorter than 5 seconds, the minimum price of $0.15 applies
  • For longer audio, the price is adapted based on the duration of the input audio
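
For budgeting, a rough estimate can be computed from the audio duration. The per-second rate below is purely hypothetical (this page does not state the actual rate); ffprobe is assumed to be available:

# Measure the input audio duration in seconds
duration=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 speech.wav)

# HYPOTHETICAL per-second rate for illustration only; check the live pricing for the real rate
rate=0.03
echo "$duration $rate" | awk '{ c = $1 * $2; if (c < 0.15) c = 0.15; printf "Estimated cost: $%.2f\n", c }'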

💡 Pro Tips

  • Use high-quality, well-lit source videos with a clear view of the mouth.
  • Keep audio clean and dry — avoid heavy music, echo, and strong background noise.
  • For long speeches, consider segmenting audio into shorter chunks to improve stability and resource usage (see the ffmpeg sketch after these tips).
  • Match the frame rate of the output video to your target platform (e.g., 24/25/30 FPS).
  • If you encounter artifacts, try:
    • Slightly lowering resolution
    • Increasing sampling steps
    • Choosing a video segment where the head is more stable
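
The chunking and frame-rate tips above can both be handled with ffmpeg. A minimal sketch, assuming ffmpeg is installed; the file names and the 30-second chunk length are arbitrary examples:

# Split long audio into ~30-second chunks before running separate jobs
ffmpeg -i speech.wav -f segment -segment_time 30 -c copy chunk_%03d.wav

# Conform the generated video to a target frame rate (e.g., 25 FPS)
ffmpeg -i result.mp4 -filter:v fps=25 -c:a copy result_25fps.mp4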

Authentication

For authentication details, please refer to the Authentication Guide.

API Endpoints

Submit Task & Query Result


# Submit the task (audio and video are required; the URLs below are placeholders)
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/latentsync" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "audio": "https://example.com/speech.wav",
    "video": "https://example.com/talking-head.mp4"
}'

# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"

Parameters

Task Submission Parameters

Request Parameters

| Parameter | Type | Required | Default | Range | Description |
| --- | --- | --- | --- | --- | --- |
| audio | string | Yes | - | - | The URL of the audio to be synchronized. |
| video | string | Yes | - | - | The URL of the video to be synchronized. |

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., "success") |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty until the status is completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., "2023-04-01T12:34:56.789Z") |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
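
As a quick illustration of these fields, a completed task's first output URL can be extracted with jq (assumed installed), reusing the requestId from submission:

# Print the generated video URL once data.status is "completed"
curl -s --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
  --header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
  | jq -r 'select(.data.status == "completed") | .data.outputs[0]'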

Result Request Parameters

The result endpoint takes no request body; the task ID (requestId) returned at submission is passed in the URL path, as shown in the example above.
