video-to-video

LatentSync

wavespeed-ai/latentsync

LatentSync synchronizes a source video with a target audio track to generate seamless, lip-accurate output. Perfect for lip-syncing, audio dubbing, and video-audio alignment tasks.

LatentSync — Audio-to-Video Lip Sync

LatentSync is a state-of-the-art end-to-end lip-sync framework built on audio-conditioned latent diffusion. It turns your talking-head videos into perfectly synchronized performances while preserving high-resolution details and natural expressions.

🌟 Key Capabilities

End-to-End Lip Synchronization

Transform any talking-head clip into a lip-synced video:

  • Takes a source video plus target audio as input
  • Generates frame-accurate mouth movements without 3D meshes or 2D landmarks
  • Preserves identity, pose, background and global scene structure

High-Resolution Talking Heads

Built on latent diffusion to deliver:

  • Sharp, detailed faces at high resolution
  • Natural facial expressions and subtle mouth shapes
  • Works with both real and stylized (e.g., anime) characters in the reference video

Temporal Consistency

LatentSync introduces Temporal REPresentation Alignment (TREPA) to:

  • Reduce flicker, jitter and frame-to-frame artifacts
  • Keep head pose, lips and jaw motion stable over long sequences
  • Maintain smooth, coherent motion at video frame rates
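
To make the idea concrete, a temporal-alignment objective of this kind can be thought of as comparing features of the generated and reference clips under a frozen video encoder. The sketch below illustrates that concept only; it is not the official TREPA implementation, and the encoder, tensor shapes, and loss choice are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_alignment_loss(generated_frames, reference_frames, video_encoder):
    """Illustrative temporal-alignment loss (not the official TREPA code).

    generated_frames, reference_frames: tensors of shape (B, T, C, H, W).
    video_encoder: any frozen model that maps a clip to a sequence of
    temporal features, e.g. a self-supervised video transformer.
    """
    with torch.no_grad():
        target_feats = video_encoder(reference_frames)   # (B, T', D)
    gen_feats = video_encoder(generated_frames)          # (B, T', D)

    # Penalize differences between the temporal representations so that the
    # frame-to-frame dynamics of the generated clip track the reference clip.
    return F.mse_loss(gen_feats, target_feats)
```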

Multilingual & Robust

Designed for real-world content:

  • Supports multiple languages and accents
  • Robust to different speakers and recording conditions
  • Handles a variety of video styles and camera setups

🎬 Core Features

  • Audio-Conditioned Latent Diffusion — Directly models audio–visual correlations in the latent space for efficient, high-quality generations.
  • TREPA Temporal Alignment — Uses temporal representations to enforce consistency across frames.
  • Improved Lip-Sync Supervision — Refined training strategies for better lip–audio alignment on standard benchmarks.
  • Resolution Flexibility — Supports HD talking-head synthesis with controllable output resolution and frame rate.
  • Open-Source Ecosystem — Public code, checkpoints and simple CLI/GUI tools for quick integration into your pipeline.
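
As a rough illustration of the audio-conditioned diffusion feature listed above, audio conditioning is commonly injected by letting the noisy video latents cross-attend to audio embeddings. The module below is a generic sketch of that mechanism, not the repository's actual code; the class name, dimensions, and residual layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Generic cross-attention block: video latents attend to audio features."""

    def __init__(self, latent_dim=320, audio_dim=384, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, audio_feats):
        # latents: (B, N_latent_tokens, latent_dim) noisy video latents
        # audio_feats: (B, N_audio_tokens, audio_dim) per-frame audio embeddings
        attended, _ = self.attn(query=latents, key=audio_feats, value=audio_feats)
        return self.norm(latents + attended)  # residual connection
```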

🚀 How to Use

  1. Prepare Source Video
    Provide a clear talking-head clip (.mp4) of the identity you want to animate. Upload a video with a resolution higher than 480p; resolutions of 720p, 1080p, or 4K are recommended.

    • Face should be visible and mostly unobstructed
    • Stable framing (minimal extreme motion) works best
  2. Provide Target Audio
    Upload the speech you want the subject to say (e.g., .wav, .mp3).

    • Use clean audio with minimal background noise
    • Trim leading/trailing silence if possible
  3. Run Inference
    The system will generate a lip-synced talking-head video aligned with your audio. A minimal request sketch follows below.
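
If you prefer to run the model programmatically rather than through the web form, a request might look roughly like the sketch below. The endpoint path, field names, and authentication scheme are assumptions for illustration only; consult the WaveSpeed API documentation for the actual schema.

```python
import os
import requests

API_KEY = os.environ["WAVESPEED_API_KEY"]  # assumed auth scheme
# Hypothetical endpoint path, shown only to sketch the request shape.
ENDPOINT = "https://api.wavespeed.ai/api/v3/wavespeed-ai/latentsync"

# Hypothetical payload: a source talking-head video and the target speech
# audio, both referenced by URL. Field names are illustrative, not the
# documented schema.
payload = {
    "video": "https://example.com/source_talking_head.mp4",
    "audio": "https://example.com/target_speech.wav",
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # typically a task id or a result URL to poll
```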

💰 Pricing

Minimum price: $0.15

  • If the input audio is shorter than 5 seconds, the minimum price of $0.15 applies.
  • For longer audio, the price is adapted based on the duration of the input audio (see the estimator sketch below).
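
As a rough guide to how the duration-based rule above could be applied, the helper below takes the audio length and a per-second rate and enforces the $0.15 minimum. This is one plausible reading of the rule, and the per-second rate is not stated in this README, so it is left as a placeholder to fill in from the pricing page.

```python
def estimate_cost(audio_seconds: float, rate_per_second: float) -> float:
    """Rough cost estimate: duration-based price with a $0.15 minimum.

    `rate_per_second` is a placeholder; the actual rate is not stated here
    and must be taken from the provider's pricing page.
    """
    MINIMUM_PRICE = 0.15  # applies when the audio is shorter than 5 seconds
    return max(MINIMUM_PRICE, audio_seconds * rate_per_second)
```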

💡 Pro Tips

  • Use high-quality, well-lit source videos with a clear view of the mouth.
  • Keep audio clean and dry — avoid heavy music, echo, and strong background noise.
  • For long speeches, consider segmenting audio into shorter chunks to improve stability and resource usage (a splitting sketch follows these tips).
  • Match the frame rate of the output video to your target platform (e.g., 24/25/30 FPS).
  • If you encounter artifacts, try:
    • Slightly lowering resolution
    • Increasing sampling steps
    • Choosing a video segment where the head is more stable
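
For the tip on segmenting long speeches, the snippet below cuts an audio file into fixed-length chunks with pydub (one possible tool, not required by the service; it needs ffmpeg installed). The chunk length is an illustrative choice to tune for your content.

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

CHUNK_SECONDS = 30  # illustrative chunk length

audio = AudioSegment.from_file("long_speech.wav")
chunk_ms = CHUNK_SECONDS * 1000

for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start:start + chunk_ms]
    chunk.export(f"speech_chunk_{i:02d}.wav", format="wav")
    print(f"wrote speech_chunk_{i:02d}.wav ({len(chunk) / 1000:.1f}s)")
```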