WaveSpeed.ai
digital-human

LTX-2.3 Lipsync

wavespeed-ai/ltx-2.3/lipsync

LTX-2.3 Lipsync generates talking-head videos from audio, with synchronized lip movements and natural facial expressions. It is built on a DiT-based architecture with improved audio-visual alignment. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

Your request will cost $0.1 per run.

For $1 you can run this model approximately 10 times.

README

LTX-2.3 Lipsync

LTX-2.3 Lipsync is an advanced AI model that generates talking head videos from audio and an optional reference image. Built on the LTX-2.3 DiT-based architecture with improved audio-visual quality, it creates realistic lip-synced videos that match your audio input.

Why Choose This?

  • Improved quality: Enhanced audio-visual alignment, with more accurate lip sync and natural facial movements.

  • Audio-driven generation: Automatically generates video with synchronized lip movements from the audio input.

  • Optional reference image: Provide a portrait image to use as the base, or let the model use a default portrait.

  • Flexible resolution: Supports 480p, 720p, and 1080p output to balance quality and cost.

  • Automatic duration: Video length automatically matches the audio duration (5-20 seconds).

Parameters

Parameter  | Required | Description
---------- | -------- | -----------
audio      | Yes      | Audio file URL; its duration determines the video length (5-20 s)
image      | No       | Reference portrait image (optional)
prompt     | No       | Text prompt to guide generation style and motion
resolution | No       | Output resolution: 480p, 720p (default), or 1080p
seed       | No       | Random seed for reproducibility (-1 for random)
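The parameter table maps directly onto a request body. Below is a minimal Python sketch, not an official client. The field names come from the table, but the helper `build_payload` is hypothetical. Rejecting seeds below -1 is an assumption, not documented behavior.

```python
from typing import Optional

# Allowed values taken from the parameter table above.
ALLOWED_RESOLUTIONS = ("480p", "720p", "1080p")


def build_payload(
    audio: str,
    image: Optional[str] = None,
    prompt: Optional[str] = None,
    resolution: str = "720p",
    seed: int = -1,
) -> dict:
    """Assemble LTX-2.3 Lipsync parameters (hypothetical helper).

    Only `audio` is required. Optional fields are left out when unset so the
    service can apply its own defaults (for example, the default portrait).
    """
    if not audio:
        raise ValueError("audio is required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    # ASSUMPTION: -1 means "random"; other negative seeds are treated as invalid.
    if seed < -1:
        raise ValueError("seed must be -1 (random) or a non-negative integer")

    payload = {"audio": audio, "resolution": resolution, "seed": seed}
    if image:
        payload["image"] = image
    if prompt:
        payload["prompt"] = prompt
    return payload
```

For example, `build_payload("https://example.com/voice.mp3", prompt="calm, friendly presenter")` yields a 720p request with a random seed and the default portrait.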

Resolution Options

Resolution | Best For
---------- | --------
480p       | Fast previews, iteration, lowest cost
720p       | Balanced quality and cost (default)
1080p      | Final delivery, maximum detail

How to Use

  1. Upload your audio — the audio track that drives the video (5-20 seconds).
  2. Upload reference image (optional) — a portrait to use as the base character.
  3. Add prompt (optional) — describe the style or motion you want.
  4. Select resolution — 480p for iteration, 720p for balance, 1080p for final output.
  5. Run — submit and download the lip-synced video.
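Step 5 amounts to sending these parameters to the REST API. Here is a minimal sketch that uses only Python's standard library. It assumes a JSON POST body and Bearer-token authentication. `ENDPOINT` is a placeholder, not the real WaveSpeed URL, and `build_request` is a hypothetical helper. Check the official API reference for the actual endpoint, headers, and response format.

```python
import json
import urllib.request

# ASSUMPTION: placeholder URL; replace with the endpoint from the WaveSpeed API docs.
ENDPOINT = "https://api.example.invalid/wavespeed-ai/ltx-2.3/lipsync"


def build_request(payload: dict, api_key: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Wrap the parameters chosen in steps 1-4 into a POST request (step 5).

    ASSUMPTION: JSON body and "Authorization: Bearer <key>" header.
    The request is only constructed here, not sent.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request would then be `urllib.request.urlopen(req)`. This page does not say how results come back, for example synchronously or via a job ID to poll.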

Pricing

Pricing is based on audio duration (automatically detected):

Resolution | 5 s   | 10 s  | 15 s  | 20 s
---------- | ----- | ----- | ----- | -----
480p       | $0.10 | $0.20 | $0.30 | $0.40
720p       | $0.15 | $0.30 | $0.45 | $0.60
1080p      | $0.20 | $0.40 | $0.60 | $0.80
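The table implies a flat per-second rate for each resolution: $0.02/s at 480p, $0.03/s at 720p, and $0.04/s at 1080p. The sketch below estimates the cost of a run from those rates. The page only lists 5, 10, 15, and 20 seconds, so treating the durations in between as linear is an assumption.

```python
from decimal import Decimal

# Per-second rates derived from the pricing table (e.g. 480p: $0.10 / 5 s).
RATE_PER_SECOND = {
    "480p": Decimal("0.02"),
    "720p": Decimal("0.03"),
    "1080p": Decimal("0.04"),
}


def estimate_cost(resolution: str, audio_seconds: float) -> Decimal:
    """Estimate the USD price of one run.

    ASSUMPTION: billing scales linearly with audio duration between the
    tabulated points. Decimal avoids float rounding in currency math.
    """
    if not 5 <= audio_seconds <= 20:
        raise ValueError("audio must be 5-20 seconds long")
    cost = RATE_PER_SECOND[resolution] * Decimal(str(audio_seconds))
    return cost.quantize(Decimal("0.01"))
```

Under that assumption, `estimate_cost("720p", 12)` returns `Decimal("0.36")`.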

Best Use Cases

  • Talking Head Videos — Generate spokesperson videos from audio recordings.
  • Content Localization — Create videos in multiple languages from audio tracks.
  • Virtual Presenters — Generate AI presenters for training, marketing, or education.
  • Audio-to-Video — Convert podcasts or audio content into video format.
  • Character Animation — Bring portraits to life with synchronized speech.

Pro Tips

  • Audio quality directly affects lip-sync accuracy; use clear audio.
  • Provide a frontal portrait image for best results.
  • Use the prompt to guide facial expressions and style.
  • Iterate at 480p to verify results, then render at a higher resolution.
  • Audio duration is detected automatically; there is no need to specify it.
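The tip to iterate at 480p before rendering at a higher resolution can be scripted by pinning the seed, so each render is reproducible. The helper below is hypothetical. It assumes parameters are held in a plain dict keyed by the table's names and that seeds are non-negative 31-bit integers. The page does not promise that a 480p preview and a 1080p render with the same seed will look identical.

```python
import random


def preview_then_final(base: dict, final_resolution: str = "1080p") -> tuple:
    """Return (preview, final) parameter dicts that share one fixed seed.

    A seed of -1 requests a random seed, so it is replaced with a concrete
    value that both runs reuse.
    """
    seed = base.get("seed", -1)
    if seed == -1:
        seed = random.randint(0, 2**31 - 1)  # ASSUMPTION: valid seed range
    preview = {**base, "resolution": "480p", "seed": seed}
    final = {**base, "resolution": final_resolution, "seed": seed}
    return preview, final
```

Pinning the seed in the dict keeps the preview run repeatable. You can then regenerate that exact preview, or reuse the seed for the final pass.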

Notes

  • Audio duration must be between 5 and 20 seconds.
  • If no reference image is provided, a default portrait will be used.
  • Video length automatically matches audio duration.
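Audio must be between 5 and 20 seconds long, so a local pre-check can save a failed run. This sketch reads the duration of an uncompressed WAV file with Python's standard `wave` module. It is an illustration only: the page does not list accepted audio formats, and compressed formats such as MP3 would need a different library.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds (stdlib only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio_length(path: str, min_s: float = 5.0, max_s: float = 20.0) -> float:
    """Return the clip duration, or raise ValueError outside the 5-20 s window."""
    duration = wav_duration_seconds(path)
    if not min_s <= duration <= max_s:
        raise ValueError(f"audio is {duration:.2f} s; expected {min_s}-{max_s} s")
    return duration
```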
