WaveSpeed.ai
digital-human

LTX-2 19b

wavespeed-ai/ltx-2-19b/lipsync

LTX-2 Lipsync generates synchronized talking-head videos from a reference image and audio input. Powered by the 19B DiT architecture, it produces high-fidelity lip-synced videos with natural head movements. It is available through a ready-to-use REST inference API with no cold starts and affordable pricing.

Your request will cost $0.10 per run.

With $1 you can run this model approximately 10 times.

README

LTX-2 19B Lipsync

LTX-2 Lipsync is an audio-driven digital human model that generates synchronized talking head videos from a reference image and audio input. Powered by the 19B-parameter DiT (Diffusion Transformer) architecture, it produces high-fidelity lip-synced videos with natural head movements and expressions.

Why Choose This?

  • Audio-driven generation: Provide an audio file and an optional reference image; the model handles lip-sync, head motion, and expressions automatically.

  • High-fidelity output: Leverages the 19B-parameter LTX-2 architecture for detailed, temporally consistent video with natural mouth movements.

  • Flexible resolution: Supports 480p, 720p, and 1080p outputs to balance quality and cost.

  • Variable duration: Video length is determined automatically by the audio duration (5 to 20 seconds).

Parameters

| Parameter | Required | Description |
|---|---|---|
| audio | Yes | Audio file URL for lip-sync (determines video length) |
| image | No | Reference portrait image (JPG or PNG) |
| prompt | No | Optional text to guide generation style |
| resolution | No | Output resolution: 480p, 720p (default), or 1080p |
| seed | No | Random seed for reproducibility (-1 for random) |
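
The parameter table can be mirrored in a small client-side helper. This is a minimal sketch, not part of any official SDK: `build_payload` is a hypothetical name, the JSON shape is an assumption, and passing `image` as a URL string is also assumed. Only the field names, the default resolution, and the allowed values come from the table above.

```python
from typing import Optional

# Allowed values taken from the parameter table.
ALLOWED_RESOLUTIONS = ("480p", "720p", "1080p")


def build_payload(
    audio: str,
    image: Optional[str] = None,
    prompt: Optional[str] = None,
    resolution: str = "720p",
    seed: int = -1,
) -> dict:
    """Assemble a request body from the documented parameters.

    Hypothetical helper: the exact body the service expects is an assumption.
    """
    if not audio:
        raise ValueError("audio is required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")

    payload = {"audio": audio, "resolution": resolution, "seed": seed}
    # Optional fields are only included when the caller supplies them.
    if image is not None:
        payload["image"] = image
    if prompt is not None:
        payload["prompt"] = prompt
    return payload
```

Validating locally catches a mistyped resolution before a paid request is submitted.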

Resolution Options

| Resolution | Best For |
|---|---|
| 480p | Fast previews, iteration, lowest cost |
| 720p | Balanced quality and cost (default) |
| 1080p | Final delivery, maximum detail |

How to Use

  1. Upload your audio — the audio file that drives lip-sync and determines video duration.
  2. Upload your image (optional) — the reference portrait that defines the speaker's appearance.
  3. Write your prompt (optional) — describe any style or motion preferences.
  4. Select resolution — 480p for iteration, 720p for balance, 1080p for final output.
  5. Run — submit and download the lip-synced video.
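
For programmatic use, the same steps map onto a single HTTP POST. The sketch below only constructs the request and does not send it. The base URL is a deliberate placeholder, and the URL layout, Bearer authentication, and JSON body are assumptions rather than the documented API; `build_request` is a hypothetical name. Only the model identifier comes from this page.

```python
import json
import urllib.request

MODEL_ID = "wavespeed-ai/ltx-2-19b/lipsync"  # model identifier from this page


def build_request(
    api_key: str,
    payload: dict,
    base_url: str = "https://api.example.invalid",  # placeholder, not the real host
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the job parameters.

    Assumptions: Bearer-token auth, a JSON body, and a URL of the form
    <base_url>/<model id>. Check the provider's API reference for the
    actual values.
    """
    url = f"{base_url.rstrip('/')}/{MODEL_ID}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the real endpoint is known, passing the result to `urllib.request.urlopen` would submit the job; how the finished video is retrieved is not described on this page.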

Pricing

| Resolution | 5s | 10s | 15s | 20s |
|---|---|---|---|---|
| 480p | $0.075 | $0.15 | $0.225 | $0.30 |
| 720p | $0.10 | $0.20 | $0.30 | $0.40 |
| 1080p | $0.15 | $0.30 | $0.45 | $0.60 |

Billing Rules

  • Base price: $0.10 (720p, 5 seconds)
  • Duration: determined by audio length; billing covers a minimum of 5 seconds and a maximum of 20 seconds
  • Resolution multiplier: 480p = 0.75×, 720p = 1×, 1080p = 1.5×
  • Total cost = billed duration (in seconds) × $0.10 × resolution_multiplier / 5
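
The billing rules can be checked with a short estimator. This is a sketch under one assumption not stated above: durations between the 5-second steps are billed linearly per second, as the formula suggests, rather than rounded up to the next step. `estimate_cost` is a hypothetical name.

```python
# Constants taken from the billing rules.
BASE_PRICE_USD = 0.10  # 720p, 5 seconds
RESOLUTION_MULTIPLIER = {"480p": 0.75, "720p": 1.0, "1080p": 1.5}
MIN_BILLED_SECONDS = 5
MAX_BILLED_SECONDS = 20


def estimate_cost(audio_seconds: float, resolution: str = "720p") -> float:
    """Estimate the price in USD of one run.

    Assumes linear per-second billing between the 5 s minimum and the
    20 s maximum; the service may round differently.
    """
    if resolution not in RESOLUTION_MULTIPLIER:
        raise ValueError(f"unknown resolution: {resolution}")
    # Clamp the audio length to the billed range.
    billed = min(max(audio_seconds, MIN_BILLED_SECONDS), MAX_BILLED_SECONDS)
    cost = billed * BASE_PRICE_USD * RESOLUTION_MULTIPLIER[resolution] / 5
    return round(cost, 4)
```

For example, `estimate_cost(10, "1080p")` reproduces the $0.30 entry in the pricing table, and a 3-second clip at 720p is billed at the 5-second minimum of $0.10.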

Best Use Cases

  • Digital Avatars — Create talking head videos for virtual presenters and avatars.
  • Content Creation — Generate lip-synced videos for social media and marketing.
  • Localization — Dub existing content with new audio while maintaining visual consistency.
  • Accessibility — Create sign language or narrated content with synchronized visuals.
  • Education — Produce instructional videos with talking head presenters.

Pro Tips

  • Use clear, high-quality audio for best lip-sync results.
  • Provide a front-facing portrait image with visible mouth for optimal generation.
  • Use high-quality, sharp, well-lit portrait images for best results.
  • Iterate at 480p to dial in results, then render at higher resolution for final output.
  • Use a fixed seed when comparing variations to isolate changes.

Notes

  • Maximum video duration is 20 seconds (determined by audio length).
  • Audio longer than 20 seconds will be truncated.
  • The aspect ratio of the output video is influenced by your input image.
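
Because audio beyond 20 seconds is truncated, it can help to measure a clip before submitting it. The sketch below uses Python's standard `wave` module and therefore covers only uncompressed WAV files, which is an assumption about your input; other audio formats would need a different reader. The function names are hypothetical.

```python
import wave
from typing import Tuple

MAX_SECONDS = 20  # maximum video duration from the notes above


def wav_duration_seconds(source) -> float:
    """Return the length of a WAV file in seconds.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio_length(source) -> Tuple[float, bool]:
    """Return (duration, will_be_truncated) for a WAV input."""
    duration = wav_duration_seconds(source)
    return duration, duration > MAX_SECONDS
```

If the second value is true, trimming the clip or splitting it into segments of at most 20 seconds avoids losing the tail of the audio.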
