image-to-video

wavespeed-ai/infinitetalk

InfiniteTalk is an audio-driven conversational AI model for video generation. It creates talking or singing videos from a single image and an audio input. Pricing on this endpoint starts at $0.15 per 5 seconds of 480p video or $0.30 per 5 seconds of 720p video, and the maximum generation length is 10 minutes.

README

InfiniteTalk

What is InfiniteTalk?

InfiniteTalk produces videos with precise lip sync and aligns head, face, and body movements with the audio. It preserves the subject's identity across videos of unlimited length and also supports image-to-video generation, turning static photos into lively talking or singing videos.

Why it looks great

  • Accurate lip synchronization: aligns lip motion precisely with audio, preserving natural rhythm and pronunciation.

  • Full-body coherence: captures head movements, facial expressions, and posture changes beyond the lips.

  • Identity preservation: maintains consistent facial identity and visual style across frames.

  • Image-to-video capability: turns static photos into realistic speaking or singing videos.

  • Instruction following: accepts text prompts to control scene, pose, or behavior while syncing to audio.

Pricing

| Output Resolution | Cost per 5 Seconds | Max Length |
|-------------------|--------------------|------------|
| 480p              | $0.15              | 10 minutes |
| 720p              | $0.30              | 10 minutes |

Billing Rules

  • Minimum charge: 5 seconds

  • Per-second rate = (price per 5 seconds) ÷ 5

  • Billed duration = video length in seconds (rounded up), with a 5-second minimum

  • Total cost = billed duration × per-second rate (by output resolution)
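The billing rules above can be sketched in a few lines of Python. This is a minimal illustration, not WaveSpeed's billing code. The function names `billed_seconds` and `estimate_cost` are hypothetical. The sketch also assumes that lengths above the 10-minute maximum should be rejected. `Decimal` keeps the cent amounts exact.

```python
import math
from decimal import Decimal

# Price per 5 seconds of output video, taken from the pricing table.
PRICE_PER_5S = {"480p": Decimal("0.15"), "720p": Decimal("0.30")}
MIN_BILLED_SECONDS = 5      # minimum charge: 5 seconds
MAX_SECONDS = 10 * 60       # assumption: reject anything over the 10-minute maximum


def billed_seconds(video_seconds: float) -> int:
    """Round the video length up to whole seconds, with a 5-second minimum."""
    if video_seconds <= 0 or video_seconds > MAX_SECONDS:
        raise ValueError("video length must be in (0, 600] seconds")
    return max(MIN_BILLED_SECONDS, math.ceil(video_seconds))


def estimate_cost(video_seconds: float, resolution: str) -> Decimal:
    """Total cost = billed duration x (price per 5 seconds / 5)."""
    per_second = PRICE_PER_5S[resolution] / 5
    return billed_seconds(video_seconds) * per_second


print(estimate_cost(7.3, "720p"))  # 8 s x $0.06 -> 0.48
print(estimate_cost(3, "480p"))    # 5-second minimum x $0.03 -> 0.15
```

For example, a 7.3-second 720p clip is billed as 8 seconds at $0.06 per second, which comes to $0.48. A full 10-minute 480p clip is billed as 600 seconds at $0.03 per second, which comes to $18.00.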

How to Use

  1. Upload the audio file.
  2. Upload the image (the person to animate).
  3. (Optional) Upload a mask_image to specify which regions can move.
  4. (Optional) Add a prompt to guide expression, style, or pose.
  5. Select the resolution (480p or 720p).
  6. Set the seed (use a fixed value for reproducible results).
  7. Submit the job and download the result once it's ready.
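Before submitting a job, a client might collect the inputs from these steps into a single request body. The sketch below is hypothetical. The JSON keys `audio`, `image`, `mask_image`, `prompt`, `resolution`, and `seed` are guesses based on the step names, and the helper `build_infinitetalk_payload` is illustrative. Check the official API schema for the real field names, the endpoint URL, and authentication. The sketch only builds and validates the body and makes no network call.

```python
import json
from typing import Optional

# Resolutions offered on this endpoint, per the pricing table.
ALLOWED_RESOLUTIONS = ("480p", "720p")


def build_infinitetalk_payload(
    audio_url: str,
    image_url: str,
    resolution: str = "480p",
    seed: Optional[int] = None,
    mask_image_url: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict:
    """Assemble a request body; all key names here are assumptions."""
    if not audio_url or not image_url:
        raise ValueError("both an audio file and an image are required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    payload = {"audio": audio_url, "image": image_url, "resolution": resolution}
    # Optional inputs are included only when the caller provides them.
    if mask_image_url is not None:
        payload["mask_image"] = mask_image_url  # regions allowed to move
    if prompt:
        payload["prompt"] = prompt              # guides expression, style, or pose
    if seed is not None:
        payload["seed"] = seed                  # fixed seed -> reproducible output
    return payload


body = build_infinitetalk_payload(
    "https://example.com/voice.wav",
    "https://example.com/portrait.png",
    resolution="720p",
    seed=42,
)
print(json.dumps(body))
```

The helper leaves optional inputs out of the body rather than sending empty values. This lets the service apply whatever defaults it uses for a missing mask or prompt.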

Note

  • Max clip length per job: up to 10 minutes
  • Processing speed: approximately 10–30 seconds of wall time per 1 second of video (varies by resolution and queue load)
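The processing-speed note translates into a rough wait-time estimate. This sketch multiplies the clip length by the quoted 10 to 30 seconds of wall time per second of video. The function name `estimate_wall_time` is hypothetical. Real times vary with resolution and queue load, so treat the result as a range, not a guarantee.

```python
from typing import Tuple

# Quoted range: about 10-30 seconds of wall time per 1 second of output video.
WALL_SECONDS_PER_VIDEO_SECOND = (10, 30)


def estimate_wall_time(video_seconds: float) -> Tuple[float, float]:
    """Return (low, high) estimated processing time in seconds."""
    low, high = WALL_SECONDS_PER_VIDEO_SECOND
    return video_seconds * low, video_seconds * high


low, high = estimate_wall_time(60)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes")  # 10-30 minutes
```

A 1-minute clip therefore takes roughly 10 to 30 minutes to process. A full 10-minute clip could take about 100 to 300 minutes. A client that polls for results could use the upper bound as a generous timeout.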
