InfiniteTalk Multi
InfiniteTalk Multi is an audio-driven, multi-character conversational video generation model. It creates talking or singing videos from a single image and two audio inputs. The endpoint starts at $0.15 per 5 seconds of generated video (480p) and supports a maximum generation length of 10 minutes.
Features
InfiniteTalk Multi
If you want to try the video-to-video version, visit: https://wavespeed.ai/models/wavespeed-ai/infinitetalk/video-to-video
If you want to try the single-character version, visit: https://wavespeed.ai/models/wavespeed-ai/infinitetalk
What is InfiniteTalk?
Given an input video and an audio track, InfiniteTalk synthesizes a new video with accurate lip synchronization while also aligning head movements, body posture, and facial expressions with the audio. Unlike traditional dubbing methods, which focus solely on the lips, InfiniteTalk supports infinite-length video generation with consistent identity preservation. InfiniteTalk can also be used as an image-audio-to-video model, taking an image and an audio clip as input. In that mode it turns a static photo into a speaking video in which the person speaks or sings the supplied audio.
Pricing
The endpoint starts at $0.15 per 5 seconds of generated video at 480p, or $0.30 per 5 seconds at 720p, and supports a maximum generation length of 10 minutes.
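As a rough budgeting aid, the rates above can be turned into a small shell helper. This is only a sketch: the function name `estimate_cost_cents` is hypothetical, and rounding up to whole 5-second blocks is an assumption, since the pricing text does not say how partial blocks are billed.

```shell
# Estimate the cost of one generation in US cents.
# Rates are taken from the pricing text; rounding up to whole 5-second
# blocks is an ASSUMPTION, not documented billing behavior.
estimate_cost_cents() (
  resolution=$1
  seconds=$2
  case "$resolution" in
    480p) rate=15 ;;   # $0.15 per 5 seconds
    720p) rate=30 ;;   # $0.30 per 5 seconds
    *) echo "unsupported resolution: $resolution" >&2; exit 1 ;;
  esac
  # The documented maximum generation length is 10 minutes (600 seconds).
  if [ "$seconds" -le 0 ] || [ "$seconds" -gt 600 ]; then
    echo "duration must be between 1 and 600 seconds" >&2
    exit 1
  fi
  blocks=$(( (seconds + 4) / 5 ))   # ceiling division by 5
  echo $(( blocks * rate ))
)
```

Under these assumptions, `estimate_cost_cents 720p 12` counts three 5-second blocks and prints `90`. Treat the result as an estimate and check the actual charge in your account.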
How InfiniteTalk Works
InfiniteTalk combines an audio encoder with a video diffusion model so that it can interpret both the audio signal and the visual input.
Audio Analysis: InfiniteTalk uses a powerful audio encoder (Wav2Vec) to understand the nuances of speech, including rhythm, tone, and pronunciation patterns.
Visual Understanding: Built on the robust Wan2.1 video diffusion model, InfiniteTalk understands human anatomy, facial expressions, and body movements.
Perfect Synchronization: Through attention mechanisms, InfiniteTalk learns to align lip movements closely with the audio while maintaining natural facial expressions and body language.
Instruction Following: Unlike simpler methods, InfiniteTalk can follow text prompts to control the scene, pose, and overall behavior while maintaining audio synchronization.
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
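# Submit the task
# left_audio, right_audio and image are required. The values below are
# placeholders (assumed here to be URLs of your own hosted files).
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/infinitetalk/multi" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
"left_audio": "https://example.com/left.mp3",
"right_audio": "https://example.com/right.mp3",
"image": "https://example.com/two-people.jpg",
"prompt": "",
"order": "meanwhile",
"resolution": "480p",
"seed": -1
}'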
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
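Because the result is fetched with a separate request, a client usually has to poll until the task finishes. The sketch below is one way to do that, not an official client: the helper names `extract_status` and `poll_result`, the 5-second interval, and the default of 60 attempts are all assumptions. It reads the `data.status` field described in the response parameters with `sed`, which is a simplification; a real JSON parser is more robust.

```shell
# Print the value of the first "status" field found on stdin.
# Simplified text matching, not a full JSON parser.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll the result endpoint until the task is completed or failed.
# Usage: poll_result <requestId> [max_attempts]
# Requires WAVESPEED_API_KEY in the environment.
poll_result() (
  request_id=$1
  attempts=${2:-60}          # assumed default, adjust as needed
  while [ "$attempts" -gt 0 ]; do
    body=$(curl --silent --location \
      --header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
      "https://api.wavespeed.ai/api/v3/predictions/${request_id}/result")
    status=$(printf '%s\n' "$body" | extract_status)
    case "$status" in
      completed) printf '%s\n' "$body"; exit 0 ;;
      failed)    printf '%s\n' "$body" >&2; exit 1 ;;
    esac
    attempts=$(( attempts - 1 ))
    sleep 5                  # assumed polling interval
  done
  echo "timed out waiting for task ${request_id}" >&2
  exit 1
)
```

Calling `poll_result "$requestId"` prints the final response body on success and returns a non-zero status if the task fails or the attempts run out.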
Parameters
Task Submission Parameters
Request Parameters
Parameter | Type | Required | Default | Range | Description |
---|---|---|---|---|---|
left_audio | string | Yes | - | - | The audio of the person on the left, used to generate the output. |
right_audio | string | Yes | - | - | The audio of the person on the right, used to generate the output. |
image | string | Yes | - | - | The image used to generate the output. |
prompt | string | No | - | - | The prompt used to generate the output. |
order | string | No | meanwhile | meanwhile, left_right, right_left | The order in which the two audio sources play in the output video. "meanwhile" plays both at the same time; "left_right" plays the left audio first, then the right; "right_left" plays the right audio first, then the left. |
resolution | string | No | 480p | 480p, 720p | The resolution of the output video. |
seed | integer | No | -1 | -1 ~ 2147483647 | The random seed to use for the generation. -1 means a random seed will be used. |
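The optional parameters in the table can be checked locally before a request is sent. The following is a minimal sketch with a hypothetical helper name, `validate_request`; it applies only the defaults and ranges listed above and says nothing about how the server itself validates input.

```shell
# Validate the optional parameters against the table above.
# Usage: validate_request [order] [resolution] [seed]
# Missing arguments fall back to the documented defaults.
validate_request() (
  order=${1:-meanwhile}
  resolution=${2:-480p}
  seed=${3:--1}

  case "$order" in
    meanwhile|left_right|right_left) ;;
    *) echo "invalid order: $order" >&2; exit 1 ;;
  esac

  case "$resolution" in
    480p|720p) ;;
    *) echo "invalid resolution: $resolution" >&2; exit 1 ;;
  esac

  case "$seed" in
    -1) ;;                                   # -1 requests a random seed
    *[!0-9]*) echo "seed must be an integer: $seed" >&2; exit 1 ;;
    *)
      # Length check first so very long digit strings never reach the
      # numeric comparison.
      if [ "${#seed}" -gt 10 ] || [ "$seed" -gt 2147483647 ]; then
        echo "seed out of range: $seed" >&2
        exit 1
      fi
      ;;
  esac
)
```

For example, `validate_request left_right 720p 42` succeeds silently, while `validate_request meanwhile 1080p` prints an error to stderr and returns a non-zero status.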
Response Parameters
Parameter | Type | Description |
---|---|---|
code | integer | HTTP status code (e.g., 200 for success) |
message | string | Status message (e.g., “success”) |
data.id | string | Unique identifier for the prediction (the task ID) |
data.model | string | Model ID used for the prediction |
data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
data.urls | object | Object containing related API endpoints |
data.urls.get | string | URL to retrieve the prediction result |
data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
data.status | string | Status of the task: created, processing, completed, or failed |
data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
data.error | string | Error message (empty if no error occurred) |
data.timings | object | Object containing timing details |
data.timings.inference | integer | Inference time in milliseconds |
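Once a task is completed, the generated video is the first entry of `data.outputs`. The sketch below pulls that URL out of a response body using standard text tools. The helper name `first_output_url` is hypothetical, and the `sed` pattern assumes the field layout in the table above; it is a simplification rather than a full JSON parser.

```shell
# Print the first URL in the "outputs" array of a response read from stdin.
# Prints nothing when the array is empty (for example while the task is
# still processing).
first_output_url() {
  # Join lines first so pretty-printed JSON is handled as well.
  tr -d '\n' |
    sed -n 's/.*"outputs"[[:space:]]*:[[:space:]]*\[[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

A typical use is to pipe the body returned by the result endpoint into `first_output_url` and pass the printed URL to a download command.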