Wan 2.1 Multitalk
Playground
Try it on WavespeedAI!
MultiTalk (WAN 2.1) is an audio-driven AI model that turns a single image and an audio track into talking or singing conversational videos. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.
Features
MultiTalk
Transform static photos into dynamic speaking videos with MultiTalk — a revolutionary audio-driven video generation framework by MeiGen-AI. Unlike traditional talking head methods, MultiTalk animates full conversations with realistic lip synchronization, natural body movements, and even multi-person interactions.
Why It Looks Great
- Perfect lip sync: Advanced audio encoding (Wav2Vec) captures speech nuances including rhythm, tone, and pronunciation for precise synchronization.
- Multi-person support: Generate videos with multiple speakers interacting naturally in the same scene.
- Full body animation: Goes beyond facial movements to include natural gestures, expressions, and body language.
- Dynamic camera control: Powered by Uni3C controlnet for subtle camera movements and professional cinematography.
- Prompt-guided generation: Follow text instructions to control scene, pose, and behavior while maintaining audio sync.
- Extended duration: Support for videos up to 10 minutes long.
How It Works
MultiTalk combines three powerful technologies for optimal results:
| Component | Function |
|---|---|
| MultiTalk Core | Audio-to-motion synthesis with perfect lip synchronization |
| Wan2.1 | Video diffusion model for realistic human anatomy, expressions, and movements |
| Uni3C | Camera controlnet for dynamic, professional-looking scene control |
How to Use
- Upload your image — provide a photo with one or more people.
- Upload your audio — add the speech or song you want the subject to perform.
- Write your prompt (optional) — describe the scene, pose, or behavior you want.
- Set duration — choose your desired video length.
- Run — click the button to generate.
- Download — preview and save your talking video.
Pricing
Per 5-second billing based on audio duration. Maximum video length: 10 minutes.
| Metric | Cost |
|---|---|
| Per 5 seconds | $0.15 |
Billing Rules
- Minimum charge: 5 seconds ($0.15)
- Maximum duration: 600 seconds (10 minutes)
- Billed duration: Audio length rounded up to nearest 5-second increment
- Total cost: (Billed duration ÷ 5) × $0.15
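In shell terms, the rules above work out to roughly the following sketch (billed_cost is a hypothetical helper for illustration, not part of the API):
# Compute the billed cost from audio length in seconds.
billed_cost() {
  local secs=$1
  (( secs < 5 )) && secs=5                # minimum charge: 5 seconds
  (( secs > 600 )) && secs=600            # maximum duration: 600 seconds
  local units=$(( (secs + 4) / 5 ))       # round up to the nearest 5-second increment
  awk -v u="$units" 'BEGIN { printf "$%.2f\n", u * 0.15 }'
}
billed_cost 12   # prints $0.45, matching the examples below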
Examples
| Audio Length | Billed Duration | Calculation | Total Cost |
|---|---|---|---|
| 3s | 5s (minimum) | 5 ÷ 5 × $0.15 | $0.15 |
| 12s | 15s | 15 ÷ 5 × $0.15 | $0.45 |
| 30s | 30s | 30 ÷ 5 × $0.15 | $0.90 |
| 1m (60s) | 60s | 60 ÷ 5 × $0.15 | $1.80 |
| 5m (300s) | 300s | 300 ÷ 5 × $0.15 | $9.00 |
| 10m (600s) | 600s (maximum) | 600 ÷ 5 × $0.15 | $18.00 |
Best Use Cases
- Virtual Presentations — Create professional talking head videos from a single photo.
- Content Localization — Dub videos into different languages with perfect lip sync.
- Music & Performance — Generate singing videos with synchronized mouth movements.
- Conversational Content — Produce multi-person dialogue scenes for storytelling.
- Marketing & Advertising — Create spokesperson videos without filming sessions.
Related Models
- Wan2.1 Text-to-Video / Image-to-Video — For general video generation without audio sync.
- Uni3C Camera Control — For creating custom camera motion transfers.
Pro Tips for Best Results
- Use clear, front-facing photos with visible faces for the best lip synchronization.
- High-quality audio with minimal background noise produces more accurate results.
- For multi-person scenes, ensure all faces are clearly visible in the source image.
- Add scene descriptions in your prompt to enhance the visual context and atmosphere.
- Start with shorter clips to test synchronization before generating longer videos.
Notes
- If using URLs, ensure they are publicly accessible.
- Processing time scales with video duration and complexity.
- Best results come from clear speech audio and well-lit portrait images.
- For singing content, ensure the audio has clear vocal tracks.
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
# Submit the task (image and audio are required; the URLs below are placeholders)
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/wan-2.1/multitalk" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "image": "https://example.com/portrait.jpg",
    "audio": "https://example.com/speech.mp3",
    "prompt": "A person speaking naturally to the camera",
    "seed": -1
}'
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
Parameters
Task Submission Parameters
Request Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| image | string | Yes | - | - | The image for generating the output. |
| audio | string | Yes | - | - | The audio for generating the output. |
| prompt | string | No | - | - | The positive prompt for the generation. |
| seed | integer | No | -1 | -1 ~ 2147483647 | The random seed to use for the generation. -1 means a random seed will be used. |
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
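For orientation, the submission response shape implied by the table above looks roughly like this (values are illustrative, not real output):
{
  "code": 200,
  "message": "success",
  "data": {
    "id": "abcdef1234567890",
    "model": "wavespeed-ai/wan-2.1/multitalk",
    "outputs": [],
    "urls": {
      "get": "https://api.wavespeed.ai/api/v3/predictions/abcdef1234567890/result"
    },
    "has_nsfw_contents": [],
    "status": "created",
    "created_at": "2023-04-01T12:34:56.789Z",
    "error": "",
    "timings": { "inference": 0 }
  }
}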
Result Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | Yes | - | Task ID |
Result Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier for the prediction (matches the task ID in the request URL) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
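Once data.status is completed, data.outputs holds the generated video URL(s). A download sketch, assuming jq and a single output:
# Fetch the first output URL and download the finished video.
url=$(curl -s --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" | jq -r '.data.outputs[0]')
curl -o multitalk_output.mp4 "$url"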