Wan 2.1 Multitalk | AI Digital Human API

MultiTalk

Transform static photos into speaking videos with MultiTalk, an audio-driven video generation framework by MeiGen-AI. Unlike traditional talking-head methods, MultiTalk animates full conversations with realistic lip synchronization, natural body movements, and even multi-person interactions.

Why It Looks Great

Perfect lip sync: Advanced audio encoding (Wav2Vec) captures speech nuances including rhythm, tone, and pronunciation for precise synchronization.
Multi-person support: Generate videos with multiple speakers interacting naturally in the same scene.
Full body animation: Goes beyond facial movements to include natural gestures, expressions, and body language.
Dynamic camera control: Powered by the Uni3C ControlNet for subtle camera movements and professional cinematography.
Prompt-guided generation: Follow text instructions to control scene, pose, and behavior while maintaining audio sync.
Extended duration: Support for videos up to 10 minutes long.

How It Works

MultiTalk combines three components:

Component	Function
MultiTalk Core	Audio-to-motion synthesis with perfect lip synchronization
Wan2.1	Video diffusion model for realistic human anatomy, expressions, and movements
Uni3C	Camera ControlNet for dynamic, professional-looking scene control

How to Use

Upload your image — provide a photo with one or more people.
Upload your audio — add the speech or song you want the subject to perform.
Write your prompt (optional) — describe the scene, pose, or behavior you want.
Set duration — choose your desired video length.
Run — click the button to generate.
Download — preview and save your talking video.
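The steps above map onto a small JSON request body. The sketch below assembles one in Python; `build_multitalk_payload` is a hypothetical helper, not part of any WaveSpeedAI SDK. Field names follow the documented key inputs (`image`, `audio`, `prompt`, `seed`). Treating `seed: -1` as "random" and leaving out an empty prompt are assumptions. The documented inputs do not list a duration field, so none is sent here.

```python
import json
from typing import Optional


def build_multitalk_payload(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    seed: int = -1,
) -> dict:
    """Assemble a request body from the How to Use steps (hypothetical helper)."""
    # Steps 1 and 2: both an image and an audio input are required.
    if not image_url or not audio_url:
        raise ValueError("both an image and an audio input are required")
    payload = {"image": image_url, "audio": audio_url, "seed": seed}
    # Step 3: the prompt is optional; omit the key when it is empty
    # (assumption: the API treats a missing prompt as "no prompt").
    if prompt:
        payload["prompt"] = prompt
    return payload


if __name__ == "__main__":
    body = build_multitalk_payload(
        "https://example.com/your-input.jpg",
        "https://example.com/your-audio.mp3",
        prompt="A woman speaks warmly to the camera in a bright studio",
    )
    print(json.dumps(body, indent=2))
```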

Pricing

Billing is per 5-second increment of audio duration. Maximum video length: 10 minutes.

Metric	Cost
Per 5 seconds	$0.15

Billing Rules

Minimum charge: 5 seconds ($0.15)
Maximum duration: 600 seconds (10 minutes)
Billed duration: Audio length rounded up to the nearest 5-second increment
Total cost: (Billed duration ÷ 5) × $0.15

Examples

Audio Length	Billed Duration	Calculation	Total Cost
3s	5s (minimum)	5 ÷ 5 × $0.15	$0.15
12s	15s	15 ÷ 5 × $0.15	$0.45
30s	30s	30 ÷ 5 × $0.15	$0.90
1m (60s)	60s	60 ÷ 5 × $0.15	$1.80
5m (300s)	300s	300 ÷ 5 × $0.15	$9.00
10m (600s)	600s (maximum)	600 ÷ 5 × $0.15	$18.00
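The billing rules reduce to a ceiling on 5-second increments. Here is a minimal Python sketch, assuming the rules exactly as listed above. The helper names are hypothetical, and rejecting audio longer than 600 seconds (rather than truncating it) is an assumption. `Decimal` keeps the dollar amounts free of floating-point rounding.

```python
import math
from decimal import Decimal

PRICE_PER_INCREMENT = Decimal("0.15")  # USD per billed 5-second increment
INCREMENT_SECONDS = 5
MAX_SECONDS = 600  # 10-minute maximum


def billed_seconds(audio_seconds: float) -> int:
    """Round audio length up to the next 5-second increment (minimum 5 s)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    if audio_seconds > MAX_SECONDS:
        # Assumption: audio over the maximum is rejected, not truncated.
        raise ValueError(f"audio longer than {MAX_SECONDS} s is not supported")
    return math.ceil(audio_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def estimate_cost(audio_seconds: float) -> Decimal:
    """Total cost = (billed duration / 5) * $0.15."""
    increments = billed_seconds(audio_seconds) // INCREMENT_SECONDS
    return increments * PRICE_PER_INCREMENT


if __name__ == "__main__":
    # Reproduce the Examples table above.
    for seconds in (3, 12, 30, 60, 300, 600):
        print(f"{seconds}s audio -> {billed_seconds(seconds)}s billed, ${estimate_cost(seconds)}")
```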

Best Use Cases

Virtual Presentations — Create professional talking head videos from a single photo.
Content Localization — Dub videos into different languages with perfect lip sync.
Music & Performance — Generate singing videos with synchronized mouth movements.
Conversational Content — Produce multi-person dialogue scenes for storytelling.
Marketing & Advertising — Create spokesperson videos without filming sessions.

Related Models

Wan2.1 Text-to-Video / Image-to-Video — For general video generation without audio sync.
Uni3C Camera Control — For creating custom camera motion transfers.

Pro Tips for Best Results

Use clear, front-facing photos with visible faces for the best lip synchronization.
High-quality audio with minimal background noise produces more accurate results.
For multi-person scenes, ensure all faces are clearly visible in the source image.
Add scene descriptions in your prompt to enhance the visual context and atmosphere.
Start with shorter clips to test synchronization before generating longer videos.
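To test synchronization on a short clip first, you can cut the opening seconds from your audio before uploading it. This sketch uses Python's standard-library `wave` module, so it only handles uncompressed PCM WAV files (convert MP3 or other formats with a separate tool first). `trim_wav` is a hypothetical helper name.

```python
import wave


def trim_wav(src_path: str, dst_path: str, seconds: float) -> float:
    """Write the first `seconds` of a PCM WAV file to dst_path.

    Returns the length of the clip actually written, in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never ask for more frames than the source contains.
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width, and rate; the frame count is fixed up on close.
        dst.setparams(params)
        dst.writeframes(frames)
    return n_frames / params.framerate


# Usage (hypothetical file names): keep a 10-second test clip.
# trim_wav("speech.wav", "speech_10s.wav", 10)
```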

Notes

If using URLs, ensure they are publicly accessible.
Processing time scales with video duration and complexity.
Best results come from clear speech audio and well-lit portrait images.
For singing content, ensure the audio has clear vocal tracks.

Wan 2.1 Multitalk API — Quick start

Get a WaveSpeedAI API key, then send your input as JSON to POST https://api.wavespeed.ai/api/v3/wavespeed-ai/wan-2.1/multitalk. The response contains a prediction ID; poll GET https://api.wavespeed.ai/api/v3/predictions/{request_id}/result until status is "completed", then read the output URL from data.outputs[0]. Examples in cURL, Node.js, and Python follow.

HTTP example

# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/wan-2.1/multitalk" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "image": "https://example.com/your-input.jpg",
    "audio": "https://example.com/your-audio.mp3",
    "prompt": "A cinematic shot of a city at sunset, soft golden light",
    "seed": -1
  }'

# The response includes a prediction ID. Substitute it for {request_id} and poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].
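The submit-and-poll flow above can be sketched in Python with only the standard library. The endpoint URLs and `data.outputs[0]` come from this page. Reading the prediction ID from `data.id`, treating `"failed"` as a terminal status, and the 5-second interval and 15-minute timeout are assumptions (average generation time is about 276 seconds). Passing `fetch` and `sleep` in as parameters lets the loop be tested without network access.

```python
import json
import os
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "https://api.wavespeed.ai/api/v3"
MODEL_PATH = "wavespeed-ai/wan-2.1/multitalk"


def _request(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send one authenticated JSON request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {os.environ['WAVESPEED_API_KEY']}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def submit(payload: dict) -> str:
    """POST the inputs and return the prediction ID (assumed to be at data.id)."""
    resp = _request("POST", f"{BASE_URL}/{MODEL_PATH}", payload)
    return resp["data"]["id"]


def fetch_result(request_id: str) -> dict:
    """GET the current state of a prediction."""
    return _request("GET", f"{BASE_URL}/predictions/{request_id}/result")


def wait_for_output(
    request_id: str,
    fetch: Callable[[str], dict] = fetch_result,
    interval: float = 5.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed" and return data.outputs[0]."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch(request_id)["data"]
        status = data.get("status")
        if status == "completed":
            return data["outputs"][0]
        if status == "failed":  # assumption: terminal failure status
            raise RuntimeError(data.get("error") or "prediction failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {request_id} still {status!r} after {timeout}s")
        sleep(interval)


# Usage (needs WAVESPEED_API_KEY and network access):
# request_id = submit({
#     "image": "https://example.com/your-input.jpg",
#     "audio": "https://example.com/your-audio.mp3",
#     "seed": -1,
# })
# print(wait_for_output(request_id))
```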

Node.js example

// npm install wavespeed
const WaveSpeed = require('wavespeed');

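const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

// CommonJS modules (require) do not allow top-level await, so wrap the call in an async function.
async function main() {
  const result = await client.run("wavespeed-ai/wan-2.1/multitalk", {
    "image": "https://example.com/your-input.jpg",
    "audio": "https://example.com/your-audio.mp3",
    "prompt": "A cinematic shot of a city at sunset, soft golden light",
    "seed": -1
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});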

Python example

# pip install wavespeed
import wavespeed

output = wavespeed.run(
    "wavespeed-ai/wan-2.1/multitalk",
    {
        "image": "https://example.com/your-input.jpg",
        "audio": "https://example.com/your-audio.mp3",
        "prompt": "A cinematic shot of a city at sunset, soft golden light",
        "seed": -1,
    },
)

print(output["outputs"][0])  # → URL of the generated output
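The Download step can also be scripted. This minimal sketch streams the output URL printed above to a local file with `urllib` from the standard library. `download_output` is a hypothetical helper, and the `.mp4` filename in the usage comment is an assumption about the output format.

```python
import shutil
import urllib.request
from pathlib import Path


def download_output(url: str, dest: str) -> Path:
    """Stream the file at `url` to `dest`, creating parent folders as needed."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so long videos are not loaded into memory at once.
    with urllib.request.urlopen(url, timeout=60) as resp, open(path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return path


# Usage (assumed filename): download_output(output["outputs"][0], "videos/multitalk.mp4")
```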

Wan 2.1 Multitalk API — Frequently asked questions

What is the Wan 2.1 Multitalk API?

Wan 2.1 Multitalk is a talking-avatar model hosted by WaveSpeedAI and exposed as a REST API. MultiTalk (Wan 2.1) is an audio-driven model that turns a single image and an audio track into talking or singing conversational videos. WaveSpeedAI serves it as a ready-to-use REST inference API with no cold starts. You can call it programmatically or try it from the playground above.

How do I call the Wan 2.1 Multitalk API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/wavespeed-ai/wan-2.1-multitalk.

How much does Wan 2.1 Multitalk cost per run?

Wan 2.1 Multitalk costs $0.15 per 5 seconds of audio. The audio length is rounded up to the nearest 5-second increment, so the minimum charge is $0.15 (5 seconds) and a maximum-length 10-minute run costs $18.00. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Wan 2.1 Multitalk accept?

Key inputs: `prompt`, `image`, `audio`, `seed`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/wavespeed-ai/wan-2.1-multitalk.

How long does Wan 2.1 Multitalk take to generate?

Average end-to-end generation time on WaveSpeedAI is around 276 seconds per request — measured across recent runs. Queue time scales with global demand; live status is visible in the prediction record.

Can I use Wan 2.1 Multitalk outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (WaveSpeedAI). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.
