Skywork AI SkyReels V3 Standard Multi Avatar
Try it on WavespeedAI!
SkyReels V3 Standard Multi Avatar is a fast AI talking-avatar video generation model that creates multi-speaker avatar videos from one image, multiple audio tracks, and bounding boxes. It is offered as a ready-to-use REST inference API for group avatar videos, digital humans, virtual presenters, dialogue scenes, education content, marketing creatives, and professional multi-avatar video workflows, with simple integration, no cold starts, and affordable pricing.
Features
Skywork AI SkyReels V3 Standard Multi Avatar
Skywork AI SkyReels V3 Standard Multi Avatar generates a two-speaker avatar video from a single first-frame image plus separate left and right audio tracks. It is designed for dialogue scenes, interviews, presenter pairs, and other multi-character speaking workflows where each speaker is driven by their own audio input.
Why Choose This?
- Two-speaker avatar generation: Animate two speakers from a single scene image with separate audio tracks for each side.
- Independent left/right audio control: Upload different audio clips for the left and right speakers to drive each character separately.
- Prompt-guided scene behavior: Use a text prompt to guide mood, speaking style, scene setup, or camera feel.
- Speaker detection control: Choose whether speaker detection is based on `body` or `face`.
- Simple workflow: Upload one image, upload two audio clips, write a prompt, and generate the final conversation video.
- Production-ready API: Suitable for conversations, interviews, presenter scenes, and short-form multi-character avatar content.
Parameters
| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | Text prompt describing the scene, action, camera, or avatar behavior. |
| first_frame_image | Yes | Input image used as the first frame and visual source for the two-speaker scene. |
| left_audio | Yes | Audio for the speaker on the left side of the image. |
| right_audio | Yes | Audio for the speaker on the right side of the image. |
| bboxes_type | No | Bounding box target type for speaker detection. Supported values: body or face. Default: body. |
How to Use
- Upload the first-frame image — provide the scene image containing the two speakers.
- Upload left speaker audio — add the audio for the person on the left side of the image.
- Upload right speaker audio — add the audio for the person on the right side of the image.
- Write your prompt — describe the speaking behavior, mood, scene setup, or camera style.
- Choose speaker detection type (optional) — use `body` or `face` depending on how you want the model to identify each speaker.
- Submit — run the model and download the generated video.
Example Prompt
Let the two speakers talk naturally in a professional office setting, with subtle head movement, realistic facial expressions, and stable identity for both people.
Pricing
Pricing is based on the combined duration of both audio tracks.
Billing Rules
- Base price is $0.08 per second
- Total billed duration = left audio duration + right audio duration
- Total price = $0.08 × (left audio duration + right audio duration)
- `prompt`, `first_frame_image`, and `bboxes_type` do not affect pricing
Example Costs
| Left Audio | Right Audio | Total Billed Duration | Cost |
|---|---|---|---|
| 5s | 5s | 10s | $0.80 |
| 8s | 6s | 14s | $1.12 |
| 10s | 10s | 20s | $1.60 |
| 12s | 15s | 27s | $2.16 |
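The billing rule above is simple enough to check with a few lines of Python. This is a minimal sketch, not an official calculator: the function name `estimate_cost` is made up for illustration, and it assumes durations are billed exactly as given (the page does not say whether fractional seconds are rounded).

```python
from decimal import Decimal

# Rate taken from the billing rules above: $0.08 per second of audio.
PRICE_PER_SECOND = Decimal("0.08")


def estimate_cost(left_seconds, right_seconds):
    """Return the estimated price in USD for the two audio durations.

    Hypothetical helper: assumes no rounding of fractional seconds.
    """
    if left_seconds < 0 or right_seconds < 0:
        raise ValueError("audio durations must be non-negative")
    # The billed duration is the sum of both tracks, not the longer of the two.
    billed_seconds = Decimal(str(left_seconds)) + Decimal(str(right_seconds))
    return PRICE_PER_SECOND * billed_seconds


# Reproduce the rows of the example cost table.
for left, right in [(5, 5), (8, 6), (10, 10), (12, 15)]:
    print(f"{left}s + {right}s -> ${estimate_cost(left, right):.2f}")
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the estimate is compared against an invoice.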
Best Use Cases
- Two-person conversations — Create dialogue scenes with separate speaking control for each person.
- Interview videos — Animate interviewer and guest from a single scene image.
- Presenter pairs — Generate two-host explainer or announcement videos.
- Character conversations — Build short dialogue clips for storytelling or social content.
- Virtual spokesperson scenes — Create multi-speaker brand or business communication videos.
Pro Tips
- Use a clear image where the left and right speakers are visually distinct.
- Upload clean audio for both sides to improve lip-sync and speaking clarity.
- Use `face` when facial positioning is more reliable than full-body placement.
- Use `body` when the characters are farther from the camera or their full pose matters.
- Keep the prompt simple and focused on speaking behavior, mood, or scene intent.
- Make sure the left and right audio assignments match the actual positions of the people in the image.
Notes
- `prompt`, `first_frame_image`, `left_audio`, and `right_audio` are required.
- `bboxes_type` defaults to `body`.
- Pricing depends on the sum of both audio durations.
- This workflow is intended for two-speaker avatar video generation from a single scene image.
Related Models
- Skywork AI SkyReels V3 Standard Single Avatar — Standard single-avatar talking video generation.
- Skywork AI SkyReels V3 Pro Single Avatar — Higher-quality single-avatar speaking video generation.
- Skywork AI SkyReels V3 Pro Multi Avatar — Higher-tier multi-avatar generation for more advanced character scenes.
- Skywork AI SkyReels V3 Reference-to-Video — Generate videos from one or more reference images and a prompt.
Authentication
For authentication details, please refer to the Authentication Guide.
API Endpoints
Submit Task & Query Result
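# Submit the task (the example.com URLs are placeholders; replace them with your own image and audio files)
curl --location --request POST "https://api.wavespeed.ai/api/v3/skywork-ai/skyreels-v3-standard/multi-avatar" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
"prompt": "Let the two speakers talk naturally in a professional office setting, with subtle head movement, realistic facial expressions, and stable identity for both people.",
"first_frame_image": "https://example.com/scene.png",
"left_audio": "https://example.com/left.mp3",
"right_audio": "https://example.com/right.mp3",
"bboxes_type": "body"
}'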
# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
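For callers who prefer Python over curl, the same two requests can be sketched with the standard library only. The endpoint URLs and headers come from the curl commands above. The helper names (`build_submit_request`, `build_result_request`, `send`, `main`) and the example.com file URLs are illustrative assumptions, and the sketch assumes the audio inputs are passed as URLs like the image.

```python
import json
import os
import urllib.request

API_BASE = "https://api.wavespeed.ai/api/v3"
MODEL_PATH = "skywork-ai/skyreels-v3-standard/multi-avatar"


def build_submit_request(payload, api_key):
    """Build the POST request that submits a generation task."""
    return urllib.request.Request(
        f"{API_BASE}/{MODEL_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_result_request(request_id, api_key):
    """Build the GET request that fetches the result for a task ID."""
    return urllib.request.Request(
        f"{API_BASE}/predictions/{request_id}/result",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def main():
    api_key = os.environ["WAVESPEED_API_KEY"]
    payload = {
        "prompt": "Let the two speakers talk naturally in an office.",
        # Placeholder URLs: replace with your own hosted files.
        "first_frame_image": "https://example.com/scene.png",
        "left_audio": "https://example.com/left.mp3",
        "right_audio": "https://example.com/right.mp3",
        "bboxes_type": "body",
    }
    submitted = send(build_submit_request(payload, api_key))
    request_id = submitted["data"]["id"]
    print(send(build_result_request(request_id, api_key)))
```

Calling `main()` performs real, billable requests, so it is left uncalled here. The first result fetch will usually report a non-final status, and the request has to be repeated until the task finishes.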
Parameters
Task Submission Parameters
Request Parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| prompt | string | Yes | - | - | Text prompt describing the scene, action, camera, or avatar behavior. |
| first_frame_image | string | Yes | - | - | Input image URL. |
| left_audio | string | Yes | - | - | Audio for the speaker on the left side of the image. |
| right_audio | string | Yes | - | - | Audio for the speaker on the right side of the image. |
| bboxes_type | string | No | body | body, face | Bounding box target type for speaker detection. |
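Checking the request body locally before submitting catches a missing field or a bad `bboxes_type` early. The following is a small sketch of such a check based on the table above. `build_payload` is a hypothetical helper, not part of any SDK, and it only checks presence and the allowed values, not whether the files are reachable.

```python
VALID_BBOXES_TYPES = ("body", "face")


def build_payload(prompt, first_frame_image, left_audio, right_audio,
                  bboxes_type="body"):
    """Assemble the JSON body for a task submission (hypothetical helper)."""
    required = {
        "prompt": prompt,
        "first_frame_image": first_frame_image,
        "left_audio": left_audio,
        "right_audio": right_audio,
    }
    # All four fields are marked "Required: Yes" in the table above.
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    if bboxes_type not in VALID_BBOXES_TYPES:
        raise ValueError("bboxes_type must be 'body' or 'face'")
    return {**required, "bboxes_type": bboxes_type}
```

Leaving `bboxes_type` out of the call yields `"body"`, which mirrors the documented default.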
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data.id | string | Unique identifier for the prediction (the task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
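The two fields a client usually needs from the submission response are `data.id` and `data.urls.get`. Here is a hedged sketch of pulling them out. The sample response is made up to match the table above and is not captured from a real call, and treating any `code` other than 200 as a failure is an assumption.

```python
def parse_submission(response):
    """Return (task_id, result_url) from a submission response dict."""
    # Assumption: a non-200 code in the body means the submission was rejected.
    if response.get("code") != 200:
        raise RuntimeError(f"submission failed: {response.get('message')}")
    data = response["data"]
    return data["id"], data["urls"]["get"]


# Illustrative response shaped like the table above; all values are invented.
sample_response = {
    "code": 200,
    "message": "success",
    "data": {
        "id": "task-123",
        "model": "skywork-ai/skyreels-v3-standard/multi-avatar",
        "outputs": [],
        "urls": {"get": "https://api.wavespeed.ai/api/v3/predictions/task-123/result"},
        "status": "created",
        "created_at": "2023-04-01T12:34:56.789Z",
        "error": "",
        "timings": {"inference": 0},
    },
}

task_id, result_url = parse_submission(sample_response)
print(task_id, result_url)
```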
Result Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | Yes | - | Task ID |
Result Response Parameters
| Parameter | Type | Description |
|---|---|---|
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., “success”) |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier for the prediction (the task ID that was requested) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”) |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
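Because `data.status` moves through created and processing before reaching completed or failed, a client typically polls the result endpoint until it sees a final status. The loop below is one way to sketch that. `wait_for_outputs` is a hypothetical helper, the 2-second interval and 10-minute timeout are arbitrary assumptions, and `fetch_result` stands for any callable that returns the parsed result response.

```python
import time


def wait_for_outputs(fetch_result, poll_interval=2.0, timeout=600.0,
                     sleep=time.sleep):
    """Poll until the task completes, then return data.outputs.

    fetch_result: callable returning the parsed result response (a dict).
    The interval and timeout defaults are assumptions, not documented values.
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_result()["data"]
        status = data["status"]
        if status == "completed":
            return data["outputs"]
        if status == "failed":
            raise RuntimeError(data.get("error") or "task failed")
        # "created" and "processing" mean the task is still running.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout} seconds")
        sleep(poll_interval)


# Demonstration with canned responses instead of real HTTP calls.
canned = iter([
    {"data": {"status": "created", "outputs": []}},
    {"data": {"status": "processing", "outputs": []}},
    {"data": {"status": "completed", "outputs": ["https://example.com/video.mp4"]}},
])
outputs = wait_for_outputs(lambda: next(canned), sleep=lambda seconds: None)
print(outputs)
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP client, so it can be exercised with canned responses as shown.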