SkyReels V3 Pro Multi Avatar is a high-quality AI talking avatar video generation model that creates multi-speaker avatar videos from one image, multiple audio tracks, and bounding boxes. It is available as a ready-to-use REST inference API for group avatar videos, digital humans, virtual presenters, dialogue scenes, education content, marketing creatives, and professional multi-avatar video workflows, with simple integration, no cold starts, and affordable pricing.
$0.12 per run · ~83 per $10
Skywork AI SkyReels V3 Pro Multi Avatar generates a two-speaker avatar video from a single first-frame image plus separate left and right audio tracks. It is designed for higher-quality multi-character speaking scenes, with stronger realism, smoother facial animation, and more polished lip-sync than the Standard variant.
- Two-speaker avatar generation: Animate two speakers from a single scene image with separate audio tracks for each side.
- Higher-quality multi-avatar performance: The Pro variant is built for stronger realism, cleaner lip-sync, and more polished facial animation.
- Separate left and right speaker control: Upload different audio clips for the left and right speakers to drive each character independently.
- Prompt-guided scene behavior: Add a prompt to guide mood, scene setup, speaking style, or camera feel.
- Speaker detection control: Use bboxes_type to control whether speaker detection is based on body or face.
- Production-ready workflow: Suitable for conversations, interviews, presenter scenes, and other multi-character speaking video workflows.

| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | Text prompt describing the scene, action, camera, or avatar behavior. |
| first_frame_image | Yes | Input image used as the first frame and visual source for the two-speaker scene. |
| left_audio | Yes | Audio for the speaker on the left side of the image. |
| right_audio | Yes | Audio for the speaker on the right side of the image. |
| bboxes_type | No | Bounding box target type for speaker detection. Supported values: body or face. Default: body. |
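
The parameter table above can be turned into a small request-building helper. The following is a minimal sketch, not the official client: the function name `build_payload` is hypothetical, the media values are assumed to be URL strings, and the endpoint and authentication are not specified in this document and are deliberately left out.

```python
from typing import Dict

# Field names and allowed values are taken from the parameter table.
REQUIRED_FIELDS = ("prompt", "first_frame_image", "left_audio", "right_audio")
BBOXES_TYPES = ("body", "face")


def build_payload(
    prompt: str,
    first_frame_image: str,
    left_audio: str,
    right_audio: str,
    bboxes_type: str = "body",  # documented default
) -> Dict[str, str]:
    """Assemble and validate the request body (hypothetical helper).

    Assumption: image and audio inputs are passed as URL strings.
    """
    payload = {
        "prompt": prompt,
        "first_frame_image": first_frame_image,
        "left_audio": left_audio,
        "right_audio": right_audio,
        "bboxes_type": bboxes_type,
    }
    # All four required parameters must be non-empty.
    missing = [name for name in REQUIRED_FIELDS if not payload[name]]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    # bboxes_type only supports the two documented values.
    if bboxes_type not in BBOXES_TYPES:
        raise ValueError(f"bboxes_type must be one of {BBOXES_TYPES}")
    return payload
```

Validating locally before sending the request avoids paying for a round trip that would fail on a missing audio track or an unsupported bboxes_type value.
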
Set bboxes_type to body or face depending on how you want the model to identify each speaker.

Example prompt: Let the two speakers talk naturally in a professional office setting, with subtle head movement, realistic facial expressions, and stable identity for both people.

Pricing is based on the combined duration of both audio tracks.
prompt, first_frame_image, and bboxes_type do not affect pricing.

| Left Audio | Right Audio | Total Billed Duration | Cost |
|---|---|---|---|
| 5s | 5s | 10s | $1.20 |
| 8s | 6s | 14s | $1.68 |
| 10s | 10s | 20s | $2.40 |
| 12s | 15s | 27s | $3.24 |
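
Every row in the table above is consistent with a flat rate of $0.12 per billed second, where the billed duration is the sum of the two audio lengths. The sketch below reproduces that arithmetic. The rate is inferred from the table rather than stated as an official formula, the function name `estimate_cost` is hypothetical, and how fractional seconds are rounded for billing is not documented here, so the result is only rounded to the nearest cent.

```python
from decimal import Decimal

# Inferred from the pricing table (e.g. 10s billed -> $1.20); an assumption.
RATE_PER_SECOND = Decimal("0.12")


def estimate_cost(left_seconds: float, right_seconds: float) -> Decimal:
    """Estimate the cost in USD from the two audio durations in seconds."""
    if left_seconds < 0 or right_seconds < 0:
        raise ValueError("durations must be non-negative")
    # Billed duration is the combined length of both tracks.
    total_seconds = Decimal(str(left_seconds)) + Decimal(str(right_seconds))
    # Decimal avoids binary floating-point drift in currency amounts.
    return (total_seconds * RATE_PER_SECOND).quantize(Decimal("0.01"))
```

For example, `estimate_cost(8, 6)` bills 14 seconds and returns `Decimal('1.68')`, matching the second row of the table.
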

- Use face when facial positioning is more reliable than full-body placement.
- Use body when the characters are farther from the camera or their full pose matters.
- prompt, first_frame_image, left_audio, and right_audio are required.
- bboxes_type defaults to body.