SkyReels V3 Standard Multi Avatar is a fast AI talking-avatar video generation model that creates multi-speaker avatar videos from one image, multiple audio tracks, and bounding boxes. It is available as a ready-to-use REST inference API for group avatar videos, digital humans, virtual presenters, dialogue scenes, education content, marketing creatives, and professional multi-avatar video workflows, with simple integration, no cold starts, and affordable pricing.
$0.08 per run · ~12 runs per $1
Skywork AI SkyReels V3 Standard Multi Avatar generates a two-speaker avatar video from a single first-frame image plus separate left and right audio tracks. It is designed for dialogue scenes, interviews, presenter pairs, and other multi-character speaking workflows where each speaker is driven by their own audio input.
Two-speaker avatar generation: Animate two speakers from a single scene image with separate audio tracks for each side.
Independent left/right audio control: Upload different audio clips for the left and right speakers to drive each character separately.
Prompt-guided scene behavior: Use a text prompt to guide mood, speaking style, scene setup, or camera feel.
Speaker detection control: Choose whether speaker detection is based on body or face.
Simple workflow: Upload one image, upload two audio clips, write a prompt, and generate the final conversation video.
Production-ready API: Suitable for conversations, interviews, presenter scenes, and short-form multi-character avatar content.
| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | Text prompt describing the scene, action, camera, or avatar behavior. |
| first_frame_image | Yes | Input image used as the first frame and visual source for the two-speaker scene. |
| left_audio | Yes | Audio for the speaker on the left side of the image. |
| right_audio | Yes | Audio for the speaker on the right side of the image. |
| bboxes_type | No | Bounding box target type for speaker detection. Supported values: body or face. Default: body. |
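The parameter table above can be sketched as a small request-body builder. This is an illustration only: `build_payload` is a hypothetical helper, the example.com URLs are placeholders, and the source does not state the endpoint, the authentication scheme, or whether media fields take URLs or uploaded file references, so none of those are shown.

```python
import json

# Field names and the body/face values come from the parameter table;
# everything else in this sketch is an assumption made for illustration.
REQUIRED_FIELDS = ("prompt", "first_frame_image", "left_audio", "right_audio")
ALLOWED_BBOXES_TYPES = ("body", "face")


def build_payload(prompt, first_frame_image, left_audio, right_audio,
                  bboxes_type="body"):
    """Return a dict holding the documented parameters, after basic checks."""
    payload = {
        "prompt": prompt,
        "first_frame_image": first_frame_image,
        "left_audio": left_audio,
        "right_audio": right_audio,
        "bboxes_type": bboxes_type,  # optional; the table gives body as default
    }
    # All four required fields must be non-empty.
    missing = [name for name in REQUIRED_FIELDS if not payload[name]]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    if bboxes_type not in ALLOWED_BBOXES_TYPES:
        raise ValueError("bboxes_type must be 'body' or 'face'")
    return payload


if __name__ == "__main__":
    body = build_payload(
        prompt="Two presenters talk naturally in a professional office.",
        first_frame_image="https://example.com/scene.png",   # placeholder
        left_audio="https://example.com/left.wav",           # placeholder
        right_audio="https://example.com/right.wav",         # placeholder
    )
    print(json.dumps(body, indent=2))
```

Validating locally before sending a request avoids paying for a round trip that would fail on a missing field or an unsupported `bboxes_type` value.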
Set bboxes_type to body or face depending on how you want the model to identify each speaker.
Example prompt: Let the two speakers talk naturally in a professional office setting, with subtle head movement, realistic facial expressions, and stable identity for both people.
Pricing is based on the combined duration of both audio tracks.
prompt, first_frame_image, and bboxes_type do not affect pricing.

| Left Audio | Right Audio | Total Billed Duration | Cost |
|---|---|---|---|
| 5s | 5s | 10s | $0.80 |
| 8s | 6s | 14s | $1.12 |
| 10s | 10s | 20s | $1.60 |
| 12s | 15s | 27s | $2.16 |
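The arithmetic behind the table can be sketched as follows. The $0.08 per billed second rate is inferred from the rows above (for example, 10s costs $0.80) rather than stated outright, and `estimate_cost` is a hypothetical helper name. How fractional seconds are rounded is not documented, so this sketch does no rounding.

```python
from decimal import Decimal

# Assumption: rate inferred from the pricing table rows, not an official constant.
RATE_PER_SECOND = Decimal("0.08")


def estimate_cost(left_seconds, right_seconds):
    """Estimate cost in USD from the combined duration of both audio tracks."""
    if left_seconds < 0 or right_seconds < 0:
        raise ValueError("durations must be non-negative")
    # Convert via str so float inputs do not carry binary rounding noise.
    total_seconds = Decimal(str(left_seconds)) + Decimal(str(right_seconds))
    return total_seconds * RATE_PER_SECOND


if __name__ == "__main__":
    # Reproduce the rows of the table: (left audio, right audio) in seconds.
    for left, right in [(5, 5), (8, 6), (10, 10), (12, 15)]:
        print(f"{left}s + {right}s = {left + right}s -> ${estimate_cost(left, right)}")
```

`Decimal` is used instead of floats so the estimates match the table's two-decimal dollar amounts exactly.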
Use face when facial positioning is more reliable than full-body placement.
Use body when the characters are farther from the camera or their full pose matters.
prompt, first_frame_image, left_audio, and right_audio are required.
bboxes_type defaults to body.