
SkyReels V3 Standard Multi Avatar API


SkyReels V3 Standard Multi Avatar is a fast AI talking-avatar video generation model that creates multi-speaker avatar videos from one image, multiple audio tracks, and bounding boxes. It is offered as a ready-to-use REST inference API for group avatar videos, digital humans, virtual presenters, dialogue scenes, education content, marketing creatives, and professional multi-avatar video workflows, with simple integration, no cold starts, and affordable pricing.


Skywork AI SkyReels V3 Standard Multi Avatar

Skywork AI SkyReels V3 Standard Multi Avatar generates a two-speaker avatar video from a single first-frame image plus separate left and right audio tracks. It is designed for dialogue scenes, interviews, presenter pairs, and other multi-character speaking workflows where each speaker is driven by their own audio input.

Why Choose This?

  • Two-speaker avatar generation: Animate two speakers from a single scene image with separate audio tracks for each side.
  • Independent left/right audio control: Upload different audio clips for the left and right speakers to drive each character separately.
  • Prompt-guided scene behavior: Use a text prompt to guide mood, speaking style, scene setup, or camera feel.
  • Speaker detection control: Choose whether speaker detection is based on body or face.
  • Simple workflow: Upload one image, upload two audio clips, write a prompt, and generate the final conversation video.
  • Production-ready API: Suitable for conversations, interviews, presenter scenes, and short-form multi-character avatar content.

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| prompt | Yes | Text prompt describing the scene, action, camera, or avatar behavior. |
| first_frame_image | Yes | Input image used as the first frame and visual source for the two-speaker scene. |
| left_audio | Yes | Audio for the speaker on the left side of the image. |
| right_audio | Yes | Audio for the speaker on the right side of the image. |
| bboxes_type | No | Bounding box target type for speaker detection. Supported values: body or face. Default: body. |
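
The parameters above can be collected and checked before a run. The following is a minimal sketch, not the official client: the helper name `build_payload` is hypothetical, and it assumes the image and audio inputs are passed as URL strings, which this page does not confirm. Only the parameter names, the required/optional split, and the `bboxes_type` values come from the table.

```python
# Parameter names and allowed values are taken from the parameter table.
REQUIRED_FIELDS = ("prompt", "first_frame_image", "left_audio", "right_audio")
ALLOWED_BBOXES_TYPES = {"body", "face"}


def build_payload(prompt, first_frame_image, left_audio, right_audio,
                  bboxes_type="body"):
    """Assemble the documented parameters into a dict and validate them.

    Assumption: media inputs are URL strings (not confirmed by the source).
    """
    payload = {
        "prompt": prompt,
        "first_frame_image": first_frame_image,
        "left_audio": left_audio,
        "right_audio": right_audio,
        "bboxes_type": bboxes_type,  # optional; defaults to "body"
    }
    # Every required parameter must be present and non-empty.
    missing = [name for name in REQUIRED_FIELDS if not payload[name]]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))
    # bboxes_type only supports the two documented values.
    if bboxes_type not in ALLOWED_BBOXES_TYPES:
        raise ValueError("bboxes_type must be 'body' or 'face'")
    return payload
```

Validating locally catches a missing audio track or a misspelled `bboxes_type` before a paid run is submitted.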

How to Use

  1. Upload the first-frame image — provide the scene image containing the two speakers.
  2. Upload left speaker audio — add the audio for the person on the left side of the image.
  3. Upload right speaker audio — add the audio for the person on the right side of the image.
  4. Write your prompt — describe the speaking behavior, mood, scene setup, or camera style.
  5. Choose speaker detection type (optional) — use body or face depending on how you want the model to identify each speaker.
  6. Submit — run the model and download the generated video.
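
For API use, the same steps might look roughly like the sketch below. It only builds the HTTP request and does not send it. The endpoint URL is a placeholder, and the bearer-token header and JSON body shape are assumptions; none of these are stated on this page, so check the official API documentation for the real values. The response format and the way the finished video is retrieved are not described here and are left out.

```python
import json
import urllib.request

# Hypothetical placeholder URL: replace with the endpoint from the API docs.
API_URL = "https://api.example.com/skyreels-v3-standard-multi-avatar"


def build_request(api_key, payload, url=API_URL):
    """Build (but do not send) a JSON POST request for one generation run."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


# Steps 1-5: image, left audio, right audio, prompt, optional detection type.
# Assumption: media inputs are given as URL strings.
request = build_request(
    "YOUR_API_KEY",
    {
        "first_frame_image": "https://example.com/scene.png",
        "left_audio": "https://example.com/left.mp3",
        "right_audio": "https://example.com/right.mp3",
        "prompt": "Two hosts talk naturally in a professional office setting.",
        "bboxes_type": "face",
    },
)
# Step 6: submitting is left to the caller, e.g. urllib.request.urlopen(request).
```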

Example Prompt

Let the two speakers talk naturally in a professional office setting, with subtle head movement, realistic facial expressions, and stable identity for both people.

Pricing

Pricing is based on the combined duration of both audio tracks.

Billing Rules

  • Base price is $0.08 per second
  • Total billed duration = left audio duration + right audio duration
  • Total price = $0.08 × (left audio duration + right audio duration)
  • prompt, first_frame_image, and bboxes_type do not affect pricing

Example Costs

| Left Audio | Right Audio | Total Billed Duration | Cost |
| --- | --- | --- | --- |
| 5s | 5s | 10s | $0.80 |
| 8s | 6s | 14s | $1.12 |
| 10s | 10s | 20s | $1.60 |
| 12s | 15s | 27s | $2.16 |
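
The billing rule can be written as a short function for estimating cost before a run. This is a sketch: the function name `estimate_cost` is hypothetical, and rounding the result to whole cents is an assumption, since the page does not say how fractional seconds are billed.

```python
from decimal import Decimal

# Base price from the billing rules: $0.08 per second of audio.
PRICE_PER_SECOND = Decimal("0.08")


def estimate_cost(left_seconds, right_seconds):
    """Return the estimated price in dollars for one run.

    Total price = 0.08 * (left audio duration + right audio duration).
    Assumption: the result is rounded to cents; the actual rounding
    rule for fractional seconds is not documented here.
    """
    left = Decimal(str(left_seconds))
    right = Decimal(str(right_seconds))
    if left < 0 or right < 0:
        raise ValueError("audio durations must be non-negative")
    return (PRICE_PER_SECOND * (left + right)).quantize(Decimal("0.01"))


# Reproduces the rows of the example cost table.
for left, right in [(5, 5), (8, 6), (10, 10), (12, 15)]:
    print(f"{left}s + {right}s -> ${estimate_cost(left, right)}")
```

`Decimal` is used instead of floating point so that the estimates match the listed dollar amounts exactly.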

Best Use Cases

  • Two-person conversations — Create dialogue scenes with separate speaking control for each person.
  • Interview videos — Animate interviewer and guest from a single scene image.
  • Presenter pairs — Generate two-host explainer or announcement videos.
  • Character conversations — Build short dialogue clips for storytelling or social content.
  • Virtual spokesperson scenes — Create multi-speaker brand or business communication videos.

Pro Tips

  • Use a clear image where the left and right speakers are visually distinct.
  • Upload clean audio for both sides to improve lip-sync and speaking clarity.
  • Use face when facial positioning is more reliable than full-body placement.
  • Use body when the characters are farther from the camera or their full pose matters.
  • Keep the prompt simple and focused on speaking behavior, mood, or scene intent.
  • Make sure the left and right audio assignments match the actual positions of the people in the image.

Notes

  • prompt, first_frame_image, left_audio, and right_audio are required.
  • bboxes_type defaults to body.
  • Pricing depends on the sum of both audio durations.
  • This workflow is intended for two-speaker avatar video generation from a single scene image.
