Introducing Vidu Q3 Reference-to-Video on WaveSpeedAI

Vidu Q3 Reference-to-Video: Multi-Entity Consistent Video Generation from Reference Images

Creating AI-generated video with consistent characters has been one of the hardest problems in generative AI — until now. Vidu Q3 Reference-to-Video Mix solves this challenge by generating cinematic, multi-entity consistent videos from 1–4 reference images combined with a text prompt. Available today on WaveSpeedAI with no cold starts and pay-per-second pricing, this model lets creators, marketers, and developers produce character-driven video content where every subject stays visually coherent from the first frame to the last.

Built by ShengShu Technology — the team behind the globally top-ranked Vidu video generation platform — Q3 Reference-to-Video represents a leap forward from single-image animation. Instead of hoping your character looks the same across clips, you supply reference images that lock in identity, style, and appearance, then describe the scene you want. The result is production-ready video with synchronized audio, up to 1080p resolution, and up to 16 seconds of duration.

Try Vidu Q3 Reference-to-Video on WaveSpeedAI →

How Vidu Q3 Reference-to-Video Works

Vidu Q3 Reference-to-Video uses ShengShu’s proprietary U-ViT architecture, a diffusion backbone that pairs a Vision Transformer with U-Net-style long skip connections, specifically engineered for multi-entity consistency. Here’s the workflow:

Upload 1–4 reference images — These establish the visual identity of characters, objects, or style elements you want preserved in the output video.
Write a text prompt — Describe the scene, action, camera movement, and atmosphere. A built-in Prompt Enhancer can automatically improve your descriptions for richer output.
Configure output settings — Choose your aspect ratio (16:9, 9:16, 1:1, and more), resolution (480p, 720p, or 1080p), and duration (up to 16 seconds).
Generate — The model blends all reference images into a cohesive, motion-consistent video with optional synchronized audio.

What sets this apart from standard image-to-video models is the multi-reference fusion. Traditional models animate a single image. Vidu Q3 Reference-to-Video combines multiple source images — different characters, different angles, different style references — into a single unified scene while preserving each entity’s distinct identity throughout the clip.

Technical Specifications

Parameter	Details
Input	1–4 reference images + text prompt
Resolution	480p, 720p, 1080p
Duration	Up to 16 seconds
Aspect Ratios	16:9, 9:16, 1:1, and more
Audio	Native synchronized audio generation (optional)
Reproducibility	Seed parameter for consistent results
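
As a rough illustration, the limits in the table above could be checked on the client before a request is sent. This is a minimal sketch, not part of any WaveSpeedAI SDK: the function name validate_settings and its constants are hypothetical, the minimum duration is assumed to be anything above zero, and aspect ratios are not checked because the supported list is open-ended ("and more").

```python
# Limits taken from the specification table; names are hypothetical.
ALLOWED_RESOLUTIONS = {"480p", "720p", "1080p"}
MAX_DURATION_SECONDS = 16
MAX_REFERENCE_IMAGES = 4


def validate_settings(images, resolution, duration):
    """Return a list of problems; an empty list means the settings fit the table."""
    problems = []
    if not 1 <= len(images) <= MAX_REFERENCE_IMAGES:
        problems.append(f"expected 1-4 reference images, got {len(images)}")
    if resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    # Assumption: any positive duration up to the documented maximum is accepted.
    if not 0 < duration <= MAX_DURATION_SECONDS:
        problems.append(f"duration must be above 0 and at most 16 seconds, got {duration}")
    return problems


if __name__ == "__main__":
    print(validate_settings(["https://example.com/hero.png"], "720p", 5))
    print(validate_settings([], "4k", 20))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once; the API itself remains the authority on what it accepts.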

Key Features of Vidu Q3 Reference-to-Video Mix

Multi-entity character consistency — Upload separate reference images for different characters and each one appears in the output with its identity preserved. No more “character drift” between frames.
Native audio-visual generation — Vidu Q3 is the industry’s first long-form AI video model to deliver synchronized audio and video in a single pass, including ambient sound, dialogue-ready lip sync, and atmospheric audio.
1080p native rendering — Full HD output without artificial upscaling. Frames are clean, detailed, and well-balanced even in high-contrast scenes.
Up to 16 seconds per clip — The longest maximum duration among leading AI video models, giving you enough time for complete product demos, story arcs, and cinematic sequences.
Built-in Prompt Enhancer — Automatically enriches your scene descriptions for more detailed, cinematic output without requiring prompt engineering expertise.
Deterministic output with seed control — Lock in a specific result and iterate on resolution or duration changes while maintaining the same creative direction.

Best Use Cases for Vidu Q3 Reference-to-Video

Character-Driven Storytelling and Animation

Create animated series with consistent characters across multiple episodes. Upload character reference sheets and generate scene after scene where your protagonist looks identical every time. ShengShu demonstrated this capability at SXSW 2026, showcasing the world’s first AI solution for animated series production — and Vidu Q3 Reference-to-Video is the engine behind it.

Brand mascots and influencer avatars need to look the same across every piece of content. Upload your brand character’s reference images once, then generate dozens of short-form videos for TikTok, Instagram Reels, or YouTube Shorts — all visually consistent, all produced in minutes instead of days.

Product Marketing and E-Commerce Video

Place your product in dynamic, cinematic scenes without a photo studio. Upload product photos from multiple angles, write a prompt describing the lifestyle context, and generate marketing videos that showcase your product in action. The multi-reference input helps the model understand your product’s 3D structure for more accurate rendering.

Creative Concepting and Storyboard Prototyping

Pitch decks and storyboards come alive when you can show stakeholders actual video instead of static frames. Rapidly prototype multi-character scenes by uploading reference images of each character and describing the interaction. Iterate at 480p for speed, then render the approved concept at 1080p.

Music Videos and Short Films

Combine multiple character references with atmospheric prompts to generate music video sequences. With native audio generation, you can even produce synchronized ambient soundscapes alongside the visual output — then layer your own soundtrack in post-production.

Style-Consistent Video Series

Maintain a unified visual aesthetic across an entire content series. Upload the same style reference images for every generation to ensure your brand’s look and feel stays locked in, whether you’re producing 5 videos or 50.
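
One way to keep a series on the same references is to build every request from a shared image list and vary only the prompt. The sketch below assumes the payload fields used in the API example in this article (prompt, images, aspect_ratio, resolution, duration, generate_audio); the helper name build_series_payloads and the example.com URLs are hypothetical.

```python
def build_series_payloads(reference_images, prompts, resolution="720p", duration=5):
    """Build one request payload per prompt, all sharing the same reference images."""
    shared = list(reference_images)
    return [
        {
            "prompt": prompt,
            "images": list(shared),  # copy so payloads do not share one list object
            "aspect_ratio": "16:9",
            "resolution": resolution,
            "duration": duration,
            "generate_audio": True,
        }
        for prompt in prompts
    ]


if __name__ == "__main__":
    style_refs = [
        "https://example.com/brand-style-1.png",  # placeholder URL
        "https://example.com/brand-style-2.png",  # placeholder URL
    ]
    episode_prompts = [
        "The mascot waves from a rooftop at sunset",
        "The mascot rides a bicycle through a rainy street",
    ]
    for payload in build_series_payloads(style_refs, episode_prompts):
        print(payload["prompt"], len(payload["images"]))
```

Each payload can then be submitted the same way as a single request.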

Start generating consistent video content →

Vidu Q3 Reference-to-Video Pricing and API Access

WaveSpeedAI offers Vidu Q3 Reference-to-Video with straightforward per-second billing and no subscription required.

Pricing Table

Duration	480p	720p / 1080p
5s	$0.35	$0.77
10s	$0.70	$1.54
15s	$1.05	$2.31

Billing rates:

480p: $0.07 per second
720p / 1080p: $0.154 per second
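
The table rows follow directly from the per-second rates, so a small estimator can reproduce them. This sketch assumes billing is strictly linear in duration; the function name estimate_cost is hypothetical, and Decimal is used to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal

# Per-second rates in USD, copied from the billing rates above.
RATE_PER_SECOND = {
    "480p": Decimal("0.07"),
    "720p": Decimal("0.154"),
    "1080p": Decimal("0.154"),
}


def estimate_cost(resolution, duration_seconds):
    """Estimate the USD cost of one clip, assuming linear per-second billing."""
    return RATE_PER_SECOND[resolution] * duration_seconds


if __name__ == "__main__":
    for seconds in (5, 10, 15):
        low = estimate_cost("480p", seconds)
        high = estimate_cost("1080p", seconds)
        print(f"{seconds}s: 480p ${low:.2f}, 720p/1080p ${high:.2f}")
```

For the 5, 10, and 15 second rows this gives the same figures as the pricing table.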

API Integration

Integrate Vidu Q3 Reference-to-Video directly into your application with WaveSpeedAI’s REST API. No cold starts, no GPU provisioning — just send a request and get video back.

import json
import os
import time
from urllib.request import Request, urlopen

api_key = os.environ["WAVESPEED_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
payload = {
    "prompt": "A cinematic ocean wave at sunrise, highly detailed",
    "images": [
        "https://interactive-examples.mdn.mozilla.net/media/cc0-images/painted-hand-298-332.jpg"
    ],
    "aspect_ratio": "16:9",
    "resolution": "720p",
    "duration": 5,
    "generate_audio": True
}

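def request_json(url, data=None):
    # POST when a JSON body is supplied, otherwise GET; the timeout stops a stalled connection from hanging.
    request = Request(url, data=data, headers=headers, method="POST" if data is not None else "GET")
    with urlopen(request, timeout=30) as response:
        return json.load(response)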

# 1. Submit the prediction.
submit_body = request_json("https://api.wavespeed.ai/api/v3/vidu/q3/reference-to-video", json.dumps(payload).encode())
task = submit_body.get("data", submit_body)
prediction_id = task.get("id")
if not prediction_id:
    raise RuntimeError("Submission response did not contain a prediction id")
result_url = task.get("urls", {}).get("get") or f"https://api.wavespeed.ai/api/v3/predictions/{prediction_id}/result"

# 2. Poll until the prediction finishes.
while True:
    body = request_json(result_url)
    result = body.get("data", body)
    status = result.get("status")
    if status == "completed":
        print(result.get("outputs", []))
        break
    if status in {"failed", "cancelled", "timeout"}:
        raise RuntimeError(result)
    if status not in {"created", "processing"}:
        raise RuntimeError(f"Unexpected status: {status}")
    time.sleep(2)
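
The polling loop above runs until the prediction reaches a terminal status. In a production service it may be worth bounding the total wait. The following is a minimal sketch of that idea, assuming the same status values as the example; poll_until_done, its parameters, and the 300 second default are hypothetical, and the status fetcher is passed in so the logic can be exercised without network access.

```python
import time

# Terminal failure statuses, taken from the example above.
TERMINAL_FAILURES = {"failed", "cancelled", "timeout"}


def poll_until_done(fetch_status, interval=2.0, max_wait=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports completion, failure, or the deadline passes."""
    deadline = clock() + max_wait
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Prediction ended with status: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Prediction still {status!r} after {max_wait} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Simulated responses standing in for real API calls.
    fake = iter([
        {"status": "created"},
        {"status": "processing"},
        {"status": "completed", "outputs": ["https://example.com/video.mp4"]},
    ])
    print(poll_until_done(lambda: next(fake), interval=0.0))
```

With the example above, fetch_status could be written as lambda: request_json(result_url).get("data", {}), assuming the response keeps the same shape.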

WaveSpeedAI advantages:

No cold starts — Models are always warm and ready to generate
Pay-per-use — No subscriptions, no minimum commitments
REST API — Standard HTTP integration that works with any language or framework

Explore the full Vidu model collection on WaveSpeedAI for additional video generation capabilities.

Tips for Best Results with Vidu Q3 Reference-to-Video

Use clear, well-lit reference images — High-quality inputs with distinct subjects produce the most accurate identity preservation. Avoid blurry or heavily filtered source images.
Start at 480p for rapid iteration — Test your prompt and reference combination at lower resolution before committing to a 1080p render. This saves both time and cost.
Provide multiple angles when possible — If you want the model to understand a character’s full appearance, include front-facing and profile reference images. More references give the model a richer understanding of your subject’s 3D structure.
Write detailed, specific prompts — Instead of “two people talking,” try “two characters seated at a café table, warm afternoon light, one gesturing while speaking, shallow depth of field.” Use the built-in Prompt Enhancer if you want automatic improvement.
Use the seed parameter for consistency — Once you find a result you like, lock the seed and iterate on resolution, duration, or prompt tweaks while maintaining the same creative direction.
Disable audio when adding your own soundtrack — Set generate_audio to false if you plan to add custom music or voiceover in post-production to avoid conflicting audio layers.
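
The draft-then-final workflow from these tips can be sketched as a base payload plus overrides. Note the assumptions: the field name "seed" is a guess based on the seed parameter mentioned in this article and should be checked against the API reference, the helper with_overrides is hypothetical, and the image URLs are placeholders.

```python
import copy


def with_overrides(base_payload, **overrides):
    """Return a copy of base_payload with selected fields replaced."""
    payload = copy.deepcopy(base_payload)
    payload.update(overrides)
    return payload


# Cheap draft: low resolution, audio off, fixed seed (assumed field name).
draft = {
    "prompt": "Two characters seated at a café table, warm afternoon light, "
              "one gesturing while speaking, shallow depth of field",
    "images": [
        "https://example.com/character-front.png",    # placeholder URL
        "https://example.com/character-profile.png",  # placeholder URL
    ],
    "aspect_ratio": "16:9",
    "resolution": "480p",
    "duration": 5,
    "generate_audio": False,
    "seed": 12345,
}

# Final render: same seed and prompt, only the resolution changes.
final = with_overrides(draft, resolution="1080p")

if __name__ == "__main__":
    print(draft["resolution"], final["resolution"], final["seed"])
```

Keeping every other field identical between the two payloads makes it easier to attribute any difference in the output to the resolution change alone.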

Frequently Asked Questions About Vidu Q3 Reference-to-Video

What is Vidu Q3 Reference-to-Video?

Vidu Q3 Reference-to-Video is an AI video generation model that creates cinematic, multi-entity consistent videos from 1–4 reference images combined with a text prompt, supporting resolutions up to 1080p and durations up to 16 seconds with optional synchronized audio.

How much does Vidu Q3 Reference-to-Video cost?

Pricing starts at $0.07 per second for 480p and $0.154 per second for 720p/1080p on WaveSpeedAI, with no subscription required — you only pay for what you generate.

Can I use Vidu Q3 Reference-to-Video via API?

Yes. WaveSpeedAI provides a REST API for Vidu Q3 Reference-to-Video with no cold starts. You can integrate it into any application using the WaveSpeed Python SDK or standard HTTP requests.

How many reference images can I use with Vidu Q3 Reference-to-Video?

You can upload 1 to 4 reference images per generation. Each image helps the model understand characters, styles, or visual elements you want preserved in the output video.

Does Vidu Q3 Reference-to-Video generate audio?

Yes. Vidu Q3 includes native synchronized audio generation enabled by default, producing ambient sound and atmosphere alongside the video. You can disable this feature if you prefer to add your own audio in post-production.

Ready to create character-consistent AI video from your own reference images? Try Vidu Q3 Reference-to-Video on WaveSpeedAI today — no cold starts, no subscription, just results.