Introducing Alibaba WAN 2.6 Reference-to-Video Flash on WaveSpeedAI
Alibaba WAN 2.6 Reference-to-Video Flash is Now Available on WaveSpeedAI
Speed meets consistency. WaveSpeedAI is excited to announce the launch of Alibaba WAN 2.6 Reference-to-Video Flash, the fast, distilled variant of Alibaba’s identity-preserving video generation model. If you’ve been working with reference-to-video workflows and wished the results came back faster, this model is built for you — delivering the same character consistency and multi-shot storytelling in a fraction of the generation time.
What is WAN 2.6 Reference-to-Video Flash?
WAN 2.6 Reference-to-Video Flash is the speed-optimized counterpart to the standard WAN 2.6 Reference-to-Video model. Distilled from the full-size model, it retains the core capability that makes the WAN 2.6 R2V family unique: you upload reference images of characters, props, or scenes, write a text prompt describing the video you want, and the model generates new video shots that faithfully preserve the identity and appearance of your reference subjects.
The Flash version achieves significantly faster inference — generating videos in seconds rather than minutes — while maintaining the visual quality, motion coherence, and identity preservation that define the WAN 2.6 series. It supports up to 5 reference images, 720p and 1080p output, durations of 5 or 10 seconds, and optional synchronized audio generation.
Key Features
- Multi-Reference Input: Upload up to 5 reference images to guide the generation. Multiple angles and viewpoints of the same subject yield better identity preservation — a substantial upgrade over typical single-reference workflows
- Identity Preservation at Speed: The Flash model maintains facial features, clothing, body proportions, and distinctive characteristics of your reference subjects across every generated frame, now with dramatically reduced wait times
- Multi-Shot Composition: Choose between a single continuous shot or an automatic multi-shot composition that breaks your prompt into multiple coherent shots with smooth transitions — cinematic storytelling from a single API call
- Built-In Audio Generation: Enable optional synchronized audio, including background music, ambient sounds, and Foley effects, matched to the generated video content. No post-production dubbing required
- Resolution Flexibility: Generate in 720p (1280×720 or 720×1280) or 1080p (1920×1080 or 1080×1920) to match your output requirements — landscape or portrait
- Prompt Expansion: A built-in prompt enhancer can automatically refine your descriptions into richer, more detailed prompts, improving generation quality without requiring expert prompt engineering. Each of these features maps directly to a request parameter, as sketched below
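To make that mapping concrete, here is a minimal sketch of a single request that exercises all of these features, using the same `wavespeed.run` call and parameter names shown in the Getting Started and Configuration Options sections below. The image URLs and prompt are placeholders, and the 1080p size string is an assumption that follows the same width*height format as the 720p example later in this post.

```python
import wavespeed

# Illustrative sketch: each Key Feature above corresponds to one request parameter.
output = wavespeed.run(
    "alibaba/wan-2.6/reference-to-video-flash",
    {
        # Multi-Reference Input: up to 5 images of the same subject from different angles
        "reference_urls": [
            "https://example.com/mascot-front.jpg",
            "https://example.com/mascot-side.jpg",
            "https://example.com/mascot-back.jpg",
        ],
        "prompt": "The mascot waves at the camera, then strolls across a sunlit plaza",
        "size": "1920*1080",              # Resolution Flexibility (assumed width*height format)
        "duration": 10,                   # 5 or 10 seconds
        "shot_type": "multi",             # Multi-Shot Composition
        "enable_audio": True,             # Built-In Audio Generation
        "enable_prompt_expansion": True,  # Prompt Expansion
    },
)
print(output["outputs"][0])  # first output, as in the Getting Started example below
```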
Real-World Use Cases
Character-Driven Social Media Content
Create TikToks, Reels, and YouTube Shorts featuring consistent characters across multiple videos. Upload a few photos of your character or brand mascot, describe the scene, and generate on-brand content at scale. The Flash speed makes rapid iteration practical — test dozens of variations in the time the standard model produces a handful.
Marketing and Advertising Prototyping
Generate product demos, brand commercials, and campaign concepts featuring specific people or characters with consistent identity across all shots. Use the multi-shot mode to produce structured ad sequences complete with synchronized audio, cutting days of pre-production down to minutes.
Narrative Storytelling and Animation
Build short narrative sequences where characters maintain their appearance across scene changes. The multi-reference capability lets you establish multiple characters in a single generation, while multi-shot mode handles transitions and pacing automatically. Writers and storyboard artists can visualize scenes almost as fast as they can describe them.
Rapid Pre-Visualization for Film
Directors and cinematographers can pre-visualize shots and sequences using reference photos of actors and locations. The Flash model’s speed enables a live creative feedback loop — adjust the prompt, regenerate, and see the result in seconds rather than waiting through lengthy render queues.
E-Commerce and Product Videos
Transform static product photos into dynamic product videos with consistent branding. Upload product images as references, describe the desired motion and environment, and generate polished video content ready for listings and ads.
Getting Started on WaveSpeedAI
Using WAN 2.6 Reference-to-Video Flash through the WaveSpeedAI API is straightforward:
```python
import wavespeed

output = wavespeed.run(
    "alibaba/wan-2.6/reference-to-video-flash",
    {
        "reference_urls": [
            "https://example.com/character-front.jpg",
            "https://example.com/character-side.jpg"
        ],
        "prompt": "A woman walks through a sunlit garden, turning to smile at the camera",
        "size": "1280*720",
        "duration": 5,
        "shot_type": "multi"
    },
)
print(output["outputs"][0])
```
Configuration Options
| Parameter | Description |
|---|---|
| reference_urls | 1-5 reference images for character and scene guidance |
| prompt | Text description of the video scene and motion |
| size | Output resolution: 720p or 1080p, landscape or portrait |
| duration | Video length: 5 or 10 seconds |
| shot_type | single for one continuous shot, multi for varied compositions |
| enable_audio | Generate synchronized audio (enabled by default) |
| enable_prompt_expansion | Auto-enhance your prompt (disabled by default) |
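If you build these requests programmatically, a small client-side check against the documented ranges can catch mistakes before you spend a generation. The helper below is not part of any SDK; it is a sketch that assumes the size strings follow the same width*height format as the example above.

```python
# Hypothetical client-side validation against the documented parameter ranges.
ALLOWED_SIZES = {"1280*720", "720*1280", "1920*1080", "1080*1920"}  # assumed format

def validate_request(params: dict) -> None:
    refs = params.get("reference_urls", [])
    if not 1 <= len(refs) <= 5:
        raise ValueError("reference_urls must contain 1-5 images")
    if not params.get("prompt"):
        raise ValueError("prompt is required")
    if params.get("size") not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    if params.get("duration") not in (5, 10):
        raise ValueError("duration must be 5 or 10 seconds")
    if params.get("shot_type") not in ("single", "multi"):
        raise ValueError("shot_type must be 'single' or 'multi'")
```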
Pricing
| Resolution | Duration | Audio Off | Audio On |
|---|---|---|---|
| 720p | 5s | $0.25 | $0.50 |
| 720p | 10s | $0.375 | $0.75 |
| 1080p | 5s | $0.40 | $0.80 |
| 1080p | 10s | $0.60 | $1.20 |
Starting at just $0.25 per video — a fraction of what comparable models charge for identity-consistent generation.
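When budgeting a batch of generations, the pricing table is easy to encode directly: enabling audio doubles the per-video price at every resolution and duration. The sketch below simply mirrors the published prices and is not an official billing calculation.

```python
# Per-video prices from the pricing table above (USD).
PRICES = {
    ("720p", 5): 0.25,
    ("720p", 10): 0.375,
    ("1080p", 5): 0.40,
    ("1080p", 10): 0.60,
}

def estimate_cost(resolution: str, duration: int, audio: bool, videos: int = 1) -> float:
    base = PRICES[(resolution, duration)]
    per_video = base * 2 if audio else base  # enabling audio doubles the price
    return round(per_video * videos, 2)

# Example: 20 draft clips at 720p/5s without audio, plus one 1080p/10s final with audio.
print(estimate_cost("720p", 5, audio=False, videos=20))  # 5.0
print(estimate_cost("1080p", 10, audio=True))            # 1.2
```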
Pro Tips
- Use multiple reference images from different angles for the most accurate identity preservation
- Select the multi shot type for cinematic, dynamic compositions with automatic transitions
- Disable audio when you don’t need it — processing is faster and costs half as much
- Use 720p for rapid prototyping and drafts, then switch to 1080p for final production renders (see the sketch after these tips)
- Add a negative prompt like "blurry, distorted, deformed" to sharpen output quality
- If your generated video lacks sound, add phrasing like “with background ambience” to your prompt
Why WaveSpeedAI?
WaveSpeedAI provides the ideal infrastructure for WAN 2.6 Reference-to-Video Flash:
- No Cold Starts: Every request begins processing immediately — no waiting for model initialization
- Fast Inference: Optimized infrastructure paired with the Flash model’s distilled architecture means you get results in seconds
- Affordable Pricing: Identity-consistent video generation starting at $0.25, with transparent per-generation billing
- Simple REST API: Drop reference-to-video generation into any application or workflow with a single API call
Start Generating Today
Alibaba WAN 2.6 Reference-to-Video Flash brings identity-preserving video generation into real-time creative workflows. It’s the same multi-reference input, the same character consistency, and the same multi-shot storytelling — delivered at the speed your projects demand.
Whether you’re iterating on ad concepts, building a library of character-driven content, or pre-visualizing scenes for production, this model removes the wait and lets you focus on the creative work.
Try it now at wavespeed.ai/models/alibaba/wan-2.6/reference-to-video-flash.


