alibaba/wan-2.6/reference-to-video

Alibaba WAN 2.6 Reference-to-Video turns character, prop, or scene references (single or multi-view) into new video shots that preserve identity, style, and layout with smooth, coherent motion. Ready-to-use REST inference API with consistent performance, no cold starts, and affordable pricing.


README

Alibaba / WAN 2.6 — Reference-to-Video (wan2.6-ref2v)

WAN 2.6 Reference-to-Video is Alibaba’s WanXiang 2.6 model for turning reference videos and a text prompt into new shots. Provide up to two reference clips; the model learns their style, motion, and framing, then generates a new 5–10 s video at up to 1080p.

🚀 Highlights

  • Reference-driven motion & style – Mimic camera moves, pacing, and composition from your reference videos while following your prompt.
  • Up to two reference videos – Blend style from one clip and motion from another, or use different angles of the same scene.
  • Cinematic resolutions – Choose 720p or 1080p, in portrait or landscape.
  • Story-aware generation – Works with prompt expansion and multi-shot mode (shot_type: multi) to build richer, multi-shot sequences.
  • Audio-ready pipeline – Optional audio field for workflows that need motion aligned to external sound.

Output format: MP4 video at the selected size and duration.

🧩 Parameters

  • prompt* Text description of the new scene: characters, actions, environment, camera motion, mood, style, etc.

  • videos* 1–2 reference clips (URLs or uploads). These guide style, camera work, pacing, and motion structure.

  • negative_prompt Things to avoid, e.g. watermark, text, distortion, extra limbs.

  • audio (optional) External audio track for advanced pipelines where timing should loosely follow a given soundtrack. For most use cases you can leave this empty.

  • size One of the following resolution presets:

    • 1280×720 or 720×1280 → 720p
    • 1920×1080 or 1080×1920 → 1080p
  • duration Video length: 5 s or 10 s.

  • shot_type

    • single – Single-shot clip.
    • multi – When combined with enable_prompt_expansion, WAN 2.6 can break your idea into multiple shots of the same scene.
  • enable_prompt_expansion If enabled, Alibaba’s prompt optimizer expands short prompts into a richer internal script before generation.

  • seed Random seed. Set -1 for a new random result each time, or fix to a specific integer for reproducible layout and motion.
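
Putting the parameters together, a request body might look like the sketch below. This is a minimal illustration only: the endpoint URL, auth header, and the exact size string format are assumptions, so check the WaveSpeedAI API documentation for the real values.

```python
import requests

# Placeholder endpoint and key -- substitute the real values from the
# WaveSpeedAI API documentation / dashboard.
API_URL = "https://api.wavespeed.ai/api/v3/alibaba/wan-2.6/reference-to-video"
API_KEY = "YOUR_API_KEY"

payload = {
    # Required: describe the new scene, not just the references.
    "prompt": "Hero walking toward camera in a rainy neon alley, slow dolly-in",
    # Required: 1-2 reference clips that guide style, camera work, and pacing.
    "videos": ["https://example.com/reference-clip.mp4"],
    "negative_prompt": "watermark, text, distortion, extra limbs",
    "size": "1280*720",       # preset string; exact format is an assumption
    "duration": 5,            # 5 or 10 seconds
    "shot_type": "single",    # "single" or "multi"
    "enable_prompt_expansion": True,
    "seed": -1,               # -1 = new random result each run
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # typically returns a job/prediction id to poll
```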

💰 Pricing

| Resolution | Sizes (W×H) | 5 s | 10 s |
| --- | --- | --- | --- |
| 720p | 1280×720 / 720×1280 | $1.00 | $1.50 |
| 1080p | 1920×1080 / 1080×1920 | $1.50 | $2.25 |
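
If you need to budget runs programmatically, a simple lookup that mirrors the table above is enough (prices hard-coded from the table; update them if pricing changes):

```python
# Per-run prices in USD, copied from the pricing table above.
PRICES = {
    ("720p", 5): 1.00,
    ("720p", 10): 1.50,
    ("1080p", 5): 1.50,
    ("1080p", 10): 2.25,
}

def estimate_cost(resolution: str, duration_s: int, runs: int = 1) -> float:
    """Estimated total cost in USD for a number of runs."""
    return PRICES[(resolution, duration_s)] * runs

print(estimate_cost("1080p", 10, runs=4))  # -> 9.0
```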

✅ How to Use

  1. Prepare 1–2 reference videos

    • Clean motion, stable framing, and clear style work best.
    • You can use two angles of the same scene, or two stylistically similar clips.
  2. Write your prompt

    • Describe what should happen in the new video, not just what’s in the references.
    • Example: “Cyberpunk alley at night, hero walking toward camera, slow dolly-in, neon reflections on wet ground, cinematic color grading.”
  3. (Optional) Add a negative_prompt

    • Keep it short and focused: watermark, text, logo, extra limbs, low resolution.
  4. Choose size and duration

    • 720p/1080p according to your platform (Reels, TikTok, YouTube, etc.).
    • 5 s for quick shots, 10 s for more complex actions.
  5. Configure shot type & prompt expansion

    • Turn on enable_prompt_expansion for shorter prompts.
    • Set shot_type to multi if you want WAN 2.6 to create a multi-shot sequence.
  6. Set seed (optional)

    • Use a fixed seed to iterate while keeping composition similar.
  7. Run the model and download the generated clip.
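
End to end, the workflow above maps onto a submit-and-poll loop. The sketch below assumes an asynchronous API where you submit a job and poll for a result URL; the result endpoint path and response field names (`data`, `id`, `status`, `outputs`) are assumptions, not the documented contract.

```python
import time
import requests

API_BASE = "https://api.wavespeed.ai/api/v3"   # placeholder base URL
MODEL = "alibaba/wan-2.6/reference-to-video"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Steps 1-4: submit references, prompt, size, and duration.
job = requests.post(
    f"{API_BASE}/{MODEL}",
    headers=HEADERS,
    json={
        "prompt": "Cyberpunk alley at night, hero walking toward camera, "
                  "slow dolly-in, neon reflections on wet ground",
        "videos": ["https://example.com/ref-a.mp4"],
        "size": "1920*1080",
        "duration": 10,
        "enable_prompt_expansion": True,
        "seed": 42,  # fixed seed keeps composition similar while iterating
    },
    timeout=60,
).json()
job_id = job["data"]["id"]   # field names are assumptions

# Steps 5-7: poll until the job finishes, then download the MP4.
while True:
    result = requests.get(
        f"{API_BASE}/predictions/{job_id}/result", headers=HEADERS, timeout=60
    ).json()
    status = result["data"]["status"]
    if status in ("completed", "failed"):
        break
    time.sleep(5)

if status == "completed":
    video_url = result["data"]["outputs"][0]
    with open("wan26_ref2v.mp4", "wb") as f:
        f.write(requests.get(video_url, timeout=120).content)
```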

💡 Prompt & Reference Tips

  • Keep reference content and prompt aligned – if references show a city night scene, avoid asking for a sunny beach.

  • Use two references when you want to mix:

    • video A’s camera & motion + video B’s lighting/style.
  • Mention where you want the model to follow reference closely, e.g.: “Follow reference camera speed and angles, but change character outfit to futuristic armor.”

  • For portrait/vertical social content, select 720×1280 or 1080×1920; for YouTube-style landscape, use the corresponding wide resolutions (1280×720 or 1920×1080).
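
For the two-reference mixing described above, pass both clips and spell out in the prompt which aspects come from which reference. A hedged payload sketch (URLs are illustrative; how strongly each clip influences the result depends on the model):

```python
payload = {
    "prompt": (
        "Follow the camera speed and angles of the first reference, "
        "borrow the neon lighting and color grade of the second, "
        "and change the character outfit to futuristic armor."
    ),
    # Clip A contributes camera & motion; clip B contributes lighting/style.
    "videos": [
        "https://example.com/clip-a-camera-motion.mp4",
        "https://example.com/clip-b-lighting-style.mp4",
    ],
    "size": "1080*1920",  # vertical preset for Reels/TikTok-style output
    "duration": 5,
}
```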

More Models to Try

  • vidu/reference-to-video-q2 Vidu’s Q2 reference-to-video model for turning style and motion from example clips into new shots, ideal for anime-style edits, trailers, and storyboards.

  • google/veo3.1/reference-to-video Google Veo 3.1 reference-conditioned video generator, designed for high-fidelity cinematic motion that closely follows your reference footage.

  • kwaivgi/kling-video-o1/reference-to-video Kwaivgi’s Kling Video O1 reference-to-video model, great for copying camera language and pacing from a sample clip while changing characters or scenes.

  • bytedance/seedance-v1-lite/reference-to-video ByteDance SeeDance v1 Lite, a lightweight reference-to-video model for fast, style-consistent generations based on short example videos.