wavespeed-ai/wan-2.2/speech-to-video

Wan-2.2-S2V is a video model that generates high-quality videos from static images and audio, with realistic facial expressions and body movements. Our endpoint starts at $0.15 per 5 seconds of generated video (480p) and supports a maximum generation length of 120 seconds.

Your request will cost $0.15 per run.

For $10 you can run this model approximately 66 times.

README

Wan-2.2-S2V

What is Wan-2.2-S2V?

Wan-2.2-S2V is a video model that generates high-quality videos from static images and audio, with realistic facial expressions, body movements, and professional camera work for film and television applications.

Pricing

Our endpoint starts at $0.15 per 5 seconds of generated video at 480p, or $0.30 per 5 seconds at 720p, and supports a maximum generation length of 120 seconds.
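The arithmetic behind these rates can be sketched in a few lines of Python. This is only an estimate under stated assumptions: the function name `estimate_cost_usd` is illustrative, and the rule that a partial 5-second block is billed as a full block is an assumption, not something the pricing text confirms.

```python
import math

# Rates in cents per 5-second block, taken from the pricing text above.
RATE_CENTS_PER_BLOCK = {"480p": 15, "720p": 30}
BLOCK_SECONDS = 5
MAX_SECONDS = 120  # maximum generation length stated above


def estimate_cost_usd(duration_seconds: float, resolution: str = "480p") -> float:
    """Estimate the price of one generation in US dollars.

    Assumption (not stated in the source): a partial block is billed
    as a full 5-second block.
    """
    if resolution not in RATE_CENTS_PER_BLOCK:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not 0 < duration_seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds")
    blocks = math.ceil(duration_seconds / BLOCK_SECONDS)
    # Work in integer cents first to avoid accumulating float error.
    return blocks * RATE_CENTS_PER_BLOCK[resolution] / 100


print(estimate_cost_usd(5, "480p"))
print(estimate_cost_usd(120, "720p"))
```

Under these assumptions, a full 120-second clip at 720p is 24 blocks at $0.30 each, or $7.20.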

How Wan-2.2-S2V Works

Wan-2.2-S2V combines an audio encoder with a video diffusion model to interpret both the audio signal and the visual content of the input image.

Audio Analysis: Wan-2.2-S2V uses a powerful audio encoder (Wav2Vec) to understand the nuances of speech, including rhythm, tone, and pronunciation patterns.

Visual Understanding: Built on the robust Wan2.2 video diffusion model (you can visit our Wan2.2 workflow for t2v/i2v generation), Wan-2.2-S2V understands human anatomy, facial expressions, and body movements.

Perfect Synchronization: Through sophisticated attention mechanisms, Wan-2.2-S2V learns to perfectly align lip movements with audio while maintaining natural facial expressions and body language.

Instruction Following: Unlike simpler methods, Wan-2.2-S2V can follow text prompts to control the scene, pose, and overall behavior while maintaining audio synchronization.
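The inputs described above (a static image, an audio track, and an optional text prompt) could be assembled into a request body along these lines. This is a minimal sketch, not the endpoint's documented schema: the function `build_s2v_request` and the field names `image`, `audio`, `prompt`, and `resolution` are all hypothetical, so check the endpoint documentation for the real parameter names before sending anything.

```python
from typing import Optional

# The two resolutions named in the pricing section.
ALLOWED_RESOLUTIONS = ("480p", "720p")


def build_s2v_request(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    resolution: str = "480p",
) -> dict:
    """Assemble a JSON-serializable request body.

    All field names here are hypothetical placeholders, not the
    endpoint's confirmed parameter names.
    """
    if not image_url or not audio_url:
        raise ValueError("both an image and an audio input are required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")

    payload = {"image": image_url, "audio": audio_url, "resolution": resolution}
    # The prompt is optional: it steers scene, pose, and behavior,
    # while the audio drives lip synchronization.
    if prompt:
        payload["prompt"] = prompt
    return payload


request_body = build_s2v_request(
    "https://example.com/portrait.png",
    "https://example.com/speech.wav",
    prompt="A woman speaking calmly in a sunlit office",
)
print(request_body)
```

Keeping the prompt optional mirrors the description above: the image and audio are the required inputs, and the text prompt only adds control over the scene.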