audio-to-video

wavespeed-ai/hunyuan-avatar

Hunyuan Avatar is an audio-driven conversational AI video generation model. It creates talking or singing videos from a single image and an audio input. Our endpoint pricing starts at $0.15 per 5 seconds of generated video (480p/720p), and the maximum generation length is 120 seconds.

Your request will cost $0.15 per run.

For $10 you can run this model approximately 66 times.
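The pricing figures above can be checked with a little arithmetic. The sketch below is only an estimate: the rate ($0.15 per 5 seconds) and the 120-second cap come from the description, but the rule that a partial 5-second block is billed as a full block is an assumption, and the function names `estimate_cost_cents` and `runs_for_budget` are hypothetical helpers, not part of any WaveSpeed API.

```python
import math

PRICE_CENTS_PER_BLOCK = 15  # $0.15 per block, from the pricing text
BLOCK_SECONDS = 5           # billing unit, from the pricing text
MAX_SECONDS = 120           # maximum generation length, from the pricing text


def estimate_cost_cents(duration_seconds: float) -> int:
    """Estimate the cost of one run in cents.

    ASSUMPTION: partial 5-second blocks are rounded up to a full block.
    The page does not state how partial blocks are billed.
    """
    if duration_seconds <= 0 or duration_seconds > MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds")
    blocks = math.ceil(duration_seconds / BLOCK_SECONDS)
    return blocks * PRICE_CENTS_PER_BLOCK


def runs_for_budget(budget_usd: float, duration_seconds: float = BLOCK_SECONDS) -> int:
    """Number of whole runs of the given length that fit in a budget."""
    budget_cents = round(budget_usd * 100)  # work in integer cents
    return budget_cents // estimate_cost_cents(duration_seconds)


if __name__ == "__main__":
    for seconds in (5, 120):
        print(f"{seconds:>3} s -> ${estimate_cost_cents(seconds) / 100:.2f}")
    print("5 s runs for $10:", runs_for_budget(10))
```

Under these assumptions a 5-second run costs $0.15, a full 120-second run costs $3.60, and a $10 budget covers 66 runs of 5 seconds, which matches the figure quoted above.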

README

Hunyuan Avatar - High-Fidelity Audio-Driven Human Animation

Transform audio and images into high-quality AI avatar videos with Hunyuan Avatar, an advanced audio-driven human animation model designed for creating dynamic, emotion-controllable, and multi-character dialogue videos.

Overview

HunyuanAvatar is a high-fidelity, audio-driven human animation model that supports multiple characters. It generates highly dynamic videos while preserving character consistency, aligns each character's emotion with the audio, and enables multi-character audio-driven animation through a multimodal diffusion transformer (MM-DiT) architecture.

Key Capabilities

Create production-ready avatar videos with:

  • Character Consistency Preservation
      - Generate dynamic videos while maintaining strong character consistency
      - Character image injection module eliminates condition mismatch between training and inference
      - Fine-tune facial characteristics across different poses and expressions

  • Audio-Driven Animation
      - High-fidelity audio-driven human animation capabilities
      - Audio Emotion Module (AEM) extracts and transfers emotional cues from reference images
      - Face-Aware Audio Adapter (FAA) enables independent audio injection for multi-character scenarios

  • Multi-Character Support
      - Generate multi-character dialogue videos from single inputs
      - Independent audio injection via cross-attention for multiple characters
      - Realistic avatars in dynamic, immersive scenarios
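To make "independent audio injection via cross-attention" more concrete, here is a toy NumPy sketch of the general idea. It is not the model's implementation. It assumes each character has a binary mask over the video tokens and its own audio token sequence, and it omits the learned query/key/value projections a real adapter would have. The names `attend`, `inject_multi_character`, and the mask layout are all hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attend(video_tokens: np.ndarray, audio_tokens: np.ndarray) -> np.ndarray:
    """Plain scaled dot-product cross-attention (no learned projections).

    video_tokens: (N, d) queries; audio_tokens: (M, d) keys and values.
    Returns an (N, d) array of audio features gathered per video token.
    """
    d = video_tokens.shape[-1]
    scores = video_tokens @ audio_tokens.T / np.sqrt(d)  # (N, M)
    return softmax(scores, axis=-1) @ audio_tokens       # (N, d)


def inject_multi_character(video_tokens, audio_streams, face_masks):
    """Add each character's audio only to that character's tokens.

    ASSUMPTION: face_masks[i] is an (N,) array of 0/1 values marking the
    video tokens that belong to character i. Every stream attends from the
    original video tokens, so one character's audio never affects another.
    """
    out = video_tokens.copy()
    for audio, mask in zip(audio_streams, face_masks):
        out += mask[:, None] * attend(video_tokens, audio)
    return out
```

In this sketch, tokens outside every mask pass through unchanged, and swapping one character's audio changes only the tokens under that character's mask. That is the sense of "independent" used in the list above.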