WaveSpeed.ai
digital-human

Bytedance Avatar OmniHuman 1.5

bytedance/avatar-omni-human-1.5

OmniHuman 1.5 converts audio and visual cues into lifelike avatar animations for virtual humans, storytelling, and interactive agents. Ready-to-use REST inference API, high performance, no cold starts, affordable pricing.

Input

The form takes two file uploads (drag and drop or click to upload): a reference image and an audio track, each with a preview.

If enabled, the output will be encoded into a BASE64 string instead of a URL. This property is only available through the API.

Your request will cost $0.16 per run.

With $10 you can run this model approximately 62 times.


README

bytedance/avatar-omni-human-1.5

ByteDance Avatar Omni-Human 1.5 is an advanced vision-audio fusion model designed to animate avatars through cognitive and emotional simulation. By combining image and audio inputs, it brings static portraits to life — generating natural facial expressions, synchronized lip movements, and realistic emotional responses.

🧠 Concept

Inspired by the paper “Instilling an Active Mind in Avatars via Cognitive Simulation”, the model simulates attention, emotion, and cognition to create avatars that don’t just move — they react intelligently.

🌟 Key Features

  • Audio-Driven Realism: generates precise lip-sync and emotional nuance directly from voice input.

  • Expressive Cognitive Simulation: models subtle eye movements, micro-expressions, and reactive behavior to emulate human presence.

  • Universal Avatar Adaptation: works with any static portrait or illustration to create consistent, lifelike performance.

  • Cross-Domain Support: handles both photorealistic and stylized avatars, adapting its realism to the visual style.

  • Flexible Output Encoding: choose between URL output or BASE64 encoding for seamless integration via the API.

⚙️ Parameters

| Parameter | Description |
| --- | --- |
| image* | Upload a reference portrait or character image (JPG / PNG). |
| audio* | Upload or link to an audio file (WAV / MP3) for lip-sync and emotion mapping. |
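A request body for the two required parameters can be sketched as follows. This is a minimal illustration, not official client code: the field names mirror the parameter table, but the `enable_base64_output` flag name and the overall payload shape are assumptions; consult the WaveSpeed.ai API documentation for the real contract.

```python
import json

# Hypothetical payload builder for bytedance/avatar-omni-human-1.5.
# Only "image" and "audio" come from the parameter table; the rest is assumed.
def build_request(image_url: str, audio_url: str, base64_output: bool = False) -> dict:
    return {
        "image": image_url,   # required: reference portrait (JPG / PNG)
        "audio": audio_url,   # required: audio track (WAV / MP3)
        "enable_base64_output": base64_output,  # assumed name for the BASE64 option
    }

payload = build_request("https://example.com/portrait.jpg",
                        "https://example.com/voice.mp3")
print(json.dumps(payload, indent=2))
```

Sending the payload would then be a plain HTTP POST with your API key in a header; that part is omitted here to keep the sketch self-contained.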

💰 Pricing

| Metric | Price |
| --- | --- |
| Per second of generated audio | $0.25 / s |
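Because billing is per second of generated audio, total cost scales linearly with clip length. A quick back-of-the-envelope helper, using the $0.25/s rate from the table above:

```python
PRICE_PER_SECOND = 0.25  # USD per second of generated audio, per the table above

def estimate_cost(audio_seconds: float) -> float:
    """Estimated charge in USD for a clip of the given length."""
    return round(audio_seconds * PRICE_PER_SECOND, 2)

print(estimate_cost(10))  # a 10-second clip costs $2.50
```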

💡 Use Cases

  • Digital Avatars & VTubing — Drive realistic avatars from real voices in real time.
  • Virtual Humans & NPCs — Give game or metaverse characters believable cognitive reactions.
  • Marketing & Storytelling — Create expressive digital spokespeople or narrators.
  • AI Companions & Education — Build avatars that engage naturally in learning or dialogue contexts.

📝 Notes

  • The longer the audio, the higher the total cost (calculated per second).
  • For best results, use clear, high-quality audio and well-lit frontal images.
  • BASE64 output is API-only, useful for direct embedding into web applications.
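When BASE64 output is enabled, the API returns the video as an encoded string instead of a URL. Decoding it to a file needs only the standard library; where the string lives in the response is up to the API, so the snippet below takes it as a plain argument:

```python
import base64

def save_base64_output(b64_string: str, path: str) -> None:
    # Decode the BASE64 payload and write the raw video bytes to disk.
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_string))
```

For direct embedding in a web page, the same string can instead be placed in a `data:` URI (e.g. `data:video/mp4;base64,...`) without touching the filesystem.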