Character-AI Ovi | Image To Video With Audio From Text Or Image Inputs | WaveSpeedAI

character-ai/ovi/image-to-video

Ovi is a Veo-3-like image-to-video model that generates synchronized video and audio from text or text + image prompts. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

Your request will cost $0.15 per run.

For $10 you can run this model approximately 66 times.

README

Ovi (I2V Version)

Ovi is a Veo-3-like image-to-audio-video (I2AV) generation model that creates synchronized video and audio from a single image plus a descriptive text prompt.

It is designed for short-form storytelling, where a still image is brought to life with cinematic motion, dialogue, and sound.

🌟 Key Features

  • 🎬 Image → Video+Audio – Bring a static image to life with synchronized audiovisual output.
  • 📝 Prompt-driven – Use text prompts to control scene dynamics, style, and audio.
  • 🗣️ Speech & Sound – Insert dialogue or sound effects using special tags.
  • ⏱️ Short-form Output – Generates 5-second clips at 24 FPS.

💲 Pricing

Video Length    Cost
5 seconds       $0.15

Billing Rules

  • Minimum charge: 5 seconds
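
The pricing above reduces to simple arithmetic, which a short Python sketch can make explicit. It assumes a flat $0.15 per run, as listed in the table; the function names `runs_for_budget` and `cost_of_runs` are hypothetical helpers for illustration, not part of any API.

```python
from decimal import Decimal

# Flat price for one 5-second clip, taken from the pricing table.
PRICE_PER_RUN = Decimal("0.15")


def runs_for_budget(budget_usd: str) -> int:
    """Whole number of runs a budget covers (a partial run is not counted)."""
    return int(Decimal(budget_usd) // PRICE_PER_RUN)


def cost_of_runs(n_runs: int) -> Decimal:
    """Total cost in USD for a given number of runs."""
    return PRICE_PER_RUN * n_runs


print(runs_for_budget("10"))  # 66
print(cost_of_runs(66))       # 9.90
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.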

🎨 How to Use

  1. Upload Image

    • Provide a reference image as the base frame.
    • Make sure the URL is valid and accessible (a preview should appear).
  2. Enter Prompt

    • Describe scene motion, style, and atmosphere.

    • Use tags for sound:

      • <S> ... <E> → Speech (converted into spoken audio)
      • <AUDCAP> ... <ENDAUDCAP> → Background audio / effects
  3. Set Seed

    • -1 = random output
    • Any fixed number = reproducible results
  4. Run

    • Click Run ($0.15 per run) to generate your 5-second image-to-audio-video clip.
    • Preview and download the result.
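
The same steps can be prepared programmatically before calling the REST API. The sketch below only builds and validates a request body; it does not send anything. The field names `image`, `prompt`, and `seed` are assumptions that mirror the form fields above, and the endpoint, authentication, and actual request schema should be taken from the provider's API reference.

```python
import json
from urllib.parse import urlparse


def build_payload(image_url: str, prompt: str, seed: int = -1) -> dict:
    """Assemble a request body from the three inputs described above.

    The keys used here are hypothetical; check the API reference for the
    real schema.
    """
    parsed = urlparse(image_url)
    # Step 1: the image URL must be a well-formed http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid http(s) URL: {image_url!r}")
    # Step 2: a descriptive prompt is required.
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Step 3: -1 requests a random seed; a fixed number makes results reproducible.
    return {"image": image_url, "prompt": prompt, "seed": int(seed)}


payload = build_payload(
    "https://example.com/knight.png",  # placeholder URL
    "A wide shot of a medieval knight standing in the rain.",
    seed=42,
)
print(json.dumps(payload, indent=2))
```

Validating the URL locally only catches malformed input; whether the image is actually reachable still has to be checked by fetching it or by the preview on the page.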

📝 Prompt Example

A wide shot of a medieval knight standing in the rain, sword planted into the ground, glowing with mystical energy.  
<S>I will defend this land until my last breath.<E>  
<AUDCAP>Thunder rolls across the dark sky, distant war drums echo.<ENDAUDCAP>
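
A prompt in this shape can be assembled from its parts with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical function, and joining the parts with newlines is an assumption based on the layout of the example above. Only the `<S> ... <E>` and `<AUDCAP> ... <ENDAUDCAP>` tags come from this README.

```python
def build_prompt(scene, speech=(), audio_caption=None):
    """Combine a scene description, spoken lines, and an audio caption."""
    parts = [scene.strip()]
    # Each spoken line is wrapped in the speech tags.
    for line in speech:
        parts.append(f"<S>{line.strip()}<E>")
    # Background audio / effects go in a single audio-caption block.
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return "\n".join(parts)


prompt = build_prompt(
    "A wide shot of a medieval knight standing in the rain, "
    "sword planted into the ground, glowing with mystical energy.",
    speech=["I will defend this land until my last breath."],
    audio_caption="Thunder rolls across the dark sky, distant war drums echo.",
)
print(prompt)
```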

🙏 Acknowledgements

  • Wan2.2 – Video backbone initialization
  • MMAudio – Audio encoder/decoder inspiration

⭐ Citation

If Ovi is useful, please ⭐ the repo and cite the paper:

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284}, 
}