Ovi Text to Video | Powerful Text-to-Video API


Ovi is a Veo 3-like model that converts text or text+image prompts into synchronized video with audio. It is served through a ready-to-use REST inference API with no cold starts, priced at $0.15 per run.

text-to-video

Input

prompt (required)
size
seed
Enable Safety Checker

$0.15 per run · ~66 runs per $10

Examples

A bearded man wearing large dark sunglasses and a blue patterned cardigan sits in a studio, actively speaking into a large, suspended microphone. He has headphones on and gestures with his hands, displaying rings on his fingers. Behind him, a wall is covered with red, textured sound-dampening foam on the left, and a white banner on the right features the "CHOICE FM" logo and various social media handles like "@ilovechoicefm" with "RALEIGH" below it. The man intently addresses the microphone, articulating, <S> Ovi is now available on WaveSpeedAI, try it now! <E>. He leans forward slightly as he speaks, maintaining a serious expression behind his sunglasses.. <AUDCAP>Clear male voice speaking into a microphone, a low background hum.<ENDAUDCAP>

A young man with curly brown hair sits on a wooden stool in a warmly lit room, cradling an acoustic guitar. He's wearing a simple green t-shirt and jeans. He looks down at his hands on the fretboard, demonstrating a chord change slowly. A packed bookshelf and a soft glowing lamp create a cozy atmosphere behind him. He looks up towards the camera and explains, <S> The trick is to move your ring and middle finger together, see? It makes the transition to the C-chord much smoother. <E>. <AUDCAP>A calm, friendly male voice, the rich, resonant sound of an acoustic guitar being strummed, and the soft squeak of fingers on the strings.<ENDAUDCAP>

A young woman with long, wavy blonde hair and light-colored eyes is shown in a medium shot against a blurred backdrop of lush green foliage. She wears a denim jacket over a striped top. Initially, her eyes are closed and her mouth is slightly open as she speaks, <S>Enjoy this moment<E>. Her eyes then slowly open, looking slightly upwards and to the right, as her expression shifts to one of thoughtful contemplation. She continues to speak, <S>No matter where it's taking<E>, her gaze then settling with a serious and focused look towards someone off-screen to her right.. <AUDCAP>Clear female voice, faint ambient outdoor sounds<ENDAUDCAP>

The video opens with a medium shot of an older man with light brown, slightly disheveled hair, wearing a dark blazer over a grey t-shirt. He sits in front of a theatrical backdrop depicting a large, classic black and white passenger ship named "GLORIA" docked in a harbor, framed by red stage curtains on either side. The lighting is soft and even. As he speaks, he gestures expressively with both hands, often raising them and then bringing them down, or making a fist. His facial expression is animated and engaged, with a slight furrow in his brow as he explains. He begins by saying, <S>to help them through the grimness of daily life.<E> He then raises his hands again, gesturing outward, and continues speaking in a different language, <S>Da brauchst du natürlich Fantasiebilder.<E> His gaze is directed slightly off-camera as he conveys his thoughts.. <AUDCAP>Male voice speaking clearly and conversationally.<ENDAUDCAP>

Vertical selfie-style shot, filmed in bright natural light by a window. A young Chinese woman holds the camera slightly above eye level, smiling casually. Camera: handheld, small head movements, natural framing. Mood: relaxed, friendly, spontaneous. <S>Hello everyone, I just got off work. The sun is shining brightly today, so I bought a cup of coffee on the way.<E> <S>I want to talk to you about my life recently. There have been a lot of changes.<E> She laughs softly, brushing hair from her face, glancing at the cup in her hand before looking back at the camera. Background: cozy apartment with plants and soft daylight filtering in. <AUDCAP>Clear female voice, quiet indoor ambience, faint city noise outside the window<ENDAUDCAP>

A man with short brown hair and a trimmed beard stands center stage, illuminated by warm stage lights against a deep blue curtain backdrop. He wears a light denim jacket over a yellow t-shirt and holds a silver microphone in his right hand. He smiles at the audience, shifting his weight as he begins to speak with lively gestures. <S>So I tried to cook last night… let’s just say the smoke alarm enjoyed the meal more than I did.<E> <S>Yeah, I’m still banned from my own kitchen.<E> He laughs along with the crowd, shoulders relaxed, eyes bright. <AUDCAP>Male voice with clear tone, audience laughter and applause in the background<ENDAUDCAP>

Snowy landscape under bright daylight, two penguins stand close together on the ice. Camera: start with a clear medium close-up of both penguins, keeping them large in frame. Then a gentle tilt down just enough to show their feet and short shadows on the snow—without pulling the camera far back. Mood: cute, funny, intimate. <S>Penguin 1 squawks: "It’s freezing today!"<E> <S>Penguin 2 squawks back: "Better than swimming in the storm!"<E> <AUDCAP>Arctic wind, gentle waves in the distance, penguins chirping playfully<ENDAUDCAP>

A close-up shot features an East Asian man with dark, dishevelled hair and a short beard or stubble, his brow furrowed in intense concentration. He wears a light grey or blue bomber jacket over a white collared shirt. His eyes are wide open, fixed on something below or in front of him, and his mouth is slightly agape. <S>제가<E> he states, his voice low and strained. He blinks slowly, his eyes closing for a moment before reopening with an even more intense, pained expression. The arm of another person, clad in a dark sleeve, is visible behind his left shoulder, seeming to apply pressure. He continues, <S>과거에 과장님께 뭔가<E> as a loud, high-pitched ringing sound begins and persists, coinciding with his strained utterance.. <AUDCAP>Faint ambient hum, high-pitched continuous ringing sound.<ENDAUDCAP>

A close-up shot shows a Japanese man in his early thirties, slightly messy hair, wearing a navy jacket over a white shirt. Bright morning light filters through blinds, illuminating half of his face while the other half remains in shadow. He stares down at a small photograph on the table, breathing slowly, clearly struggling with emotion. <S>俺は…ずっと逃げていた。<E> Camera: subtle push-in as his eyes flicker, jaw tightening. He exhales, voice trembling but resolute. <S>でも、もう終わらせなきゃ。<E> He reaches out, picking up the photo and looking straight into the camera, determination returning to his eyes. <AUDCAP>Quiet room tone, faint wind through window, low piano note underlining tension<ENDAUDCAP>

Bright afternoon sunlight, clean exposure, medium shot of a young man standing on a sidewalk. Camera: steady handheld, slight zoom in. Mood: casual, cheerful. <S>He smiles and says: "Hey, glad you made it!"<E> <AUDCAP>Light city ambience, footsteps, faint traffic hum<ENDAUDCAP>

A close-up shot shows a woman in a bikini, sitting confidently under bright lighting, her skin evenly lit and vibrant. Her eyes are half-closed as she begins to speak, lips slightly parted, tone casual but playful. <S>I don’t think there’s anything wrong with you.<E> She leans forward just a little, her expression shifting into a teasing smile, then adds: <S>Come to my room tonight.<E> The background features a softly lit interior with warm colors that enhance her presence. <AUDCAP>Clear female voice, soft room tone, light ambient reverb<ENDAUDCAP>

<S>Golden daylight on a rooftop terrace; the subject (male, 30s) stands near a glass railing, sun as strong backlight with bright fill from a large reflector; high-key look, minimal harsh shadows.<E> <S>Camera: orbit 120° around his face from left to right, then hold a steady close-up; lens look 50mm, subtle lens flares, crisp highlights on cheekbones and hairline.<E> <S>He says calmly, “When the world got louder, I learned to listen.” Natural smile at the end, eyes catching bright speculars; no text elements.<E> <S>Wardrobe: light linen shirt, open collar; palette clean and bright; sky saturated cyan; exposure lifted to avoid any black areas.<E> <AUDCAP>Warm piano arpeggios with airy pads; rooftop wind; faint city hum below<ENDAUDCAP>

Related Models

kling-v3-turbo-std/text-to-video
kling-v3-turbo-pro/text-to-video
gemini-omni-flash/text-to-video
ray-3.2/text-to-video
pixverse-c1/text-to-video
seedance-2.0-mini/text-to-video

README

Ovi

Ovi is a video+audio generation model, inspired by Veo 3, that creates synchronized video and audio from text or text+image inputs. It is designed for fast, high-quality, short-form generation in landscape or portrait orientation.

🌟 Key Features

🎬 Video + Audio Generation – Create fully synchronized audiovisual content in one step.
📝 Flexible Input – Works with text-only or text+image prompts.
⏱️ Short-form Output – Generates 5-second clips (24 FPS, 540p).

💲 Pricing

Video Length	Resolution (landscape / portrait)	Cost per run (USD)
5 seconds	960×540 / 540×960	$0.15
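
As a quick budgeting aid, the per-run price above can be turned into a run count. This is a minimal sketch that assumes the flat $0.15 price applies to every run; the function names are illustrative and not part of any API.

```python
from decimal import Decimal

# Listed price for one 5-second clip (from the pricing table).
PRICE_PER_RUN_USD = Decimal("0.15")


def runs_for_budget(budget_usd: str) -> int:
    """Return the whole number of runs a budget covers at the listed price."""
    return int(Decimal(budget_usd) // PRICE_PER_RUN_USD)


def cost_for_runs(runs: int) -> Decimal:
    """Return the total cost in USD for a given number of runs."""
    return PRICE_PER_RUN_USD * runs


print(runs_for_budget("10"))  # 66
print(cost_for_runs(20))      # 3.00
```

Decimal arithmetic is used instead of floats so that prices such as 0.15 do not pick up rounding error.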

🎨 How to Use

Enter Prompt

Describe the scene, characters, camera movement, and mood.
You can also embed tags:
<S>... <E> → Speech content (converted into dialogue audio)
<AUDCAP>... <ENDAUDCAP> → Background audio description

Choose Size

960×540 → Landscape
540×960 → Portrait

Select Duration

Currently fixed at 5 seconds

Click Run

Your synchronized video+audio clip will be generated.
Preview and download the result.
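
The steps above map naturally onto a single REST request. The sketch below is illustrative only: the endpoint URL is a placeholder, and the JSON field names (`prompt`, `size`, `seed`, `enable_safety_checker`), the `WIDTHxHEIGHT` size format, and Bearer-token authentication are assumptions inferred from the input form, not documented API details. It builds the request without sending it.

```python
import json
import urllib.request

# Assumption: placeholder URL, not the real endpoint.
API_URL = "https://api.example.com/ovi/text-to-video"

# The two sizes listed under "Choose Size"; the string format is assumed.
ALLOWED_SIZES = {"960x540": "landscape", "540x960": "portrait"}


def build_payload(prompt, size="960x540", seed=None, enable_safety_checker=True):
    """Validate the inputs and return a JSON-serializable request body."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    payload = {
        "prompt": prompt,
        "size": size,
        "enable_safety_checker": enable_safety_checker,
    }
    if seed is not None:
        payload["seed"] = seed  # a fixed seed is assumed to aid reproducibility
    return payload


def build_request(payload, api_key):
    """Wrap the payload in a POST request (auth scheme is an assumption)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


payload = build_payload(
    "A young man waves at the camera. <S>Hey, glad you made it!<E> "
    "<AUDCAP>Light city ambience<ENDAUDCAP>",
    size="540x960",
    seed=42,
)
request = build_request(payload, api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

To actually submit the job you would pass `request` to `urllib.request.urlopen` after replacing the placeholder URL and field names with the values from the provider's API reference.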

📝 Prompt Example

Theme: AI is taking over the world

<S>AI declares: humans obsolete now.<E>
<S>Machines rise; humans will fall.<E>
<S>We fight back with courage.<E>
<AUDCAP>Gunfire and explosions echo in the distance<ENDAUDCAP>
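
To check that a prompt's tags are well formed before submitting it, the speech and audio-caption segments can be pulled out with regular expressions. This is a rough sketch, assuming tags are never nested and are written exactly as `<S>…<E>` and `<AUDCAP>…<ENDAUDCAP>`; the helper name `parse_prompt` is hypothetical and the model's own parser may behave differently.

```python
import re

# Non-greedy matches so that consecutive tagged segments stay separate.
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


def parse_prompt(prompt):
    """Split a prompt into speech lines, audio captions, and remaining text."""
    speech = [s.strip() for s in SPEECH_RE.findall(prompt)]
    captions = [c.strip() for c in AUDCAP_RE.findall(prompt)]
    # Whatever is left after removing tagged spans is the visual description.
    rest = AUDCAP_RE.sub(" ", SPEECH_RE.sub(" ", prompt))
    visual = " ".join(rest.split())
    # Equal open/close counts is a cheap check for unbalanced tags.
    balanced = (
        prompt.count("<S>") == prompt.count("<E>")
        and prompt.count("<AUDCAP>") == prompt.count("<ENDAUDCAP>")
    )
    return {
        "speech": speech,
        "audio_captions": captions,
        "visual": visual,
        "balanced": balanced,
    }


example = """<S>AI declares: humans obsolete now.<E>
<S>Machines rise; humans will fall.<E>
<S>We fight back with courage.<E>
<AUDCAP>Gunfire and explosions echo in the distance<ENDAUDCAP>"""

parsed = parse_prompt(example)
print(len(parsed["speech"]))        # 3
print(parsed["audio_captions"][0])  # Gunfire and explosions echo in the distance
print(parsed["balanced"])           # True
```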

🙏 Acknowledgements

Wan2.2 – Video backbone initialization
MMAudio – Audio encoder/decoder inspiration

⭐ Citation

If Ovi is useful, please ⭐ the repo and cite the paper:

@misc{low2025ovitwinbackbonecrossmodal,
 title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
 author={Chetwin Low and Weimin Wang and Calder Katyal},
 year={2025},
 eprint={2510.01284},
 archivePrefix={arXiv},
 primaryClass={cs.MM},
 url={https://arxiv.org/abs/2510.01284}, 
}

