
Ovi Text to Video


Ovi is a Veo 3-like model that converts text or text+image prompts into synchronized video with audio. It is available as a ready-to-use REST inference API with no cold starts and affordable pricing.

Examples

A bearded man wearing large dark sunglasses and a blue patterned cardigan sits in a studio, actively speaking into a large, suspended microphone. He has headphones on and gestures with his hands, displaying rings on his fingers. Behind him, a wall is covered with red, textured sound-dampening foam on the left, and a white banner on the right features the "CHOICE FM" logo and various social media handles like "@ilovechoicefm" with "RALEIGH" below it. The man intently addresses the microphone, articulating, <S> Ovi is now available on WaveSpeedAI, try it now! <E>. He leans forward slightly as he speaks, maintaining a serious expression behind his sunglasses.. <AUDCAP>Clear male voice speaking into a microphone, a low background hum.<ENDAUDCAP>

A young man with curly brown hair sits on a wooden stool in a warmly lit room, cradling an acoustic guitar. He's wearing a simple green t-shirt and jeans. He looks down at his hands on the fretboard, demonstrating a chord change slowly. A packed bookshelf and a soft glowing lamp create a cozy atmosphere behind him. He looks up towards the camera and explains, <S> The trick is to move your ring and middle finger together, see? It makes the transition to the C-chord much smoother. <E>. <AUDCAP>A calm, friendly male voice, the rich, resonant sound of an acoustic guitar being strummed, and the soft squeak of fingers on the strings.<ENDAUDCAP>

A young woman with long, wavy blonde hair and light-colored eyes is shown in a medium shot against a blurred backdrop of lush green foliage. She wears a denim jacket over a striped top. Initially, her eyes are closed and her mouth is slightly open as she speaks, <S>Enjoy this moment<E>. Her eyes then slowly open, looking slightly upwards and to the right, as her expression shifts to one of thoughtful contemplation. She continues to speak, <S>No matter where it's taking<E>, her gaze then settling with a serious and focused look towards someone off-screen to her right.. <AUDCAP>Clear female voice, faint ambient outdoor sounds<ENDAUDCAP>

The video opens with a medium shot of an older man with light brown, slightly disheveled hair, wearing a dark blazer over a grey t-shirt. He sits in front of a theatrical backdrop depicting a large, classic black and white passenger ship named "GLORIA" docked in a harbor, framed by red stage curtains on either side. The lighting is soft and even. As he speaks, he gestures expressively with both hands, often raising them and then bringing them down, or making a fist. His facial expression is animated and engaged, with a slight furrow in his brow as he explains. He begins by saying, <S>to help them through the grimness of daily life.<E> He then raises his hands again, gesturing outward, and continues speaking in a different language, <S>Da brauchst du natürlich Fantasiebilder.<E> His gaze is directed slightly off-camera as he conveys his thoughts.. <AUDCAP>Male voice speaking clearly and conversationally.<ENDAUDCAP>

Vertical selfie-style shot, filmed in bright natural light by a window. A young Chinese woman holds the camera slightly above eye level, smiling casually. Camera: handheld, small head movements, natural framing. Mood: relaxed, friendly, spontaneous. <S>Hello everyone, I just got off work. The sun is shining brightly today, so I bought a cup of coffee on the way.<E> <S>I want to talk to you about my life recently. There have been a lot of changes.<E> She laughs softly, brushing hair from her face, glancing at the cup in her hand before looking back at the camera. Background: cozy apartment with plants and soft daylight filtering in. <AUDCAP>Clear female voice, quiet indoor ambience, faint city noise outside the window<ENDAUDCAP>

A man with short brown hair and a trimmed beard stands center stage, illuminated by warm stage lights against a deep blue curtain backdrop. He wears a light denim jacket over a yellow t-shirt and holds a silver microphone in his right hand. He smiles at the audience, shifting his weight as he begins to speak with lively gestures. <S>So I tried to cook last night… let’s just say the smoke alarm enjoyed the meal more than I did.<E> <S>Yeah, I’m still banned from my own kitchen.<E> He laughs along with the crowd, shoulders relaxed, eyes bright. <AUDCAP>Male voice with clear tone, audience laughter and applause in the background<ENDAUDCAP>

Snowy landscape under bright daylight, two penguins stand close together on the ice. Camera: start with a clear medium close-up of both penguins, keeping them large in frame. Then a gentle tilt down just enough to show their feet and short shadows on the snow—without pulling the camera far back. Mood: cute, funny, intimate. <S>Penguin 1 squawks: "It’s freezing today!"<E> <S>Penguin 2 squawks back: "Better than swimming in the storm!"<E> <AUDCAP>Arctic wind, gentle waves in the distance, penguins chirping playfully<ENDAUDCAP>

A close-up shot features an East Asian man with dark, dishevelled hair and a short beard or stubble, his brow furrowed in intense concentration. He wears a light grey or blue bomber jacket over a white collared shirt. His eyes are wide open, fixed on something below or in front of him, and his mouth is slightly agape. <S>제가<E> he states, his voice low and strained. He blinks slowly, his eyes closing for a moment before reopening with an even more intense, pained expression. The arm of another person, clad in a dark sleeve, is visible behind his left shoulder, seeming to apply pressure. He continues, <S>과거에 과장님께 뭔가<E> as a loud, high-pitched ringing sound begins and persists, coinciding with his strained utterance.. <AUDCAP>Faint ambient hum, high-pitched continuous ringing sound.<ENDAUDCAP>

A close-up shot shows a Japanese man in his early thirties, slightly messy hair, wearing a navy jacket over a white shirt. Bright morning light filters through blinds, illuminating half of his face while the other half remains in shadow. He stares down at a small photograph on the table, breathing slowly, clearly struggling with emotion. <S>俺は…ずっと逃げていた。<E> Camera: subtle push-in as his eyes flicker, jaw tightening. He exhales, voice trembling but resolute. <S>でも、もう終わらせなきゃ。<E> He reaches out, picking up the photo and looking straight into the camera, determination returning to his eyes. <AUDCAP>Quiet room tone, faint wind through window, low piano note underlining tension<ENDAUDCAP>

Bright afternoon sunlight, clean exposure, medium shot of a young man standing on a sidewalk. Camera: steady handheld, slight zoom in. Mood: casual, cheerful. <S>He smiles and says: "Hey, glad you made it!"<E> <AUDCAP>Light city ambience, footsteps, faint traffic hum<ENDAUDCAP>

A close-up shot shows a woman in a bikini, sitting confidently under bright lighting, her skin evenly lit and vibrant. Her eyes are half-closed as she begins to speak, lips slightly parted, tone casual but playful. <S>I don’t think there’s anything wrong with you.<E> She leans forward just a little, her expression shifting into a teasing smile, then adds: <S>Come to my room tonight.<E> The background features a softly lit interior with warm colors that enhance her presence. <AUDCAP>Clear female voice, soft room tone, light ambient reverb<ENDAUDCAP>

<S>Golden daylight on a rooftop terrace; the subject (male, 30s) stands near a glass railing, sun as strong backlight with bright fill from a large reflector; high-key look, minimal harsh shadows.<E> <S>Camera: orbit 120° around his face from left to right, then hold a steady close-up; lens look 50mm, subtle lens flares, crisp highlights on cheekbones and hairline.<E> <S>He says calmly, “When the world got louder, I learned to listen.” Natural smile at the end, eyes catching bright speculars; no text elements.<E> <S>Wardrobe: light linen shirt, open collar; palette clean and bright; sky saturated cyan; exposure lifted to avoid any black areas.<E> <AUDCAP>Warm piano arpeggios with airy pads; rooftop wind; faint city hum below<ENDAUDCAP>

README

Ovi

Ovi is a next-generation video+audio generation model, inspired by veo-3, that creates synchronized video and audio from text or text+image inputs. It is designed for fast, high-quality, short-form generation with flexible aspect ratios.

🌟 Key Features

  • 🎬 Video + Audio Generation – Create fully synchronized audiovisual content in one step.
  • 📝 Flexible Input – Works with text-only or text+image prompts.
  • ⏱️ Short-form Output – Generates 5-second clips (24 FPS, 540p).

💲 Pricing

Video Length | Resolution / Aspect | Cost (USD)
5 seconds | 960×540 / 540×960 | $0.15
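As a rough budgeting aid, the flat per-clip price in the table can be turned into a run count. The sketch below assumes the $0.15 price applies to every run and works in integer cents to avoid floating-point rounding. `runs_for_budget` and `footage_seconds` are hypothetical helper names, not part of any WaveSpeedAI SDK.

```python
PRICE_PER_RUN_CENTS = 15  # $0.15 per 5-second clip, taken from the pricing table
CLIP_SECONDS = 5          # clip length is currently fixed at 5 seconds


def runs_for_budget(budget_usd: float) -> int:
    """Return how many whole runs a budget covers at the flat per-run price."""
    budget_cents = round(budget_usd * 100)
    return budget_cents // PRICE_PER_RUN_CENTS


def footage_seconds(budget_usd: float) -> int:
    """Return the total seconds of footage the budget buys."""
    return runs_for_budget(budget_usd) * CLIP_SECONDS


if __name__ == "__main__":
    print(runs_for_budget(10))   # 66
    print(footage_seconds(10))   # 330
```

A $10 budget therefore covers 66 clips, or 330 seconds of footage, if the price stays flat.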

🎨 How to Use

  1. Enter Prompt
  • Describe the scene, characters, camera movement, and mood.
  • You can also embed tags:
    • <S>...<E> → Speech content (converted into dialogue audio)
    • <AUDCAP>...<ENDAUDCAP> → Background audio description
  2. Choose Size
  • 960×540 → Landscape
  • 540×960 → Portrait
  3. Select Duration
  • Currently fixed at 5 seconds
  4. Click Run
  • Your synchronized video+audio clip will be generated.
  • Preview and download the result.
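The tag conventions from step 1 can be assembled programmatically. This is a minimal sketch, assuming the model only needs the tags to appear inline in the prompt string, as in the example prompts on this page. `build_prompt` is a hypothetical helper, not an SDK function.

```python
from typing import List, Optional


def build_prompt(scene: str, speech: List[str],
                 audio_caption: Optional[str] = None) -> str:
    """Join a scene description, speech lines, and an audio caption into one prompt."""
    parts = [scene.strip()]
    # Each spoken line is wrapped in <S>...<E>.
    parts.extend(f"<S>{line.strip()}<E>" for line in speech)
    # The background audio description is wrapped in <AUDCAP>...<ENDAUDCAP>.
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    # Drop empty pieces so a blank scene does not leave a leading space.
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    print(build_prompt(
        "Medium shot of a young man standing on a sidewalk.",
        ["Hey, glad you made it!"],
        "Light city ambience, footsteps, faint traffic hum",
    ))
```

The resulting string can be passed as the `prompt` input.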

📝 Prompt Example

Theme: AI is taking over the world

<S>AI declares: humans obsolete now.<E>
<S>Machines rise; humans will fall.<E>
<S>We fight back with courage.<E>
<AUDCAP>Gunfire and explosions echo in the distance<ENDAUDCAP>
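Before submitting a prompt like the one above, it can help to check that every tag is closed. The sketch below extracts speech lines and audio captions with regular expressions and reports whether opening and closing tags are balanced. `parse_prompt` is a hypothetical helper, and the balance check is only a local sanity test, not documented server-side behavior.

```python
import re

SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


def parse_prompt(prompt: str) -> dict:
    """Extract tagged segments and check that tags are balanced."""
    speech = [s.strip() for s in SPEECH_RE.findall(prompt)]
    captions = [a.strip() for a in AUDCAP_RE.findall(prompt)]
    # Compare raw counts of opening and closing tags.
    balanced = (
        prompt.count("<S>") == prompt.count("<E>")
        and prompt.count("<AUDCAP>") == prompt.count("<ENDAUDCAP>")
    )
    return {"speech": speech, "audio_captions": captions, "balanced": balanced}


if __name__ == "__main__":
    example = (
        "<S>AI declares: humans obsolete now.<E>\n"
        "<S>We fight back with courage.<E>\n"
        "<AUDCAP>Gunfire and explosions echo in the distance<ENDAUDCAP>"
    )
    print(parse_prompt(example))
```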

🙏 Acknowledgements

  • Wan2.2 – Video backbone initialization
  • MMAudio – Audio encoder/decoder inspiration

⭐ Citation

If Ovi is useful, please ⭐ the repo and cite the paper:

@misc{low2025ovitwinbackbonecrossmodal,
 title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
 author={Chetwin Low and Weimin Wang and Calder Katyal},
 year={2025},
 eprint={2510.01284},
 archivePrefix={arXiv},
 primaryClass={cs.MM},
 url={https://arxiv.org/abs/2510.01284}, 
}
Availability: This page uses AI models provided by third parties.

Ovi Text To Video API — Quick start

Get a WaveSpeedAI API key, then send a POST request to https://api.wavespeed.ai/api/v3/character-ai/ovi/text-to-video with your input as JSON. The endpoint returns a prediction ID; poll the prediction endpoint until status changes to completed, then read the output URL from data.outputs[0]. Examples for Ovi Text To Video follow.

HTTP example
# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/character-ai/ovi/text-to-video" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "prompt": "A cinematic shot of a city at sunset, soft golden light",
    "size": "960*540",
    "seed": -1
}'

# The response includes a prediction id. Substitute it for {request_id} and poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].
Node.js example
// npm install wavespeed
const WaveSpeed = require('wavespeed');

const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

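// Top-level await is not available in CommonJS modules, so wrap the call in an async function.
async function main() {
  const result = await client.run("character-ai/ovi/text-to-video", {
    prompt: "A cinematic shot of a city at sunset, soft golden light",
    size: "960*540",
    seed: -1
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch(console.error);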
Python example
# pip install wavespeed
import wavespeed

output = wavespeed.run(
    "character-ai/ovi/text-to-video",
    {
        "prompt": "A cinematic shot of a city at sunset, soft golden light",
        "size": "960*540",
        "seed": -1,
    },
)

print(output["outputs"][0])  # → URL of the generated output
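For callers who prefer not to use an SDK, the submit-then-poll flow from the HTTP example can be sketched with the Python standard library. Several details here are assumptions rather than documented behavior: that `status` sits under `data` next to `outputs`, that a failed run reports the status `"failed"` with an `error` field, and the 300-second default timeout. `fetch_result` and `wait_for_output` are hypothetical names.

```python
import json
import time
import urllib.request
from typing import Callable

RESULT_URL = "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result"


def fetch_result(request_id: str, api_key: str) -> dict:
    """GET the prediction result once and return the parsed JSON body."""
    req = urllib.request.Request(
        RESULT_URL.format(request_id=request_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_output(fetch: Callable[[], dict], interval: float = 2.0,
                    timeout: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch() until the prediction completes, then return data.outputs[0]."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch().get("data", {})
        status = data.get("status")  # assumed location of the status field
        if status == "completed":
            return data["outputs"][0]
        if status == "failed":  # assumed name of the failure status
            raise RuntimeError(f"prediction failed: {data.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("prediction did not complete in time")
        sleep(interval)


# Usage, with a real prediction id and API key:
# url = wait_for_output(lambda: fetch_result(request_id, api_key))
```

Passing the fetcher in as a callable keeps the polling loop testable without network access.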

Ovi Text To Video API — Frequently asked questions

What is the Ovi Text To Video API?

Ovi Text To Video is a video generation model from Character AI, exposed as a REST API on WaveSpeedAI. It converts text or text+image prompts into synchronized video with audio. You can call it programmatically or try it from the playground above.

How do I call the Ovi Text To Video API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/character-ai/character-ai-ovi-text-to-video.

How much does Ovi Text To Video cost per run?

Ovi Text To Video starts at $0.15 per run, which matches the listed price for a 5-second clip at either supported size. That figure is the base price: on WaveSpeedAI the final charge scales with the parameters you set in the form, so a larger output can cost more than a minimal one. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Ovi Text To Video accept?

Key inputs: `prompt`, `size`, `seed`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/character-ai/character-ai-ovi-text-to-video.
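A small client-side check can catch malformed requests before they are billed. The sketch below builds the JSON body from the three key inputs. The allowed sizes are inferred from the two resolutions listed in the README and the `960*540` string format used in the examples, so the `540*960` spelling is an assumption. The defaults mirror the examples on this page, not the published schema. `build_payload` is a hypothetical helper.

```python
ALLOWED_SIZES = {"960*540", "540*960"}  # landscape, portrait (string format assumed)


def build_payload(prompt: str, size: str = "960*540", seed: int = -1) -> dict:
    """Validate inputs and return the request body as a dict."""
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise ValueError("seed must be an integer")
    return {"prompt": prompt, "size": size, "seed": seed}


if __name__ == "__main__":
    print(build_payload("A cinematic shot of a city at sunset, soft golden light"))
```

The API reference linked above remains the authority on types, defaults, and allowed values.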

How long does Ovi Text To Video take to generate?

Average end-to-end generation time on WaveSpeedAI is around 68 seconds per request — measured across recent runs. Queue time scales with global demand; live status is visible in the prediction record.

Can I use Ovi Text To Video outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (Character AI). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.