Introducing NVIDIA Cosmos Predict 2.5 Text-to-Video on WaveSpeedAI
A New Dimension of AI Video Generation Arrives on WaveSpeedAI
The line between imagination and reality just got thinner. NVIDIA Cosmos Predict 2.5 Text-to-Video is now live on WaveSpeedAI — giving creators and developers the ability to generate cinematic video clips from nothing but a text description, powered by NVIDIA’s world foundation model technology, with no cold starts and simple flat pricing.
Cosmos Predict 2.5 is not just another text-to-video model. It is a World Foundation Model — a system designed to simulate and predict the physical world. Trained on 200 million curated video clips and refined through reinforcement learning-based post-training, it generates video that obeys the laws of physics. Rain falls downward. Leaves tumble convincingly in the wind. Light scatters through fog the way it does in the real world. The result is video that doesn’t just look good — it looks right.
What Is Cosmos Predict 2.5 Text-to-Video?
Cosmos Predict 2.5 Text-to-Video generates smooth, high-fidelity video clips from natural language descriptions alone. No reference images, no storyboards, no source footage required. Describe a scene — “a bustling Tokyo street at dusk, neon signs reflecting off rain-slicked pavement, pedestrians carrying umbrellas” — and the model creates a cinematic video clip that brings your words to life with realistic motion, lighting, and atmospheric effects.
The model is built on NVIDIA’s 2B parameter Cosmos Post-Trained architecture, a flow-based diffusion model that unifies text-to-video, image-to-video, and video-to-video capabilities into a single system. What sets it apart from other video generation models is its text encoder: Cosmos-Reason1, a Physical AI reasoning vision language model that doesn’t just parse your prompt — it reasons about the physical plausibility of the scene you describe. When you write “autumn leaves spiraling down from a maple tree,” the model understands that leaves don’t fall in straight lines, that wind creates asymmetric patterns, and that light filtering through a canopy creates shifting shadows on the ground.
On NVIDIA’s PAI-Bench evaluation, the Cosmos Predict 2.5-2B post-trained model achieves performance comparable to models many times its size. Despite having just 2 billion parameters, it matches the quality of the Wan 2.2 5B and Wan 2.1 14B models on diverse prompt sets — and leads the field in Image-to-World tasks with a top overall score of 0.810. This efficiency translates directly into faster inference and lower cost for you.
Key Features
- World Foundation Model Architecture: Built on NVIDIA’s purpose-built Cosmos platform, trained specifically to understand how the physical world works — not just what it looks like, but how it moves, how light behaves, and how objects interact.
- Physics-Grounded Generation: Water flows naturally, fabric drapes convincingly, shadows track with light sources, and atmospheric effects like fog, rain, and dust behave realistically. The model reasons about physical plausibility rather than hallucinating arbitrary motion.
- Pure Text-to-Video: Generate complete video clips from text alone. No reference images, no seed frames, no auxiliary inputs. Describe what you want and get a finished video.
- Built-In Prompt Enhancer: Not sure how to describe the exact scene in your head? The integrated Prompt Enhancer automatically refines your description, adding cinematic detail, atmospheric cues, and motion specifics that draw out the model’s best performance.
- Reinforcement Learning Refinement: Post-trained with an RLHF-style reward model called VideoAlign that evaluates text alignment, motion quality, and visual fidelity — ensuring the model consistently produces high-quality results that match your intent.
- Flat $0.25 Per Video: Every video costs exactly the same. No per-second billing, no resolution tiers, no surprise multipliers.
Real-World Use Cases
Cinematic Scene Generation
Cosmos Predict 2.5 excels at atmospheric, cinematic content. Describe a rain-soaked city street at night, a misty forest at dawn, or a desert highway at golden hour, and the model produces footage that rivals location shooting. Filmmakers and content creators can generate establishing shots, mood boards, and concept sequences without leaving their desk.
Social Media and Short-Form Content
At $0.25 per video, you can rapidly prototype and produce scroll-stopping content for Instagram Reels, TikTok, and YouTube Shorts. Generate multiple variations of a concept, A/B test different visual approaches, and ship the winner — all through a single API call. The flat pricing makes experimentation virtually risk-free.
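As a sketch of that variation workflow: the loop below builds one prompt per combination of lighting and camera treatment for a base concept. The wording of the variants is illustrative; each resulting prompt would be submitted through the `wavespeed.run` call shown in the Getting Started section below.

```python
import itertools

# Base concept plus two axes to A/B test: lighting and camera movement.
base = "a barista pouring latte art in a sunlit café"
lighting = ["golden hour lighting", "soft overcast light"]
camera = ["slow push-in", "static shot with shallow depth of field"]

# One prompt per combination; at a flat $0.25 per video,
# all four variants cost $1.00 to generate.
variants = [
    f"{base}, {light}, {cam}"
    for light, cam in itertools.product(lighting, camera)
]

for prompt in variants:
    print(prompt)
    # Each prompt would then be submitted, e.g.:
    # wavespeed.run("wavespeed-ai/cosmos-predict-2.5/text-to-video",
    #               {"prompt": prompt})
```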
Marketing and Advertising
Generate promotional video content at a fraction of traditional production costs. Product launches, seasonal campaigns, and brand storytelling all become faster when you can describe a scene and have production-quality video in seconds. Marketing teams can iterate on creative concepts in real time rather than waiting for production schedules.
Concept Visualization and Previsualization
Bring creative ideas to life before committing to expensive production. Directors can previsualize scenes, game designers can prototype environments, and architects can generate atmospheric walkthroughs — all from text descriptions. The model’s physics awareness means these previews are grounded in reality, making them useful for actual creative decision-making.
Storytelling and Narrative Content
Writers and narrative designers can see their stories come alive. Describe a sequence of scenes and generate visual companions for scripts, novels, presentations, or educational materials. The model’s understanding of natural motion and environmental effects creates immersive visuals that enhance any narrative.
Getting Started on WaveSpeedAI
Generating video with Cosmos Predict 2.5 Text-to-Video takes just a few lines of code:
```python
import wavespeed

output = wavespeed.run(
    "wavespeed-ai/cosmos-predict-2.5/text-to-video",
    {
        "prompt": "A quiet Japanese garden in autumn, golden maple leaves drifting slowly onto a still koi pond, soft afternoon light filtering through the canopy, gentle ripples spreading where each leaf touches the water",
    },
)
print(output["outputs"][0])
```
Tips for best results:
- Be specific and descriptive — include details about the environment, lighting, weather, and camera movement. “A rainy cobblestone alley in Paris at dusk, warm light spilling from café windows, puddles reflecting neon signs, slow tracking shot” will dramatically outperform “rainy street.”
- Use cinematic language — terms like “golden hour lighting,” “tracking shot,” “slow pan,” “shallow depth of field,” and “atmospheric haze” help the model generate more polished, professional-looking footage.
- Describe motion explicitly — don’t just set the scene. Tell the model what moves and how: “leaves spiraling downward,” “waves crashing against rocks,” “steam rising from a coffee cup.”
- Try the Prompt Enhancer — if your results aren’t matching your vision, enable the built-in Prompt Enhancer to automatically add the cinematic detail and specificity that draws out the model’s best work.
- Include mood and atmosphere — emotional tone and atmospheric details like “melancholic,” “ethereal,” “bustling energy,” or “serene stillness” give the model additional creative direction.
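The tips above can be rolled into a small helper that assembles a prompt from scene, lighting, motion, camera, and mood parts. This is an illustrative sketch only; `build_prompt` is not part of the WaveSpeedAI SDK, and the field names are arbitrary.

```python
def build_prompt(scene, lighting=None, motion=None, camera=None, mood=None):
    """Assemble a descriptive text-to-video prompt from optional parts,
    following the structure suggested in the tips above."""
    parts = [scene]
    for part in (lighting, motion, camera, mood):
        if part:
            parts.append(part)
    return ", ".join(parts)

prompt = build_prompt(
    scene="a rainy cobblestone alley in Paris at dusk",
    lighting="warm light spilling from café windows",
    motion="puddles rippling as pedestrians pass",
    camera="slow tracking shot",
    mood="melancholic",
)
print(prompt)
```

The resulting string can be passed directly as the `prompt` field in the `wavespeed.run` call shown above.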
Simple, Predictable Pricing
| Output | Cost |
|---|---|
| Per video | $0.25 |
No per-second billing, no resolution tiers, no hidden fees. Every video costs a flat $0.25 — making Cosmos Predict 2.5 one of the most affordable text-to-video solutions available at this quality level.
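With a single flat rate, budgeting reduces to one multiplication:

```python
PRICE_PER_VIDEO = 0.25  # flat rate, regardless of length or resolution

for n in (10, 100, 1000):
    print(f"{n} videos -> ${n * PRICE_PER_VIDEO:.2f}")
```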
Why Choose WaveSpeedAI for Cosmos Predict 2.5
- No Cold Starts: Every request hits a warm, ready-to-serve instance. Your video generation begins immediately — no waiting for model loading or GPU provisioning.
- Production-Ready REST API: Clean, well-documented endpoints that drop into any tech stack, content pipeline, or automated workflow with minimal integration effort.
- Elastic Scalability: Whether you’re generating one video a day or ten thousand an hour, WaveSpeedAI’s infrastructure scales seamlessly with your demand.
- Affordable at Any Volume: Flat per-video pricing with no minimums, no subscriptions, and no commitment. Pay only for what you generate.
- Complete Cosmos Ecosystem: Access the full Cosmos Predict 2.5 family — including Image-to-Video and Video-to-Video — alongside other leading models like Wan 2.6 Text-to-Video, all through a single API.
Start Creating Today
NVIDIA Cosmos Predict 2.5 Text-to-Video is live and ready on WaveSpeedAI. Whether you’re a creator looking to turn ideas into cinematic footage, a marketing team scaling video production, or a developer building AI-powered video features into your product, Cosmos Predict 2.5 delivers world-foundation-model quality, physics-aware generation, and dead-simple pricing — all from a text prompt.


