What Is NVIDIA Cosmos3-Nano? The 16B Omni World Model for Physical AI

NVIDIA Cosmos3-Nano is a 16B omnimodal world model for physical AI, combining reasoning, video, audio, image, and action generation in one open model.

By WaveSpeedAI

NVIDIA just released Cosmos3-Nano on Hugging Face, and it is not a normal text-to-video checkpoint. It is a 16B parameter omnimodal world model designed for Physical AI: robotics, autonomous vehicles, smart spaces, industrial environments, simulation, and action reasoning.

The short version: Cosmos3-Nano can take combinations of text, image, video, audio, and action trajectories as input, then produce text, image, video, audio, or action outputs. That makes it closer to a physical-world operating model than a single-purpose image or video generator.

You can read the model card here: nvidia/Cosmos3-Nano on Hugging Face. NVIDIA also published the broader Cosmos 3 announcement in its newsroom and a practical developer overview on the NVIDIA Technical Blog.

What Cosmos3-Nano is

Cosmos3-Nano is the smaller release in NVIDIA’s Cosmos 3 family. “Nano” does not mean tiny in the usual open-model sense. The model card lists it at 16B trainable parameters, with BF16 weights and support through NVIDIA’s Cosmos framework, vLLM-Omni, vLLM, PyTorch, and Hugging Face Diffusers.

The model is released under the OpenMDW 1.1 license, and the card says it is ready for both commercial and non-commercial use. NVIDIA lists the release date as May 31, 2026 on Hugging Face and GitHub.

The important part is the model type. Cosmos3-Nano is an omni world model:

  • It can reason over physical scenes.
  • It can generate video and image outputs.
  • It can generate ambient audio muxed into video.
  • It can generate or condition on action trajectories.
  • It can support future-state prediction and embodied policy workflows.

That puts it in a different category from models like Veo, Runway, Seedance, or Kling. Those are primarily creative video models. Cosmos3-Nano is aimed at developers who need a model to understand, simulate, and act in physical environments.

Why it matters

Most generative video models answer a creative question:

What should this scene look like?

Cosmos3-Nano tries to answer a physical AI question:

What is happening in this scene, what might happen next, and what action should follow?

That distinction matters for robotics and autonomous systems. A warehouse robot, delivery vehicle, factory camera, or embodied agent needs more than pretty video. It needs a useful internal model of the world.

For example:

  • A robot arm needs to understand how objects move when grasped.
  • An autonomous vehicle needs future-state prediction under changing road conditions.
  • A smart-space system needs to reason about activity in fixed-camera footage.
  • A simulation pipeline needs synthetic data that preserves physical plausibility.
  • A robot policy model needs action trajectories, not just visual frames.

Cosmos3-Nano is NVIDIA’s attempt to put those capabilities under one open model family.

The architecture: Mixture-of-Transformers

NVIDIA describes Cosmos 3 as a Mixture-of-Transformers architecture. The model combines two complementary transformer systems:

| Component | Role |
| --- | --- |
| Autoregressive transformer | Generates discrete tokens such as text |
| Diffusion transformer | Generates continuous modalities such as images, video, audio, and actions |

This is a pragmatic design. Text generation and video generation are not well served by the same decoding process. Text works well with next-token autoregressive decoding. Images, video, and audio often work better as denoising problems.

The architecture keeps both mechanisms in one model framework, so Cosmos 3 can reason and generate across heterogeneous modalities without pretending that every output should be decoded the same way.
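The split between the two decoding paths can be pictured as a dispatch on output modality. This is a conceptual sketch only, not NVIDIA's implementation: the `Modality` enum and the `decoder_for` helper are hypothetical names invented for illustration, and the real model keeps both transformers inside one framework rather than selecting between them with a lookup.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    ACTION = "action"


# Assumption for illustration: text is the only discrete-token modality here.
DISCRETE_MODALITIES = {Modality.TEXT}


def decoder_for(modality: Modality) -> str:
    """Return the decoding path that would handle the given output modality."""
    if modality in DISCRETE_MODALITIES:
        return "autoregressive"  # next-token decoding for discrete tokens
    return "diffusion"  # iterative denoising for continuous outputs
```

Calling `decoder_for(Modality.TEXT)` returns `"autoregressive"`, while every other modality in the enum maps to `"diffusion"`, mirroring the two rows of the table.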

That is the technical reason Cosmos3-Nano feels different from a “video model plus captioner” stack. The goal is not to bolt a vision-language model onto a generator. The goal is a unified physical AI model that can reason, generate, and act.

Input and output capabilities

According to the model card, Cosmos3-Nano supports a wide input and output surface.

Generator inputs:

  • text
  • image
  • video with or without audio
  • action trajectory

Generator outputs:

  • image
  • video
  • audio
  • action
  • text

The model card lists common image and video formats such as JPG, PNG, WEBP, and MP4. Video inputs can be 256p, 480p, or 720p, and input video is capped at 5 frames for the generator path. Audio inputs are short, with a maximum length of 0.5 seconds. Action inputs cover multiple embodiments, including camera motion, autonomous vehicles, egocentric motion, Franka arms, Agibot, UR, Google robot, WidowX 250, and UMI.

For outputs, generated video is MP4. Generated audio is encoded as AAC and muxed into the video file. Video generation can run from 5 to 400 frames, with 189 frames listed as the default frame count.
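Those limits are easy to check before a request ever reaches the GPU. The sketch below encodes the numbers quoted above in a small validator. The `GenerationRequest` shape and its field names are hypothetical and do not reflect the real Cosmos or vLLM-Omni request schema, and reading "256p" as a frame height of 256 pixels is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits as quoted from the model card in this article.
ALLOWED_INPUT_HEIGHTS = {256, 480, 720}  # assumed to mean 256p, 480p, 720p
MAX_INPUT_VIDEO_FRAMES = 5
MAX_INPUT_AUDIO_SECONDS = 0.5
MIN_OUTPUT_FRAMES = 5
MAX_OUTPUT_FRAMES = 400
DEFAULT_OUTPUT_FRAMES = 189


@dataclass
class GenerationRequest:
    """Hypothetical request shape, used only for this illustration."""
    prompt: str
    input_video_height: Optional[int] = None
    input_video_frames: int = 0
    input_audio_seconds: float = 0.0
    output_frames: int = DEFAULT_OUTPUT_FRAMES


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request passes."""
    problems: List[str] = []
    if (req.input_video_height is not None
            and req.input_video_height not in ALLOWED_INPUT_HEIGHTS):
        problems.append(f"unsupported input height: {req.input_video_height}")
    if req.input_video_frames > MAX_INPUT_VIDEO_FRAMES:
        problems.append(f"too many input frames: {req.input_video_frames}")
    if req.input_audio_seconds > MAX_INPUT_AUDIO_SECONDS:
        problems.append(f"input audio too long: {req.input_audio_seconds}s")
    if not MIN_OUTPUT_FRAMES <= req.output_frames <= MAX_OUTPUT_FRAMES:
        problems.append(f"output frame count out of range: {req.output_frames}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every violation in a single response.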

That combination is unusual. Most public video models expose text-to-video and image-to-video. Cosmos3-Nano exposes a more physical AI-oriented set of controls, including action-conditioned generation and action prediction.

What developers can build with it

Cosmos3-Nano is most interesting when the product is not a consumer video editor.

Synthetic data generation

Physical AI systems need data that covers edge cases. Real-world collection is expensive, slow, and sometimes unsafe. Cosmos3-Nano can help generate scenes, future states, and physical interactions that augment real datasets.

This does not mean synthetic data replaces real data. It means teams can expand coverage around rare weather, unusual object layouts, long-tail traffic situations, warehouse interactions, or robot manipulation states.

Future-state prediction

A world model should help estimate what happens next. If a camera sees a moving vehicle, a box sliding on a belt, or a robot arm approaching an object, the model can be used as part of a prediction workflow.

NVIDIA is careful about safety here. Cosmos 3 outputs should not be treated as certified physical truth. For autonomous systems, generated predictions need external constraints, validation, and system-level guardrails.

Robot action reasoning

The action trajectory support is the most important differentiator. Cosmos3-Nano can condition on action sequences and generate physical rollouts, or infer action-like outputs from visual context.

That makes it relevant to:

  • robot policy development
  • manipulation planning
  • inverse dynamics experiments
  • data augmentation for embodied agents
  • sim-to-real research

Again, this is a research and development building block, not a drop-in safety controller.

Video and audio generation for physical scenes

Cosmos3-Nano can also generate video with ambient sound. The model card includes examples for text-to-video, image-to-video, text-to-video plus audio, and image-to-video plus audio through a local vLLM-Omni endpoint.

For content creators, that may sound like a competitor to video models such as Kling or Seedance. But the stronger use case is physical-scene generation: roads, warehouses, robots, indoor spaces, cameras, object motion, and environments where physical consistency matters more than stylized cinematic output.

How to run Cosmos3-Nano

The Hugging Face card shows three main paths:

  • NVIDIA Cosmos framework
  • vLLM-Omni serving
  • Hugging Face Diffusers

For deployment, NVIDIA recommends vLLM-Omni for an OpenAI-compatible API endpoint. A typical serving command looks like this:

vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --host 0.0.0.0 \
  --port 8000
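Once the server is up, a quick sanity check is to list the models it serves. The sketch below assumes the endpoint exposes the standard OpenAI-style `/v1/models` route, as vLLM's OpenAI-compatible server does, and that it listens on the host and port from the command above. It deliberately avoids the generation routes, whose request format should be taken from the model card rather than guessed. The helper names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def models_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of the OpenAI-style model listing route."""
    return f"http://{host}:{port}/v1/models"


def list_served_models(host: str = "localhost", port: int = 8000,
                       timeout: float = 10.0) -> List[str]:
    """Return the model ids reported by a running server (requires network access)."""
    with urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-style listing: {"object": "list", "data": [{"id": ...}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]
```

If the server started correctly, `list_served_models()` should include the model name passed to `vllm serve`.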

For simple Diffusers experimentation, the Hugging Face card also exposes the familiar pipeline pattern:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("A warehouse robot inspecting stacked crates").images[0]

For serious video or action workflows, use the Cosmos framework examples instead of treating this as a generic text-to-image model. The advanced examples rely on JSON-upsampled prompts, negative prompts, multipart requests, and model-specific parameters like frame count, FPS, inference steps, guidance scale, and action metadata.

Hardware and deployment notes

This is not a model you run on a laptop. The model card lists GB200 and H100 as test hardware, with NVIDIA Ampere, Hopper, and Blackwell as supported hardware microarchitecture families. Only BF16 precision is officially tested.

The card also says the model is not deployed by a Hugging Face Inference Provider at the time of writing. In practice, that means most developers will evaluate it through local NVIDIA GPU infrastructure, custom hosted inference, NIM-style deployment, or a specialized API provider once support lands.

For teams building production systems, the deployment question is not only “can I run it?” It is:

  • Can I keep latency low enough for the task?
  • Can I afford enough GPU memory for the target resolution and frame count?
  • Can I validate outputs against domain-specific constraints?
  • Can I log prompts, inputs, seeds, generated outputs, and action metadata for review?
  • Can I fail safely when the generated world state is uncertain?
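The logging question in particular is cheap to answer early. Below is a minimal sketch of an audit record that captures the prompt, seed, content hashes of inputs and outputs, and any action metadata as one JSON line. The `GenerationRecord` fields are hypothetical choices for illustration, not a format defined by NVIDIA.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hash raw bytes so large media files can be referenced without storing them inline."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class GenerationRecord:
    """Hypothetical audit record for one generation call."""
    prompt: str
    seed: int
    input_hashes: List[str]
    output_hash: str
    action_metadata: Dict[str, str] = field(default_factory=dict)


def to_log_line(record: GenerationRecord) -> str:
    """Serialize the record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storing hashes rather than raw media keeps the log small while still letting a reviewer confirm which input produced which output.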

Cosmos3-Nano is open, but physical AI deployment still needs serious infrastructure.

How Cosmos3-Nano compares with creative video models

Cosmos3-Nano should not be evaluated only by the same criteria as consumer video generators.

| Model type | Primary goal | Best fit |
| --- | --- | --- |
| Creative video model | Generate attractive clips | ads, social video, cinematic b-roll |
| Multimodal video editor | Revise media with mixed inputs | creator workflows, product content |
| World model | Reason about and simulate physical scenes | robotics, AV, smart spaces, synthetic data |
| Action model | Predict or generate control trajectories | embodied policy learning |

Seedance, Kling, Runway, and Veo are still better references for creator-facing video quality. Cosmos3-Nano is more relevant when the generated scene must connect to physical reasoning or action.

That is why this release matters. The market is splitting. AI video is one branch. Physical AI world models are another.

Limitations and safety

NVIDIA’s model card is explicit that Cosmos 3 outputs should not be treated as physically accurate simulation, ground-truth reasoning, or safety-certified decision making.

That is the right warning. A generated future state can be useful as a hypothesis, but it is not a verified measurement. A generated action can be useful for research, but it is not automatically safe control. A synthetic scene can help expand a dataset, but it can still encode bias, missing cases, or physically implausible details.

Production teams should pair Cosmos3-Nano with:

  • domain validation
  • conventional simulators where exact physics matter
  • safety filters and guardrails
  • human review for high-risk outputs
  • real-world evaluation data
  • fallbacks when predictions are uncertain
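As one illustration of the guardrail and fallback items, the sketch below gates a predicted action behind two checks before it reaches a controller. Everything here is an assumption made for the example: the per-step limit, the `uncertainty` score (which would have to come from an external validator, since the article does not describe the model emitting one), and the function name itself.

```python
import math
from typing import List, Sequence


def gate_action(predicted: Sequence[float],
                fallback: Sequence[float],
                max_abs_step: float,
                uncertainty: float,
                max_uncertainty: float) -> List[float]:
    """Return the predicted action only if it passes both checks, else the fallback."""
    # Check 1: an external uncertainty estimate must be low enough.
    if uncertainty > max_uncertainty:
        return list(fallback)
    # Check 2: every component must be finite and within the per-step limit.
    if any(not math.isfinite(v) or abs(v) > max_abs_step for v in predicted):
        return list(fallback)
    return list(predicted)
```

A gate like this does not make a generated action safe. It only ensures that out-of-range or low-confidence outputs are replaced by a known-safe default, such as a zero-motion command.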

The model is powerful because it unifies modalities. That also means mistakes can propagate across modalities. If the reasoning step misreads a scene, the generated video, audio, or action can inherit that error.

Why this matters for AI platforms

Cosmos3-Nano points to a broader shift: model APIs are moving beyond text, image, and video endpoints into world-state APIs.

Future AI platforms will not only expose:

generate_image(prompt)
generate_video(prompt)

They will expose higher-level physical AI tasks:

predict_next_state(scene, action)
generate_synthetic_rollout(environment, policy)
infer_action(video)
simulate_edge_case(route, weather, actors)

For developers, that means model routing gets more important. A request for a TikTok ad should go to a fast creative video model. A request for warehouse robot synthetic data should go to a physical AI world model. A request for a controlled action rollout should use an action-aware model, not a generic image-to-video endpoint.
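A routing layer for those three cases can start as a plain lookup. The sketch below is illustrative only: the `TaskKind` values and the backend labels are hypothetical placeholders, not real endpoint or model names.

```python
from enum import Enum


class TaskKind(Enum):
    CREATIVE_VIDEO = "creative_video"
    SYNTHETIC_PHYSICAL_DATA = "synthetic_physical_data"
    ACTION_ROLLOUT = "action_rollout"


# Hypothetical backend labels; replace with whatever a platform actually serves.
ROUTES = {
    TaskKind.CREATIVE_VIDEO: "fast-creative-video-model",
    TaskKind.SYNTHETIC_PHYSICAL_DATA: "physical-ai-world-model",
    TaskKind.ACTION_ROLLOUT: "action-aware-world-model",
}


def route(task: TaskKind) -> str:
    """Pick a backend label for the given kind of request."""
    return ROUTES[task]
```

Under this mapping, a TikTok ad request (`TaskKind.CREATIVE_VIDEO`) goes to the creative video backend, while a controlled rollout (`TaskKind.ACTION_ROLLOUT`) goes to an action-aware one.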

Cosmos3-Nano is one of the clearest public examples of that next category.

Final take

NVIDIA Cosmos3-Nano is a 16B open omnimodal world model for physical AI. Its value is not just that it can generate video, images, audio, text, and actions. Its value is that those capabilities live inside a model family built for reasoning about physical environments.

If you are building creator tools, Cosmos3-Nano may be less immediately useful than Seedance, Kling, Runway, or Veo. If you are building robotics workflows, autonomous-system datasets, smart-space perception, or synthetic physical-world training data, it is one of the most important releases to test this month.

The big idea is simple: AI generation is moving from media creation toward physical-world simulation. Cosmos3-Nano is NVIDIA’s open entry point into that shift.

Sources