{
"output": "A woman with wavy brown hair, wearing a white sweater and blue denim with a belt, is holding a black cassette in her right hand and looking at it. She is in a room with two cream-colored couches on either side, a brown table in the middle, and a lamp on it, along with two photo frames. Behind her, there are three windows with white curtains and brown drapes on the sides."
}
$0.006 per run · ~166 runs per $1
NVIDIA Nemotron-3 Nano Omni Video is a multimodal video-language model for understanding and analyzing video content. Provide a video URL and an English prompt, and the model generates a text response for tasks such as video description, scene understanding, event summarization, and visual question answering over time-based media.
Video understanding with natural-language prompts
Ask questions about a video or request summaries, descriptions, and structured analysis in plain English.
Temporal scene analysis
Understand actions, events, transitions, and visual context across time rather than from a single frame.
Flexible response control
Adjust max_tokens, temperature, and top_p to balance response length, determinism, and creativity.
Optional system steering
Use system_prompt to guide output style, response format, or task behavior for more controlled results.
Reasoning mode options
Choose between no_think and think depending on your preferred response mode and workflow.
Production-ready API
Suitable for video analysis pipelines, multimodal assistants, content review systems, and automated media understanding workflows.
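
To make the "determinism versus creativity" trade-off behind temperature and top_p concrete, here is a minimal sketch of the standard temperature-scaling and nucleus-sampling technique. It is a generic illustration, not a description of how this model's serving stack implements sampling; the logits and the helper names `apply_temperature` and `top_p_filter` are made up for the example.

```python
import math

def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Hypothetical scores for four candidate tokens.
logits = [2.0, 1.0, 0.5, -1.0]

sharp = apply_temperature(logits, 0.2)    # low temperature: mass concentrates on the top token
default = apply_temperature(logits, 0.7)  # the documented default temperature
filtered = top_p_filter(default, 0.95)    # the documented default top_p

print("top-token probability at 0.2:", round(max(sharp), 3))
print("top-token probability at 0.7:", round(max(default), 3))
print("tokens kept by top_p=0.95:", sorted(filtered))
```

Lower temperature pushes probability toward the single most likely token, while a smaller top_p removes the unlikely tail before sampling.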
| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | English text prompt sent to the model. |
| video_url | Yes | URL of the video to analyze. |
| system_prompt | No | Optional system prompt used to steer behavior, tone, or response style. |
| reasoning_mode | No | Reasoning mode: no_think (default) or think. |
| max_tokens | No | Maximum number of tokens to generate. Default: 1024. |
| temperature | No | Sampling temperature. Lower values are more deterministic. Default: 0.7. |
| top_p | No | Nucleus sampling probability mass. Default: 0.95. |

Choose no_think or think depending on your workflow.
Adjust max_tokens, temperature, and top_p.
Example prompt: Describe this video in detail, including the setting, key actions, important scene changes, visible subjects, and the overall mood.
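
The parameters and defaults in the table above can be assembled into a request body as in the sketch below. The flat JSON shape, the helper name `build_payload`, and the placeholder video URL are assumptions for illustration; this page does not give the endpoint or authentication details, so the sketch stops at building and validating the payload.

```python
import json

# Defaults copied from the parameter table.
DEFAULTS = {
    "reasoning_mode": "no_think",
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.95,
}

def build_payload(prompt, video_url, system_prompt=None, **overrides):
    """Assemble a request body (hypothetical helper, not part of any SDK)."""
    if not prompt or not video_url:
        raise ValueError("prompt and video_url are both required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Start from the documented defaults, then apply caller overrides.
    payload = {"prompt": prompt, "video_url": video_url, **DEFAULTS, **overrides}
    if payload["reasoning_mode"] not in ("no_think", "think"):
        raise ValueError("reasoning_mode must be 'no_think' or 'think'")
    # system_prompt is optional, so it is only included when provided.
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return payload

payload = build_payload(
    prompt="Describe this video in detail, including the setting and key actions.",
    video_url="https://example.com/clip.mp4",  # placeholder, not a real video
    system_prompt="Answer as a short bullet list.",
    temperature=0.2,  # lower temperature for more stable answers
)
print(json.dumps(payload, indent=2))
```

Unset options fall back to the documented defaults, so only the values that differ need to be passed.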
Billed by configured max_tokens.
| Max Tokens | Cost |
|---|---|
| 1000 | $0.006 |
| 1024 | $0.0061 |
| 2000 | $0.012 |
| 4000 | $0.024 |
| 8000 | $0.048 |

Pricing is based on the configured max_tokens value.
Increasing max_tokens increases cost linearly.
prompt, video_url, system_prompt, reasoning_mode, temperature, and top_p do not change pricing directly.

Use system_prompt when you need a consistent output format, such as bullet summaries, labeled sections, or structured JSON-like responses.
Keep temperature lower when you want more stable and deterministic answers.
Increase max_tokens only when you need longer outputs, since pricing is tied to that value.

Both prompt and video_url are required.
prompt must be written in English.
Defaults are reasoning_mode = no_think, max_tokens = 1024, temperature = 0.7, and top_p = 0.95.
Pricing depends on max_tokens, not on other generation settings.
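
The linear pricing in the table above can be sketched as a small estimator. The rate is read off the 1000-token row ($0.006); the function names are hypothetical, and the table appears to round to four decimal places, so treat the results as estimates rather than billing figures.

```python
PRICE_PER_1000_TOKENS = 0.006  # USD, taken from the 1000-token row of the pricing table

def estimate_cost(max_tokens):
    """Estimated USD cost of one run, assuming strictly linear pricing."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return max_tokens * PRICE_PER_1000_TOKENS / 1000

def runs_per_dollar(max_tokens):
    """Whole number of runs that one dollar covers at this max_tokens setting."""
    # The small epsilon guards against floating-point results just below an integer.
    return int(1 / estimate_cost(max_tokens) + 1e-9)

for tokens in (1000, 1024, 2000, 4000, 8000):
    print(tokens, round(estimate_cost(tokens), 4), runs_per_dollar(tokens))
```

At the default of 1024 tokens the estimate is $0.006144, which matches the table's $0.0061 after rounding.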