
video-to-text
Your request will cost $0.005 per run.
For $1 you can run this model approximately 200 times.
Molmo2 Video QA is a video understanding model that answers natural-language questions about video content. Upload a video and ask a question; the model analyzes visual scenes, actions, objects, and context and returns a natural-language answer.
Built for developers and creators who need intelligent video comprehension without building complex pipelines.
- **Natural language understanding:** Ask questions in plain English about what happens in your video, with no need for timestamps or frame-by-frame annotation.
- **Scene and action recognition:** Understands objects, people, movements, environments, and temporal sequences across the video.
- **Flexible video input:** Accepts video uploads or public URLs, so it fits into existing workflows.
- **Fast and accurate:** Optimized for quick turnaround while maintaining high comprehension accuracy.
- **Production-ready API:** Ready-to-use REST endpoint with predictable, duration-based pricing and no cold starts.
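The page describes a REST endpoint that takes a video (upload or public URL) plus a question, but it does not document the endpoint URL, authentication, or request fields. The sketch below shows the general shape of such a call using only Python's standard library. The URL `https://api.example.com/v1/molmo2-video-qa`, the `video` and `question` field names, and bearer-token authentication are all hypothetical placeholders, not the provider's actual API.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL, auth scheme, and field names are not
# given on this page and must be taken from the provider's API documentation.
API_URL = "https://api.example.com/v1/molmo2-video-qa"


def build_request(video_url: str, question: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking a question about a video."""
    payload = {
        "video": video_url,    # hypothetical field: public URL of the video
        "question": question,  # hypothetical field: plain-English question
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed bearer-token auth
        },
        method="POST",
    )


req = build_request("https://example.com/clip.mp4", "What is the person doing?", "MY_KEY")
```

To actually send it, pass `req` to `urllib.request.urlopen` and decode the JSON body; the response schema is likewise not documented on this page.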
Billing is per 5 seconds of video ($0.005 per 5 seconds), with a 5-second minimum and a 120-second maximum video length.
| Video Duration | Cost |
|---|---|
| Up to 5s | $0.005 |
| 10s | $0.01 |
| 30s | $0.03 |
| 60s | $0.06 |
| 120s (max) | $0.12 |
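Every row in the table follows one rule: $0.005 per 5-second block, which works out to $0.001 per second of video. The following is a minimal cost estimator, assuming that a partial 5-second block is rounded up to a full block (the page does not say how partial blocks are billed) and that videos longer than 120 s are rejected.

```python
import math
from decimal import Decimal

PRICE_PER_BLOCK = Decimal("0.005")  # USD per 5-second block, from the table
BLOCK_SECONDS = 5                   # billing granularity
MAX_SECONDS = 120                   # "120s (max)" row in the table


def estimate_cost(duration_s: float) -> Decimal:
    """Estimate the USD cost of one run for a video of `duration_s` seconds.

    Assumption: a partial 5-second block is billed as a full block.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s > MAX_SECONDS:
        raise ValueError(f"{duration_s}s exceeds the {MAX_SECONDS}s maximum")
    # Round up to whole blocks, never billing fewer than one (5 s minimum).
    blocks = max(1, math.ceil(duration_s / BLOCK_SECONDS))
    return blocks * PRICE_PER_BLOCK


runs_per_dollar = int(Decimal("1") / estimate_cost(5))  # 200
```

`Decimal` keeps currency amounts exact instead of accumulating floating-point error, and the last line reproduces the "about 200 runs per $1" figure for clips of 5 seconds or less.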