Molmo2 Video Understanding
Molmo2 Video Understanding analyzes video content across several tasks: summarization, scene description, object counting, detailed analysis, and open-ended Q&A. It is suited to video cataloging, content moderation, and automated video workflows.
Why It Works Great
- Multiple task modes: Summary, analysis, counting, scene description, and general Q&A.
- Custom instructions: Add specific focus areas or questions.
- Extended video support: Analyze videos up to 2 minutes long.
- Structured output: Get organized, task-specific results.
- Affordable: Starting at just $0.005 per video.
- Versatile analysis: From quick summaries to detailed breakdowns.
Parameters
| Parameter | Required | Description |
|---|---|---|
| video | Yes | Video to analyze (upload or public URL). |
| task | No | Analysis type: general, summary, analysis, counting, or scene_description. Default: general. |
| text | No | Additional instructions or focus areas for the analysis. |
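
As a rough illustration of the parameters above, the following Python sketch assembles and validates a request payload. The helper name `build_payload` and the use of a plain dictionary are assumptions made for illustration; the actual request format and client library are not specified here.

```python
from typing import Optional

# Task values listed in the parameter table.
ALLOWED_TASKS = ("general", "summary", "analysis", "counting", "scene_description")


def build_payload(video: str, task: str = "general", text: Optional[str] = None) -> dict:
    """Assemble the three documented parameters into a dictionary.

    Hypothetical helper: the real request shape may differ.
    """
    if not video:
        raise ValueError("'video' is required (upload reference or public URL)")
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {ALLOWED_TASKS}")

    payload = {"video": video, "task": task}
    # 'text' is optional, so include it only when instructions were given.
    if text:
        payload["text"] = text
    return payload


if __name__ == "__main__":
    print(build_payload(
        "https://example.com/clip.mp4",  # placeholder URL
        task="counting",
        text="Count how many cars appear in this video.",
    ))
```

Validating `task` locally catches typos such as `scene-description` before a request is sent.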
How to Use
- Upload your video — drag and drop or paste a public URL.
- Select task — choose the type of analysis you need.
- Add instructions (optional) — specify focus areas or custom questions.
- Run — click the button to analyze.
- Review results — get structured analysis output.
Pricing
Billing is in 5-second increments at $0.005 per increment, with a minimum charge of $0.005 for videos of 5 seconds or less. The maximum billable duration is 120 seconds.
| Duration | Cost |
|---|---|
| ≤5 seconds | $0.005 |
| 10 seconds | $0.01 |
| 30 seconds | $0.03 |
| 60 seconds | $0.06 |
| 120 seconds (max) | $0.12 |
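
The table above works out to $0.005 per 5-second block. The Python sketch below estimates a cost on that basis. It assumes that durations falling between multiples of 5 seconds are rounded up to the next block, which the pricing description does not state explicitly; the function name `estimate_cost` is hypothetical.

```python
import math
from decimal import Decimal

PRICE_PER_BLOCK = Decimal("0.005")  # USD per 5-second block, from the pricing table
BLOCK_SECONDS = 5
MAX_BILLABLE_SECONDS = 120


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimate the charge in USD for a video of the given duration.

    Assumption: partial blocks are rounded up to a full 5-second block.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    # Billing is capped at the maximum billable duration.
    billable = min(duration_seconds, MAX_BILLABLE_SECONDS)
    # Videos of 5 seconds or less are charged for one block.
    blocks = max(1, math.ceil(billable / BLOCK_SECONDS))
    return PRICE_PER_BLOCK * blocks


if __name__ == "__main__":
    for seconds in (5, 10, 30, 60, 120):
        print(f"{seconds:>3} s -> ${estimate_cost(seconds)}")
```

`Decimal` is used instead of `float` so that amounts such as $0.005 are represented exactly.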
Task Modes
| Task | Description | Best For |
|---|---|---|
| general | Open-ended video understanding and Q&A | Custom questions, flexible analysis |
| summary | Concise overview of video content | Quick content review, cataloging |
| analysis | Detailed breakdown of video elements | In-depth understanding, reports |
| counting | Count objects, people, or events | Inventory, crowd analysis, metrics |
| scene_description | Describe scenes and visual elements | Accessibility, content tagging |
Best Use Cases
- Video Cataloging — Automatically generate descriptions for video libraries.
- Content Moderation — Analyze video content for review workflows.
- Accessibility — Create text descriptions for visually impaired users.
- Search & Discovery — Generate metadata for video search systems.
- Analytics — Count objects, people, or events in footage.
- Summarization — Create quick summaries for long-form content.
Example Instructions
- "Focus on the people in the video and describe their actions."
- "Count how many cars appear in this video."
- "Summarize the main events in chronological order."
- "Describe the setting and atmosphere of each scene."
- "Identify any text or logos visible in the video."
- "What products are being demonstrated?"
Pro Tips for Best Results
- Choose the appropriate task mode for your specific need.
- Use the "text" parameter to focus the analysis on specific elements.
- Combine the "general" task with custom questions for flexible Q&A.
- Use "counting" for metrics like people, objects, or occurrences.
- "scene_description" works great for accessibility and content tagging.
- Trim videos to 2 minutes or less before submitting, since that is the maximum supported duration.
Notes
- Maximum supported video duration is 120 seconds (2 minutes).
- If using a URL, ensure it is publicly accessible.
- Processing time scales with video length.
- Different tasks produce different output formats optimized for their purpose.
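
The duration and URL notes above can be checked before submitting a video. The following Python sketch is one possible pre-flight check, with assumptions: the function name `preflight_problems` is hypothetical, the caller already knows the video duration, and the URL test only verifies that the address is an absolute http(s) URL. It does not confirm that the URL is actually reachable by the service.

```python
from typing import List
from urllib.parse import urlparse

MAX_DURATION_SECONDS = 120  # maximum supported duration from the notes


def preflight_problems(video_url: str, duration_seconds: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    parsed = urlparse(video_url)
    # A local path or file:// URL cannot be publicly accessible.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("video URL must be an absolute http(s) URL")

    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(
            f"video is {duration_seconds} s long; maximum supported is {MAX_DURATION_SECONDS} s"
        )
    return problems


if __name__ == "__main__":
    print(preflight_problems("https://example.com/clip.mp4", 90))  # placeholder URL
    print(preflight_problems("file:///tmp/clip.mp4", 150))
```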