
Your request will cost $0.001 per run.
For $1 you can run this model approximately 1000 times.
Moondream 3 Detect is a vision-language model that identifies and localizes objects in images. Given a natural-language prompt, it detects the named items and returns a bounding box for each one, which makes it well suited to visual search, annotation, and AI-assisted labeling.
Natural Language Object Queries: Describe what you want to detect, e.g., "person," "car," "dog," or "chair."
Accurate Bounding Boxes: Returns x_min, y_min, x_max, and y_max coordinates for each detected instance.
Multi-Object Detection: Supports multiple instances of the same category in one image.
Fast and Lightweight: Optimized for low-latency real-time or batch detection workflows.
Example inputs:
{
  "image": "https://example.com/photo.jpg",
  "prompt": "car"
}
{
  "image": "https://example.com/photo.jpg",
  "prompt": "person"
}
{
  "image": "https://example.com/photo.jpg",
  "prompt": "bicycle"
}
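Each example input carries one category in its prompt. The page does not say whether several categories can share a single prompt, so a cautious approach is to build one request body per category. The following is a minimal Python sketch of that idea; the helper name `build_detect_payloads` is hypothetical, and only the `image` and `prompt` fields come from the examples here.

```python
import json


def build_detect_payloads(image_url: str, prompts: list[str]) -> list[str]:
    """Return one JSON request body per prompt for the same image.

    Hypothetical helper: only the "image" and "prompt" field names are
    taken from the example inputs; everything else is an assumption.
    """
    payloads = []
    for prompt in prompts:
        prompt = prompt.strip()
        if not prompt:
            # An empty prompt names nothing to detect, so reject it early.
            raise ValueError("prompt must not be empty")
        payloads.append(json.dumps({"image": image_url, "prompt": prompt}))
    return payloads


if __name__ == "__main__":
    for body in build_detect_payloads(
        "https://example.com/photo.jpg", ["car", "person", "bicycle"]
    ):
        print(body)
```

How the body is sent (endpoint, authentication, headers) depends on the hosting platform and is deliberately left out of this sketch.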
Bounding boxes are returned in normalized coordinates (range 0–1):
{
  "objects": [
    {
      "x_min": 0.1556,
      "x_max": 0.6881,
      "y_min": 0.2616,
      "y_max": 0.9551
    }
  ]
}
where
(x_min, y_min) = top-left corner
(x_max, y_max) = bottom-right corner
If multiple objects are detected, all boxes appear in the "objects" array.
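Because the coordinates are normalized, they need to be scaled by the image size before drawing or cropping. The sketch below shows one way to do that in Python for every box in the "objects" array. The function name `to_pixel_boxes` is hypothetical, and rounding to the nearest pixel is an assumption; a given drawing library may call for floor or ceil instead.

```python
def to_pixel_boxes(
    response: dict, width: int, height: int
) -> list[tuple[int, int, int, int]]:
    """Convert normalized boxes to (left, top, right, bottom) pixel tuples.

    Assumes the response shape shown above: an "objects" list whose
    entries hold x_min, y_min, x_max, y_max in the range 0-1.
    """
    boxes = []
    for obj in response.get("objects", []):
        left = round(obj["x_min"] * width)     # x values scale with image width
        top = round(obj["y_min"] * height)     # y values scale with image height
        right = round(obj["x_max"] * width)
        bottom = round(obj["y_max"] * height)
        boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    sample = {
        "objects": [{"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}]
    }
    # For a 640x480 image this yields one (left, top, right, bottom) tuple.
    print(to_pixel_boxes(sample, width=640, height=480))
```

An empty "objects" array produces an empty list, so callers can loop over the result without a separate check for the no-detection case.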