
Moondream 3 Point is a vision-language model designed to identify and describe specific objects within an image using natural language. Instead of returning coordinates, it provides a concise textual description of the detected object, making it ideal for lightweight interactive queries and content understanding.
Locate and Describe Objects: Enter a short text query (e.g., “hat”, “watch”, “phone”) and receive a natural-language description of that item in context.
Fast Single-Object Queries: Optimized for fast, low-latency inference, making it well suited to real-time applications.
Readable Natural Output: The model returns a fluent English sentence describing the object’s appearance, position, and context.
Multilingual Understanding: Capable of recognizing and describing objects across a wide range of visual scenarios and languages.
Example input:

{
  "image": "https://example.com/photo.jpg",
  "prompt": "hat"
}
Example output:

{
  "answer": "The woman is wearing a pink baseball cap with a strap across her forehead. She is also wearing large silver hoop earrings and a pink fuzzy sweater."
}
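
A minimal client sketch in Python, assuming a plain HTTPS endpoint with bearer-token authentication; the ENDPOINT URL, the API key handling, and the describe_object helper are illustrative placeholders, not the platform's documented API:

import requests

# Hypothetical values; substitute the endpoint and key from your
# provider's dashboard.
ENDPOINT = "https://api.example.com/v1/moondream-3-point"
API_KEY = "YOUR_API_KEY"

def describe_object(image_url: str, prompt: str) -> str:
    """Send a single-object query and return the natural-language answer."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        # The request body follows the documented schema: an image URL
        # plus a short object prompt.
        json={"image": image_url, "prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    # The documented response shape is {"answer": "..."}.
    return resp.json()["answer"]

print(describe_object("https://example.com/photo.jpg", "hat"))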
Use concise object names (e.g., “hat”, “car”, “tree”) for more accurate detection.
For precise bounding boxes or coordinates (x_min, y_min, x_max, y_max), switch to the Detect model.
Supported image formats: JPEG, PNG, WebP
Maximum image size: 10 MB
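
If you upload local files, a small pre-flight check against these constraints can fail fast before any request is sent. The sketch below is illustrative: check_image is a hypothetical helper, and matching on file extension is a simplification of real format detection.

import os

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_BYTES = 10 * 1024 * 1024  # documented 10 MB limit

def check_image(path: str) -> None:
    """Raise ValueError if the file violates the documented constraints."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format {ext!r}; use JPEG, PNG, or WebP")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Image exceeds the 10 MB size limit")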
The current endpoint returns descriptive text in JSON format, {"answer": "..."}; it does not output coordinates.
For small or occluded objects, use higher-resolution input or switch to the Detect model for better spatial precision.
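
One way to take the higher-resolution route is a simple Pillow upsample before upload. This is a sketch and the upscale helper is hypothetical: Lanczos resampling adds no genuine detail, it can only help keep small objects above the model's effective input resolution, and the resulting file must still fit the 10 MB limit.

from PIL import Image

def upscale(path: str, factor: int = 2) -> Image.Image:
    """Upsample a local image so small objects occupy more pixels."""
    img = Image.open(path)
    return img.resize((img.width * factor, img.height * factor), Image.LANCZOS)

# Example: write a 2x PNG (a supported format) for the follow-up query.
upscale("photo.jpg").save("photo_2x.png")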