
image-to-text
Your request will cost $0.001 per run.
With $1 you can run this model approximately 1,000 times.
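The per-run pricing above is simple arithmetic to budget against. A minimal sketch, assuming the listed flat price of $0.001 per run:

```python
# Budget arithmetic for a flat per-run price.
# PRICE_PER_RUN_USD mirrors the $0.001 figure quoted above.
PRICE_PER_RUN_USD = 0.001

def runs_for_budget(budget_usd: float) -> int:
    """How many runs a given budget covers at the flat per-run price."""
    # round() guards against float artifacts like 999.999999...
    return round(budget_usd / PRICE_PER_RUN_USD)

print(runs_for_budget(1.0))  # 1000 runs for $1
```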
One more thing:
Moondream 3 Point is a vision-language model designed to identify and describe specific objects within an image using natural language. Instead of returning coordinates, it provides a concise textual description of the detected object, making it ideal for lightweight interactive queries and content understanding.
Locate and Describe Objects: Enter a short text query (e.g., “hat”, “watch”, “phone”) and receive a natural-language description of that item in context.
Fast Single-Object Queries: Optimized for fast, low-latency inference, making it well suited to real-time applications.
Readable Natural Output: The model outputs a fluent English sentence describing the object’s appearance, position, and context.
Multilingual Understanding: Capable of recognizing and describing objects in a wide range of visual scenarios.
Example input:

{
  "image": "https://example.com/photo.jpg",
  "prompt": "hat"
}
Example output:

{
  "answer": "The woman is wearing a pink baseball cap with a strap across her forehead. She is also wearing large silver hoop earrings and a pink fuzzy sweater."
}
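Putting the input and output schemas together, here is a minimal client sketch using only Python's standard library. The endpoint URL is a placeholder; the real invocation URL and any auth headers depend on your provider:

```python
import json
import urllib.request

API_URL = "https://api.example.com/moondream-3-point"  # placeholder URL

def build_request(image_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request with the JSON body shown in the example input."""
    body = json.dumps({"image": image_url, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_answer(raw_response: str) -> str:
    """The endpoint returns {"answer": "..."}; pull out the text."""
    return json.loads(raw_response)["answer"]

# Uncomment to call the live endpoint:
# req = build_request("https://example.com/photo.jpg", "hat")
# with urllib.request.urlopen(req) as resp:
#     print(extract_answer(resp.read().decode("utf-8")))
```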
Use concise object names (e.g., “hat”, “car”, “tree”) for more accurate detection.
For precise bounding boxes or coordinates, use the Detect model, which returns x_min, y_min, x_max, y_max bounding boxes.
Supported formats: JPEG, PNG, WebP
Maximum image size: 10 MB
The current endpoint returns descriptive text in JSON format ({"answer": "..."}); it does not output coordinates.
For small or occluded objects, use higher-resolution input or switch to the Detect model for better spatial precision.
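Given the format and size limits above, a client-side pre-flight check can reject bad uploads before spending a run. A minimal sketch; the helper name and the byte interpretation of "10 MB" are assumptions, not part of the API:

```python
import os

# Documented limits: JPEG/PNG/WebP input, at most 10 MB.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_BYTES = 10 * 1024 * 1024  # assuming binary megabytes

def validate_image(path: str) -> None:
    """Raise ValueError if the file breaks a documented limit."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or 'none'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("image exceeds the 10 MB limit")
```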