
Moondream 3 Point is a vision-language model designed to identify and describe specific objects within an image using natural language. Instead of returning coordinates, it provides a concise textual description of the detected object, making it ideal for lightweight interactive queries and content understanding.
Locate and Describe Objects: Enter a short text query (e.g., “hat”, “watch”, “phone”) and receive a natural-language description of that item in context.
Fast Single-Object Queries: Optimized for fast, low-latency inference, making it well suited to real-time applications.
Readable Natural Output: The model returns a fluent English sentence describing the object’s appearance, position, and context.
Multilingual Understanding: Capable of recognizing and describing objects across a wide range of visual scenarios and languages.
Example input:

{
  "image": "https://example.com/photo.jpg",
  "prompt": "hat"
}
Example output:

{
  "answer": "The woman is wearing a pink baseball cap with a strap across her forehead. She is also wearing large silver hoop earrings and a pink fuzzy sweater."
}
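
A minimal client sketch in Python, assuming a plain HTTPS endpoint with bearer-token authentication; the ENDPOINT URL, the API key handling, and the describe_object helper are illustrative placeholders, not the platform's documented API:

import requests

# Hypothetical values; substitute the endpoint and key from your
# provider's dashboard.
ENDPOINT = "https://api.example.com/v1/moondream-3-point"
API_KEY = "YOUR_API_KEY"

def describe_object(image_url: str, prompt: str) -> str:
    """Send a single-object query and return the natural-language answer."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        # The request body follows the documented schema: an image URL
        # plus a short object prompt.
        json={"image": image_url, "prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    # The documented response shape is {"answer": "..."}.
    return resp.json()["answer"]

print(describe_object("https://example.com/photo.jpg", "hat"))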
Use concise object names (e.g., “hat”, “car”, “tree”) for more accurate detection.
For precise bounding boxes or coordinates (x_min, y_min, x_max, y_max), switch to the Detect model.
Supported image formats: JPEG, PNG, WebP
Maximum image size: 10 MB
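
If you upload local files, a small pre-flight check against these constraints can fail fast before any request is sent. The sketch below is illustrative: check_image is a hypothetical helper, and matching on file extension is a simplification of real format detection.

import os

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_BYTES = 10 * 1024 * 1024  # documented 10 MB limit

def check_image(path: str) -> None:
    """Raise ValueError if the file violates the documented constraints."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format {ext!r}; use JPEG, PNG, or WebP")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Image exceeds the 10 MB size limit")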
The current endpoint returns descriptive text in JSON format, {"answer": "..."}; it does not output coordinates.
For small or occluded objects, use higher-resolution input or switch to the Detect model for better spatial precision.
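
One way to take the higher-resolution route is a simple Pillow upsample before upload. This is a sketch and the upscale helper is hypothetical: Lanczos resampling adds no genuine detail, it can only help keep small objects above the model's effective input resolution, and the resulting file must still fit the 10 MB limit.

from PIL import Image

def upscale(path: str, factor: int = 2) -> Image.Image:
    """Upsample a local image so small objects occupy more pixels."""
    img = Image.open(path)
    return img.resize((img.width * factor, img.height * factor), Image.LANCZOS)

# Example: write a 2x PNG (a supported format) for the follow-up query.
upscale("photo.jpg").save("photo_2x.png")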