vision-language

Moondream3 Point

wavespeed-ai/moondream3-preview/point

Moondream3 Point locates objects in images from a short text query. The current endpoint returns a text description of the object rather than coordinate points (see the README notes). Ready-to-use REST inference API with no cold starts and affordable pricing.




{ "answer": "The woman is wearing a pink baseball cap with a strap across her forehead. She is also wearing large silver hoop earrings and a pink fuzzy sweater. Her blonde hair is styled in loose waves, and she has her tongue sticking out slightly while looking directly at the camera. Behind her, there are several posters visible, including one with a pink background and an image of a cup." }

Your request will cost $0.001 per run.

For $1 you can run this model approximately 1000 times.


README

Moondream 3 — Point (Locate & Describe)

Moondream 3 Point is a vision-language model designed to identify and describe specific objects within an image using natural language. Instead of returning coordinates, it provides a concise textual description of the detected object, making it ideal for lightweight interactive queries and content understanding.

✨ Key Features

  • Locate and Describe Objects: Enter a short text query (e.g., “hat”, “watch”, “phone”) and receive a natural-language description of that item in context.

  • Fast Single-Object Queries: Optimized for low-latency inference, which suits real-time applications.

  • Readable Natural Output: The model outputs a fluent English sentence describing the object’s appearance, position, and context.

  • Multilingual Understanding: Capable of recognizing and describing objects in a wide range of visual scenarios.

⚙️ Example Usage

Locate & Describe “Hat”

{
  "image": "https://example.com/photo.jpg",
  "prompt": "hat"
}

Example Response

{
  "answer": "The woman is wearing a pink baseball cap with a strap across her forehead. She is also wearing large silver hoop earrings and a pink fuzzy sweater."
}
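
The request and response shapes above can be handled with a few lines of Python. This is a minimal sketch, not the official client: it only builds the JSON body from the two documented fields (`image`, `prompt`) and reads the `answer` field back out. The helper names `build_request` and `extract_answer` are hypothetical, and the HTTP call itself (endpoint URL, authentication header) is left out because this page does not specify it.

```python
import json


def build_request(image_url: str, prompt: str) -> str:
    """Serialize the two documented request fields into a JSON body."""
    cleaned = prompt.strip()
    if not cleaned:
        raise ValueError("prompt must be a non-empty object name")
    return json.dumps({"image": image_url, "prompt": cleaned})


def extract_answer(response_body: str) -> str:
    """Return the 'answer' text from a response shaped like the example above."""
    data = json.loads(response_body)
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("response has no string 'answer' field")
    return data["answer"]


if __name__ == "__main__":
    # Build the same request as the "hat" example.
    print(build_request("https://example.com/photo.jpg", "hat"))
    # Parse a response of the documented shape (sample text, not real output).
    print(extract_answer('{"answer": "The woman is wearing a pink baseball cap."}'))
```

Validating the `answer` field explicitly keeps a malformed or error response from being passed on silently as a description.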

💡 Best Practices

  • Use concise object names (e.g., “hat”, “car”, “tree”) for more accurate detection.

  • For precise bounding boxes or coordinates, use:

    • Moondream 3 Detect — returns x_min, y_min, x_max, y_max bounding boxes.
    • A coordinate-enabled version of Moondream 3 Point (coming soon).
  • Supported formats: JPEG, PNG, WebP

  • Maximum image size: 10 MB
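
The format and size limits above can be checked on the client before a request is sent. The sketch below is illustrative only: `check_image` is a hypothetical helper, the format check looks at the file extension rather than the file contents, and "10 MB" is assumed to mean 10 × 1024 × 1024 bytes, which this page does not confirm.

```python
import os

# Extensions corresponding to the listed formats (JPEG, PNG, WebP).
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
# Assumption: "10 MB" is interpreted as 10 MiB.
MAX_BYTES = 10 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes (limit {MAX_BYTES})")
    return problems


if __name__ == "__main__":
    print(check_image("photo.PNG", 2_000_000))
    print(check_image("clip.gif", 12 * 1024 * 1024))
```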

💰 Pricing

  • $0.001 per request
  • Volume and enterprise pricing available upon request.
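
For budgeting, the flat per-request price makes cost estimation a single multiplication. The sketch below uses `Decimal` to avoid floating-point rounding; the function names are hypothetical, and it ignores any volume or enterprise discounts, which are not specified here.

```python
from decimal import Decimal

# Listed price: $0.001 per request.
PRICE_PER_REQUEST = Decimal("0.001")


def estimate_cost(num_requests: int) -> Decimal:
    """Total cost in USD for a given number of requests at the listed price."""
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    return PRICE_PER_REQUEST * num_requests


def requests_for_budget(budget_usd: str) -> int:
    """Whole number of requests that fit within a budget given as a decimal string."""
    return int(Decimal(budget_usd) / PRICE_PER_REQUEST)


if __name__ == "__main__":
    print(estimate_cost(25_000))
    print(requests_for_budget("1"))
```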

📝 Notes

  • The current endpoint returns descriptive text in JSON format, {"answer": "..."}, and does not output coordinates.

  • For small or occluded objects, use higher-resolution input or switch to the Detect model for better spatial precision.