vision-language

Moondream3 Caption | Image Captioning API With Short, Normal, And Long Captions | WaveSpeedAI

wavespeed-ai/moondream3-preview/caption

Generate short, normal, or long captions that describe the visual content of an image. Available as a ready-to-use REST inference API with fast performance, no cold starts, and affordable pricing.

Your request will cost $0.005 per run.

For $1 you can run this model approximately 200 times.


Moondream 3: Image Captioning

Moondream 3 Caption is a high-performance vision-language model that generates clear, descriptive, and context-aware captions for input images. It offers three caption lengths (short, normal, and long), which makes it suitable for social media content, dataset annotation, and creative storytelling.

✨ Key Features

  • Flexible Caption Length: Choose from short, normal, or long captions to fit your workflow.

  • Accurate Visual Understanding: Trained on large-scale, diverse visual datasets, the model accurately detects objects, actions, and environments.

  • Fast and Efficient: Optimized for low-latency inference, suitable for both real-time applications and batch processing.

  • Human-like Language Output: Produces smooth, natural, grammatically correct sentences suitable for direct use in production.

⚙️ Example Usage

🔹 Short Caption

{
  "image": "https://example.com/photo.jpg",
  "length": "short"
}

🔹 Normal Caption

{
  "image": "https://example.com/photo.jpg",
  "length": "normal"
}

🔹 Long Caption

{
  "image": "https://example.com/photo.jpg",
  "length": "long"
}
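The three request bodies above differ only in the value of `length`. The Python sketch below builds such a body and prepares (without sending) the HTTP call. The endpoint URL, the Bearer-token `Authorization` header, and the helper names `build_caption_payload` and `build_http_request` are assumptions for illustration and are not confirmed by this page; check the WaveSpeedAI API reference before relying on them.

```python
import json
import urllib.request

# ASSUMPTION: endpoint path modeled on the model ID; verify against the API reference.
API_URL = "https://api.wavespeed.ai/api/v3/wavespeed-ai/moondream3-preview/caption"

# Caption lengths documented for this model.
ALLOWED_LENGTHS = ("short", "normal", "long")


def build_caption_payload(image_url: str, length: str = "normal") -> dict:
    """Return a request body with the `image` and `length` fields shown above."""
    if length not in ALLOWED_LENGTHS:
        raise ValueError(f"length must be one of {ALLOWED_LENGTHS}, got {length!r}")
    return {"image": image_url, "length": length}


def build_http_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request.

    ASSUMPTION: the API authenticates with a Bearer token.
    """
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = build_caption_payload("https://example.com/photo.jpg", "short")
    req = build_http_request("YOUR_API_KEY", payload)
    # Sending is left commented out so the sketch runs offline:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
    print(req.get_method(), req.full_url)
    print(payload)
```

Validating `length` on the client side catches typos such as "medium" before a paid request is made.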

🧾 Example Output

{
  "caption": "A young woman with long, dark hair stands in front of a bar. She wears a leopard print halter top and blue jeans, accessorized with large hoop earrings. The bar features a purple backlit counter and a lit sign displaying 'DAMON' in yellow letters."
}

Output Explanation

  • The model returns a JSON object with a single key: caption.
  • The value is a natural-language description automatically generated from the input image.
  • The style and length of the caption depend on the value of the length parameter (short, normal, or long).
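Because the response is a JSON object with a single `caption` key, extracting the text takes only a few lines. This Python sketch assumes the response body is exactly the object shown in the example output; if the hosted API wraps it in another envelope, the lookup would need adjusting. `extract_caption` is a hypothetical helper name.

```python
import json


def extract_caption(body: str) -> str:
    """Parse a response body and return the value of its `caption` key."""
    obj = json.loads(body)
    caption = obj.get("caption") if isinstance(obj, dict) else None
    if not isinstance(caption, str):
        raise ValueError("response has no string 'caption' field")
    return caption


if __name__ == "__main__":
    sample = '{"caption": "A young woman with long, dark hair stands in front of a bar."}'
    text = extract_caption(sample)
    print(text)
    print(len(text.split()), "words")
```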

💡 Best Practices

  • Use “short” for quick summaries or thumbnail text.
  • Use “normal” for descriptive captions (recommended default).
  • Use “long” for storytelling, research annotations, or dataset labeling.
  • Supported formats: JPEG, PNG, WebP
  • Maximum image size: 10 MB
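The format and size limits above can be checked on the client before spending a request. This Python sketch identifies JPEG, PNG, and WebP by their standard file signatures and rejects files larger than 10 MB. Reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, as are the helper names `detect_format` and `check_image`.

```python
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # ASSUMPTION: "10 MB" interpreted as 10 MiB


def detect_format(data: bytes) -> Optional[str]:
    """Return 'jpeg', 'png', or 'webp' based on the file signature, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # WebP is a RIFF container with "WEBP" at bytes 8 to 11.
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def check_image(data: bytes) -> str:
    """Validate format and size; return the detected format or raise ValueError."""
    fmt = detect_format(data)
    if fmt is None:
        raise ValueError("unsupported format (expected JPEG, PNG, or WebP)")
    if len(data) > MAX_BYTES:
        raise ValueError(f"image is {len(data)} bytes; limit is {MAX_BYTES}")
    return fmt


if __name__ == "__main__":
    # Usage with a local file (the path is a placeholder):
    # with open("photo.jpg", "rb") as f:
    #     print(check_image(f.read()))
    print(check_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
```

Checking magic bytes rather than file extensions avoids trusting a misnamed file.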

💰 Pricing

  • $0.005 per request
  • Contact WaveSpeedAI for enterprise and large-scale pricing options.
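At a flat $0.005 per request, cost scales linearly with the number of calls, which is where the figure of roughly 200 runs per $1 comes from. The small Python sketch below uses `Decimal` to avoid floating-point rounding in the arithmetic. It covers list pricing only, not enterprise rates, and the helper names are illustrative.

```python
from decimal import Decimal

PRICE_PER_RUN = Decimal("0.005")  # USD per request, list price from this page


def estimate_cost(runs: int) -> Decimal:
    """Total list-price cost in USD for a given number of runs."""
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return PRICE_PER_RUN * runs


def runs_for_budget(budget_usd: Decimal) -> int:
    """Maximum whole number of runs a budget covers at list price."""
    if budget_usd < 0:
        raise ValueError("budget must be non-negative")
    return int(budget_usd // PRICE_PER_RUN)


if __name__ == "__main__":
    print(estimate_cost(10_000))          # 50.000
    print(runs_for_budget(Decimal("1")))  # 200
```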