
vision-language
Pricing: $0.005 per run, or approximately 200 runs per $1.
Moondream 3 Caption is a high-performance vision-language model that automatically generates clear, descriptive, and context-aware captions for any image. It supports multiple caption lengths, enabling flexible use across social media content, dataset annotation, and creative storytelling.
- Flexible caption length: choose from short, normal, or long captions to fit your workflow.
- Accurate visual understanding: trained on large-scale, diverse visual datasets, it detects objects, actions, and environments with high accuracy.
- Fast and efficient: optimized for low-latency inference, suitable for both real-time applications and batch processing.
- Human-like language output: produces smooth, natural, grammatically correct sentences ready for direct use in production.
Short caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "short"
}

Normal caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "normal"
}

Long caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "long"
}
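The three request bodies above differ only in the length field. A minimal Python sketch for building and sending such a request, using only the standard library. Note the endpoint URL and Bearer-token header are placeholder assumptions, not documented values; only the request body shape ({"image": ..., "length": ...}) and the response shape ({"caption": ...}) come from the examples on this page.

```python
import json
import urllib.request

VALID_LENGTHS = {"short", "normal", "long"}

def build_caption_request(image_url: str, length: str = "normal") -> dict:
    """Build the JSON body shown in the request examples above."""
    if length not in VALID_LENGTHS:
        raise ValueError(f"length must be one of {sorted(VALID_LENGTHS)}")
    return {"image": image_url, "length": length}

def caption_image(
    image_url: str,
    length: str = "normal",
    # Hypothetical endpoint and auth scheme -- substitute the real values
    # from your provider's dashboard.
    endpoint: str = "https://api.example.com/moondream3/caption",
    api_key: str = "YOUR_API_KEY",
) -> str:
    body = json.dumps(build_caption_request(image_url, length)).encode()
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
        },
    )
    with urllib.request.urlopen(req) as resp:
        # Response is expected to look like {"caption": "..."}
        return json.loads(resp.read())["caption"]
```

At $0.005 per run, batching requests and caching captions for repeated images keeps costs predictable.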
Example output:

{
  "caption": "A young woman with long, dark hair stands in front of a bar. She wears a leopard print halter top and blue jeans, accessorized with large hoop earrings. The bar features a purple backlit counter and a lit sign displaying 'DAMON' in yellow letters."
}
Caption verbosity is controlled by the length parameter, which accepts short, normal, or long.