
vision-language
Pricing: $0.005 per run, or approximately 200 runs per $1.
Moondream 3 Caption is a high-performance vision-language model that automatically generates clear, descriptive, and context-aware captions for any image. It supports multiple caption lengths, enabling flexible use across social media content, dataset annotation, and creative storytelling.
- Flexible caption length: choose from short, normal, or long captions to fit your workflow.
- Accurate visual understanding: trained on large-scale, diverse visual datasets, it detects objects, actions, and environments with high accuracy.
- Fast and efficient: optimized for low-latency inference, suitable for both real-time applications and batch processing.
- Human-like language output: produces smooth, natural, grammatically correct sentences ready for direct use in production.
Short caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "short"
}

Normal caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "normal"
}

Long caption request:

{
  "image": "https://example.com/photo.jpg",
  "length": "long"
}
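The three request bodies above differ only in the length field. A minimal Python sketch for building and sending such a request, using only the standard library. Note the endpoint URL and Bearer-token header are placeholder assumptions, not documented values; only the request body shape ({"image": ..., "length": ...}) and the response shape ({"caption": ...}) come from the examples on this page.

```python
import json
import urllib.request

VALID_LENGTHS = {"short", "normal", "long"}

def build_caption_request(image_url: str, length: str = "normal") -> dict:
    """Build the JSON body shown in the request examples above."""
    if length not in VALID_LENGTHS:
        raise ValueError(f"length must be one of {sorted(VALID_LENGTHS)}")
    return {"image": image_url, "length": length}

def caption_image(
    image_url: str,
    length: str = "normal",
    # Hypothetical endpoint and auth scheme -- substitute the real values
    # from your provider's dashboard.
    endpoint: str = "https://api.example.com/moondream3/caption",
    api_key: str = "YOUR_API_KEY",
) -> str:
    body = json.dumps(build_caption_request(image_url, length)).encode()
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
        },
    )
    with urllib.request.urlopen(req) as resp:
        # Response is expected to look like {"caption": "..."}
        return json.loads(resp.read())["caption"]
```

At $0.005 per run, batching requests and caching captions for repeated images keeps costs predictable.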
Example output:

{
  "caption": "A young woman with long, dark hair stands in front of a bar. She wears a leopard print halter top and blue jeans, accessorized with large hoop earrings. The bar features a purple backlit counter and a lit sign displaying 'DAMON' in yellow letters."
}
Caption verbosity is controlled by the length parameter, which accepts short, normal, or long.