
{ "answer": "The image shows a woman dressed in a princess costume, wearing a tiara and a necklace. She is standing in front of a building with cherry blossoms in the background. The woman is posing for the picture, looking directly at the camera. The scene evokes a sense of royalty and elegance, with the woman's attire and accessories suggesting a fairytale or fantasy setting." }
Each run costs $0.005, so $1 covers approximately 200 runs.
Moondream 3 Query is a vision-language model that answers natural-language questions about images. It combines fast inference, accurate scene understanding, and an optional reasoning mode that explains its answers, which makes it a fit for analysis, education, and creative applications.
Visual Q&A: Ask questions about any image (people, objects, actions, or scenes) and receive natural-language answers.
Chain-of-Thought Reasoning: Enable reasoning mode to have the model explain how it reached its conclusion, which is useful for analysis and debugging.
Accurate Visual Understanding: Trained on diverse, high-quality image-text datasets for reliable recognition of complex visual contexts.
Fast and Lightweight: Optimized for low latency and efficient inference while maintaining strong reasoning performance.
{
  "image": "https://example.com/photo.jpg",
  "prompt": "What is the person in the image doing?"
}

{
  "image": "https://example.com/photo.jpg",
  "prompt": "What emotions are visible in this scene?",
  "reasoning": true
}
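
The two request bodies above differ only in the optional "reasoning" field, and the sample output on this page is a JSON object with a single "answer" field. The sketch below shows one way to build such a body and read such a response in Python. The helper names `build_query_payload` and `extract_answer` are hypothetical and not part of any official client. The endpoint URL and authentication scheme are not given here, so the sketch leaves out the HTTP call itself. It also assumes that omitting "reasoning" behaves the same as setting it to false.

```python
import json
from typing import Any, Dict


def build_query_payload(image_url: str, prompt: str, reasoning: bool = False) -> Dict[str, Any]:
    """Build a request body using the fields shown in the examples above.

    Hypothetical helper: only "image", "prompt", and "reasoning" are taken
    from the page; nothing else about the API is assumed.
    """
    if not image_url or not prompt:
        raise ValueError("both image_url and prompt are required")
    payload: Dict[str, Any] = {"image": image_url, "prompt": prompt}
    # Include "reasoning" only when enabled, mirroring the first example,
    # which omits the field entirely (assumed equivalent to false).
    if reasoning:
        payload["reasoning"] = True
    return payload


def extract_answer(response_text: str) -> str:
    """Return the "answer" field from a JSON response body."""
    data = json.loads(response_text)
    return data["answer"]


basic = build_query_payload(
    "https://example.com/photo.jpg",
    "What is the person in the image doing?",
)
with_reasoning = build_query_payload(
    "https://example.com/photo.jpg",
    "What emotions are visible in this scene?",
    reasoning=True,
)

# Serialize the bodies as they would be sent with whatever HTTP client you use.
print(json.dumps(basic, indent=2))
print(json.dumps(with_reasoning, indent=2))
```

Keeping payload construction separate from the transport makes it easy to validate inputs before paying for a run, and to swap in the real endpoint and credentials once you have them.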