wavespeed-ai/uno

An AI model that transforms input images into new ones based on text prompts, blending reference visuals with your creative directions.

image-to-image


Preview image: https://d2g64w682n9w0w.cloudfront.net/media/images/1746023923015579464_pVuJNQX5.jpeg

Your request will cost $0.05 per run.

For $1 you can run this model approximately 20 times.
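
The arithmetic is straightforward; the short sketch below just turns the listed price into a budget estimate. The $0.05 per-run price comes from this page, while the budget values and helper name are illustrative.

```python
# Estimate how many UNO runs fit in a given budget.
# Prices are kept in integer cents to avoid floating-point rounding.
PRICE_PER_RUN_CENTS = 5  # $0.05 per run, as listed on this page

def runs_for_budget(budget_cents: int) -> int:
    """Whole number of runs a budget covers at the listed per-run price."""
    return budget_cents // PRICE_PER_RUN_CENTS

print(runs_for_budget(100))   # $1.00  -> 20 runs
print(runs_for_budget(1000))  # $10.00 -> 200 runs
```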

README

UNO, developed by the ByteDance team, is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. It achieves high subject consistency while preserving controllability in both single-subject and multi-subject driven generation.
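
Since the model is served here as a hosted image-to-image API, the snippet below sketches how a multi-image conditioned request might be assembled. It is a minimal illustration only: the endpoint path, header, and JSON field names (including the safety-checker toggle) are assumptions, so consult the WaveSpeedAI API reference for the actual schema.

```python
# Illustrative sketch of calling a hosted UNO endpoint over HTTP.
# The URL path and the JSON field names below are assumptions for
# illustration, not the documented WaveSpeedAI schema.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "prompt": "the plush toy and the ceramic mug on a beach at sunset",
    # Multiple reference images drive multi-subject generation.
    "images": [
        "https://example.com/reference/plush_toy.png",
        "https://example.com/reference/ceramic_mug.png",
    ],
    "enable_safety_checker": True,  # assumed name for the safety-checker toggle
}

resp = requests.post(
    "https://api.wavespeed.ai/api/v3/wavespeed-ai/uno",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # typically a task id or output image URL(s)
```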

Key Features

  • Progressive Cross-Modal Alignment: This technique progressively aligns textual and visual information, enhancing the model's multimodal understanding and improving consistency and accuracy in image generation.
  • Universal Rotary Position Embedding (URPE): Rotary position encoding keeps the positional representation consistent across varying image sizes and aspect ratios, improving image generation quality (a conceptual sketch of this positional-offset idea follows the list below).
  • High-Consistency Data Synthesis Pipeline: Leverages the contextual generation capability of diffusion transformers to construct highly consistent multi-subject paired datasets, providing a solid foundation for model training.
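
To make the position-embedding item above more concrete, the sketch below shows one way rotary position indices can be offset so that the target latent and each reference image occupy disjoint 2D coordinates. This is a conceptual illustration, not UNO's released code: the grid sizes, the diagonal offset rule, and the helper names are assumptions.

```python
# Conceptual sketch: give the target latent and each reference image
# non-overlapping 2D rotary position indices before applying standard RoPE.
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard 1-D rotary embedding: cos/sin features for each position."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]          # (tokens, dim/2)
    return torch.cat([angles.cos(), angles.sin()], dim=-1)  # (tokens, dim)

def grid_positions(h: int, w: int, row_off: int = 0, col_off: int = 0):
    """(row, col) index for every token of an h x w latent grid, plus an offset."""
    rows = torch.arange(h).repeat_interleave(w) + row_off
    cols = torch.arange(w).repeat(h) + col_off
    return rows, cols

H = W = 32                        # target latent grid (illustrative size)
ref_sizes = [(32, 32), (32, 32)]  # two reference images

# The target occupies the base grid; each reference is shifted past the
# previous extent so no two images reuse the same (row, col) coordinates.
all_rows, all_cols = [], []
r, c = grid_positions(H, W)
all_rows.append(r); all_cols.append(c)
row_off, col_off = H, W
for rh, rw in ref_sizes:
    r, c = grid_positions(rh, rw, row_off, col_off)
    all_rows.append(r); all_cols.append(c)
    row_off += rh; col_off += rw

rows, cols = torch.cat(all_rows), torch.cat(all_cols)
head_dim = 64
# Half of the head dimension encodes rows, the other half encodes columns.
pos_emb = torch.cat([rope_1d(rows, head_dim // 2),
                     rope_1d(cols, head_dim // 2)], dim=-1)
print(pos_emb.shape)  # (total_tokens, head_dim)
```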

Use Cases

  • Character Consistency Generation: Ensures consistent appearance of characters across different images—useful in domains such as comics, animations, or games.
  • Product Advertisement Generation: Maintains product visual consistency in various scenes or backgrounds, beneficial for e-commerce marketing content creation.
  • Multi-Subject Scene Generation: Enables the coherent generation of images containing multiple specific subjects, preserving distinct features of each entity.

Accelerated Inference

Our accelerated inference approach leverages advanced optimization technology from WavespeedAI. It significantly reduces computational overhead and latency, enabling rapid image generation without compromising quality. The system is designed to handle large-scale inference workloads efficiently, so real-time applications can strike a practical balance between speed and accuracy. For further details, please refer to the blog post.