
image-to-image

wavespeed-ai/uno

An AI model that transforms input images into new ones based on text prompts, blending reference visuals with your creative directions.



Your request will cost $0.05 per run.

For $1 you can run this model approximately 20 times.
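
The pricing above is simple division, and it can be useful to check a budget before queuing a batch. The sketch below assumes the flat $0.05 per-run price stated above; the function names are illustrative and not part of any WaveSpeed API.

```python
from decimal import Decimal

# Listed price per run, taken from the pricing note above.
PRICE_PER_RUN = Decimal("0.05")


def runs_for_budget(budget_usd: str) -> int:
    """Return how many whole runs a budget covers at the listed price."""
    return int(Decimal(budget_usd) // PRICE_PER_RUN)


def cost_for_runs(runs: int) -> Decimal:
    """Return the total cost in USD for a given number of runs."""
    return PRICE_PER_RUN * runs


print(runs_for_budget("1"))  # 20
print(cost_for_runs(50))     # 2.50
```

Decimal is used instead of float so that amounts such as 0.05 are represented exactly.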


Examples

Santa Claus is standing in front of the Christmas tree.
A woman wears the dress and holds a bag, in the flowers.
Give the alarm clock a small sun umbrella.
A doll is embedded in the crystal ball.
The girl is holding a cabbage doll.
A boy riding a horse.
The girl has the print from the image below on her.
The girl puts on a dress.
A girl wearing a petal skirt in the grass.

README

UNO – Universal In‑Context Diffusion Transformer 📸

A subject-driven image synthesis model developed by ByteDance Research. It supports both single-subject and multi-subject image generation with high consistency and controllability, using diffusion transformers.

Implementation ✨

This model leverages a two-stage progressive cross‑modal alignment strategy, combined with Universal Rotary Position Embedding (UnoPE):

  1. Stage I: Fine-tune a pretrained T2I (text-to-image) model using generated single-subject in-context data.
  2. Stage II: Further train on multi-subject paired data to support scenes with multiple specified subjects.

Highlights:

  • Built on Diffusion Transformers (DiT) with FLUX.1-dev backbone
  • UnoPE maintains subject identity and reduces confusion across multiple subjects
  • Input: 1–4 reference images + text prompt
  • Output: synthesized image reflecting consistent subject(s) in context
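
The input contract in the list above (1–4 reference images plus a text prompt) can be checked on the client before a request is sent. This is a minimal sketch under stated assumptions: the `UnoRequest` class and the `prompt` / `images` payload keys are hypothetical names chosen for illustration, not the documented request schema.

```python
from dataclasses import dataclass, field
from typing import List

# Limits taken from the "Input: 1-4 reference images + text prompt" highlight.
MIN_REFERENCES = 1
MAX_REFERENCES = 4


@dataclass
class UnoRequest:
    """Hypothetical container for one generation request."""

    prompt: str
    reference_images: List[str] = field(default_factory=list)  # URLs or paths

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        count = len(self.reference_images)
        if not MIN_REFERENCES <= count <= MAX_REFERENCES:
            raise ValueError(
                f"expected {MIN_REFERENCES}-{MAX_REFERENCES} reference images, got {count}"
            )

    def to_payload(self) -> dict:
        # Key names here are assumptions; adapt them to the real API schema.
        return {"prompt": self.prompt, "images": list(self.reference_images)}


request = UnoRequest(
    prompt="A boy riding a horse.",
    reference_images=["boy.png", "horse.png"],
)
print(request.to_payload())
```

Validating up front avoids paying for a run that would be rejected for having no reference image or more than four.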

Key Features

  • High-consistency multi-subject generation that preserves each subject's unique traits across images
  • 🔁 Single → multi subject scaling via staged training
  • 🔧 Controllable layout and reference identity handling
  • 📐 Handles varying aspect ratios and resolutions (512–704px+)

Prediction Examples 🌟

  • Generating images of the same person in different settings
  • Placing multiple consistent products or characters in a single scene
  • Virtual try-on and identity-preserving e-commerce renders