UNO, developed by the ByteDance team, is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. It achieves high subject consistency while preserving controllability in both single-subject and multi-subject driven generation.
Key Features
- Progressive Cross-Modal Alignment: This technique progressively aligns textual and visual information, enhancing the model's multimodal understanding and improving consistency and accuracy in image generation.
- Universal Rotary Position Embedding (URPE): By incorporating rotary position encoding, the model maintains consistent positional representation across varying image sizes and aspect ratios, improving image generation quality.
- High-Consistency Data Synthesis Pipeline: Leverages the contextual generation capability of diffusion transformers to construct highly consistent multi-subject paired datasets, providing a solid foundation for model training.
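The rotary position encoding mentioned above can be illustrated with a minimal, generic sketch. This is not UNO's actual URPE implementation (which handles image tokens of varying sizes and aspect ratios); it only shows the core rotary mechanism, where channel pairs are rotated by position-dependent angles so that dot products between embedded vectors depend on relative position. The function name and shapes are illustrative.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply a generic rotary position embedding to a batch of vectors.

    x: (seq_len, dim) with even dim; positions: (seq_len,) integer positions.
    Each pair of channels (x1_i, x2_i) is rotated by angle position * freq_i,
    so attention scores between embedded vectors encode relative offsets.
    Illustrative sketch only, not UNO's URPE.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "rotary embedding pairs channels, so dim must be even"
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied independently in each (x1_i, x2_i) plane
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

A useful property to verify: the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is what makes the encoding robust to shifts.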
Use Cases
- Character Consistency Generation: Ensures consistent appearance of characters across different images—useful in domains such as comics, animations, or games.
- Product Advertisement Generation: Maintains product visual consistency in various scenes or backgrounds, beneficial for e-commerce marketing content creation.
- Multi-Subject Scene Generation: Enables the coherent generation of images containing multiple specific subjects, preserving distinct features of each entity.
Accelerated Inference
Our accelerated inference approach builds on optimization technology from WaveSpeedAI. The fused pipeline reduces computational overhead and latency, enabling rapid image generation without compromising output quality. The system is designed to handle large-scale inference workloads efficiently while letting real-time applications balance speed and accuracy. For further details, please refer to the blog post.
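As a rough sketch of what a hosted-inference request for a multi-image conditioned model like UNO might look like, the helper below assembles a JSON payload from a text prompt and reference image URLs. All field names (`prompt`, `images`, `size`) and the 1-4 reference-image limit are illustrative assumptions, not the actual WaveSpeedAI API schema; consult the provider's API reference for the real endpoint and fields.

```python
import json

def build_uno_request(prompt, reference_image_urls, size="1024*1024"):
    """Assemble an illustrative JSON payload for a hosted subject-to-image endpoint.

    NOTE: field names and limits here are hypothetical placeholders,
    not the documented WaveSpeedAI schema.
    """
    refs = list(reference_image_urls)
    if not 1 <= len(refs) <= 4:
        # assumed bound for illustration; check the provider's documented limit
        raise ValueError("expected 1-4 reference images")
    return json.dumps({
        "prompt": prompt,
        "images": refs,
        "size": size,
    })
```

The returned string would then be sent as the request body of an HTTP POST to the provider's generation endpoint, with authentication handled per the provider's documentation.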