UNO – Universal In‑Context Diffusion Transformer 📸
A subject-driven image synthesis model from ByteDance Research that enables both single-subject and multi-subject image generation with high consistency and controllability, built on diffusion transformers.
Implementation ✨
This model leverages a two-stage progressive cross‑modal alignment strategy, combined with Universal Rotary Position Embedding (UnoPE):
- Stage I: Fine-tune a pretrained T2I (text-to-image) model using generated single-subject in-context data.
- Stage II: Further train on multi-subject paired data to support scenes with multiple specified subjects (a minimal training-schedule sketch follows below).
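The staging can be pictured as the same backbone passing through two consecutive fine-tuning rounds on different data mixes. Below is a minimal PyTorch sketch of that schedule; `DiTBackbone`, `finetune`, the dummy loaders, and the MSE loss are placeholders standing in for the real model, data, and diffusion objective, not UNO's actual training code.

```python
# Minimal sketch of the two-stage schedule, assuming a generic PyTorch setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class DiTBackbone(nn.Module):
    """Stand-in for the pretrained T2I diffusion transformer."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, latents, cond):
        # Real model: denoise `latents` conditioned on text + reference tokens.
        return self.proj(latents + cond)

def finetune(model, loader, steps, lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            latents, cond, target = next(it)
        except StopIteration:
            it = iter(loader)
            latents, cond, target = next(it)
        loss = nn.functional.mse_loss(model(latents, cond), target)  # placeholder diffusion loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def dummy_loader(n=32, dim=64):
    # Dummy tensors standing in for (noisy latents, conditioning, denoising target).
    data = TensorDataset(torch.randn(n, dim), torch.randn(n, dim), torch.randn(n, dim))
    return DataLoader(data, batch_size=8, shuffle=True)

model = DiTBackbone()
model = finetune(model, dummy_loader(), steps=10)  # Stage I: single-subject in-context pairs
model = finetune(model, dummy_loader(), steps=10)  # Stage II: multi-subject paired data
```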
Highlights:
- Built on Diffusion Transformers (DiT) with a FLUX.1-dev backbone
- UnoPE maintains subject identity and reduces confusion across multiple subjects (see the positional-offset sketch after this list)
- Input: 1–4 reference images + text prompt
- Output: synthesized image reflecting consistent subject(s) in context
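To illustrate the idea behind UnoPE, here is a small hedged sketch: reference-image tokens receive 2D rotary position indices that are offset past the target image's grid (and past each other), so their positions never collide with the generated image's tokens. The index layout and the simplified RoPE below are assumptions for illustration, not UNO's exact implementation.

```python
import torch

def grid_ids(h, w, row0=0, col0=0):
    """(h*w, 2) tensor of (row, col) position indices starting at (row0, col0)."""
    rows = torch.arange(row0, row0 + h).repeat_interleave(w)
    cols = torch.arange(col0, col0 + w).repeat(h)
    return torch.stack([rows, cols], dim=-1)

def unope_like_ids(target_hw, ref_hws):
    """Target tokens start at (0, 0); each reference grid is shifted diagonally
    past everything placed so far, so indices never overlap across images."""
    th, tw = target_hw
    ids = [grid_ids(th, tw)]
    off_r, off_c = th, tw
    for rh, rw in ref_hws:
        ids.append(grid_ids(rh, rw, off_r, off_c))
        off_r += rh
        off_c += rw
    return torch.cat(ids, dim=0)

def rope_rotate(x, pos, base=10000.0):
    """Apply a 1D rotary rotation to features x (..., d) with positions pos (...,)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos[..., None].float() * freqs          # (..., d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: a 4x4 target latent plus two 2x2 reference latents, 16-dim tokens.
ids = unope_like_ids((4, 4), [(2, 2), (2, 2)])        # (24, 2)
tokens = torch.randn(ids.shape[0], 16)
# Rotate half the channels by the row index, half by the column index.
rotated = torch.cat([rope_rotate(tokens[:, :8], ids[:, 0]),
                     rope_rotate(tokens[:, 8:], ids[:, 1])], dim=-1)
```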
Key Features
- ✅ High-consistency multi-subject generation that preserves unique subject traits across images
- 🔁 Single → multi subject scaling via staged training
- 🔧 Controllable layout and reference identity handling
- 📐 Handles varying aspect ratios and resolutions (512–704px+); see the resolution helper below
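As a small illustration of the resolution handling mentioned above, the helper below picks a width/height pair inside the 512–704 px range that best matches a requested aspect ratio. The multiple-of-16 rounding and the exact bounds are assumptions, not the model's documented constraint.

```python
def pick_resolution(aspect_ratio: float, lo: int = 512, hi: int = 704, step: int = 16):
    """Return (width, height) within [lo, hi] whose ratio is closest to aspect_ratio."""
    best = None
    for w in range(lo, hi + 1, step):
        for h in range(lo, hi + 1, step):
            err = abs(w / h - aspect_ratio)
            if best is None or err < best[0]:
                best = (err, w, h)
    return best[1], best[2]

print(pick_resolution(16 / 9))   # widest available: (704, 512)
print(pick_resolution(1.0))      # square: (512, 512)
```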
Prediction Examples 🌟
- Generating images of the same person in different settings
- Placing multiple consistent products or characters in a single scene
- Virtual try-on and identity-preserving e-commerce renders