
HiDream-O1-Image-Dev: The 8B Pixel-Native Model That Beat 56B FLUX.2

HiDream-O1-Image-Dev is an 8B distilled image model that drops the VAE and the external text encoder, generates 2K natively, and outscores models 7x its size on GenEval, DPG, and HPSv3.


On May 8, 2026, HiDream-ai open-sourced HiDream-O1-Image under the MIT license — and the architecture choice is the headline. Where almost every recent text-to-image model is a latent diffusion transformer (DiT operating on VAE-compressed tokens, with text routed through a frozen T5 or CLIP), HiDream-O1 throws out the latent stack entirely. It runs the diffusion transformer on raw pixels, with text and task conditions sharing the same token space.

Two checkpoints shipped: the full HiDream-O1-Image (50 steps, CFG 5.0) and the distilled HiDream-O1-Image-Dev (28 steps, CFG 0.0). Both have 8B parameters. As of May 5, 2026, the model — codenamed Peanut — sits at #8 on the Artificial Analysis Text-to-Image Arena, the highest-ranked open-weight entry on the board.

This piece walks through what’s actually different about the architecture, what the Dev distillation gives up versus the full model, and how the reported benchmarks line up against FLUX.2, Qwen-Image, and SD 3.5 Large.

The Pixel-Level Unified Transformer

Modern open image models almost universally share a recipe (sketched in code below):

  1. A VAE compresses 1024×1024 RGB into a ~128×128 latent grid, which patchification then turns into ~64×64 tokens.
  2. A text encoder (T5-XXL, CLIP, Gemma) embeds the prompt in a separate vector space.
  3. A DiT denoises the latent tokens, cross-attending to the text embedding.
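
In code, that stack looks roughly like the sketch below. The names and call signatures are generic placeholders for the three-component pattern, not any particular library's API.

def latent_t2i(prompt, vae, text_encoder, dit, sampler):
    # 2. The prompt is embedded in a separate, frozen vector space.
    text_emb = text_encoder(prompt)
    # 1. Denoising starts from noise at the VAE's latent resolution, not in pixels.
    latents = sampler.init_noise(shape=(128, 128))
    # 3. The DiT denoises latent tokens while cross-attending to the text embedding.
    for t in sampler.timesteps:
        latents = dit.denoise(latents, t, context=text_emb)
    # Only at the very end does the VAE decode back to RGB.
    return vae.decode(latents)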

This is efficient — diffusion happens at 1/64th the spatial resolution — but it stacks three independently-trained components, each with its own failure modes. Latent VAEs lose fine detail and bleed colors at compression boundaries. Text encoders trained for retrieval don’t necessarily encode the spatial reasoning a generator needs. Cross-attention between two foreign embedding spaces is where text rendering and small-object accuracy typically break down.

HiDream-O1 collapses the stack. The Pixel-level Unified Transformer (UiT) treats pixel patches, text tokens, and task-condition tokens as members of one shared sequence. There is no VAE — the model operates on raw RGB patches. There is no separate text encoder — text tokens flow into the same transformer. Diffusion happens directly in pixel space.
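
To make "one shared sequence" concrete, here is a minimal PyTorch sketch of the tokenization step. The patch size, dimensions, and module names are illustrative assumptions, not HiDream-O1's actual code.

import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    def __init__(self, patch=16, d_model=1024, vocab=32000):
        super().__init__()
        # Raw RGB patches project straight into the model dimension: no VAE in front.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Text tokens live in the same embedding space, not a frozen external encoder.
        self.text_embed = nn.Embedding(vocab, d_model)

    def forward(self, pixels, text_ids):
        # pixels: (B, 3, H, W) noisy RGB image at the current diffusion step
        img_tokens = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, H*W/patch^2, D)
        txt_tokens = self.text_embed(text_ids)                            # (B, T, D)
        # One sequence: text, task conditions, and pixel patches all attend to each other.
        return torch.cat([txt_tokens, img_tokens], dim=1)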

The cost is obvious (far more tokens per image, since you give up the ~64× spatial compression of the latent space) and the team's answer is sparsity and scheduling: the released technical report describes a flash scheduler with predefined timesteps that lets the Dev variant converge in 28 steps with guidance scale 0. The benefit, if the architecture works, is that every modality lives in one representation, which is exactly what you want when the same model needs to do text-to-image, instruction-driven editing, multi-reference personalization, and storyboard generation without head-swaps.

What HiDream-O1-Image-Dev actually does

The Dev checkpoint is guidance-distilled — it's trained to reproduce the output of classifier-free guidance in a single forward pass, so you set guidance_scale=0.0 and skip the doubled per-step compute that CFG normally requires. That alone roughly halves wall-clock time at any step count.
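
The practical difference is easiest to see in per-step sampling code. This is a generic sketch of the two regimes, not the repo's implementation; model stands in for any denoiser with a (noisy image, timestep, text) interface.

import torch

@torch.no_grad()
def cfg_step(model, x_t, t, text, guidance_scale=5.0):
    # Classifier-free guidance: two forward passes per step (conditional + unconditional),
    # then extrapolate between them. This is what the full 50-step checkpoint pays for.
    pred_cond = model(x_t, t, text)
    pred_uncond = model(x_t, t, None)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

@torch.no_grad()
def distilled_step(model, x_t, t, text):
    # Guidance-distilled sampling (Dev): the model was trained to emit the
    # CFG-shaped prediction directly, so one pass per step with guidance_scale=0.
    return model(x_t, t, text)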

Step count drops from 50 to 28 versus the full model. Combined with the CFG savings, Dev is meaningfully faster — the team's own framing is a "balanced trade-off between quality and computational demand," which matches the I1 Dev variant's positioning a year earlier.
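
Ignoring scheduler overhead, the back-of-envelope speedup from the two changes together looks like this:

full_passes = 50 * 2   # full model: 50 steps, CFG doubles the forward passes
dev_passes = 28 * 1    # Dev: 28 steps, no CFG
print(dev_passes / full_passes)  # 0.28 -> roughly 3.5x fewer transformer passes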

Capabilities supported by the same checkpoint:

  • Text-to-image at up to 2048×2048 native resolution (no upscaler in the pipeline)
  • Instruction-based editing (--ref_images input.jpg --prompt "remove the earphones")
  • Subject-driven personalization — multi-reference identity preservation, takes 2+ reference images of the same subject and places them in new contexts
  • Long-text rendering — multilingual, with reported near-parity scores on English and Mandarin LongText-Bench
  • Storyboard generation — sequential frames with consistent characters/setting

The four tasks share weights. There’s no LoRA swap or adapter loading between text-to-image and editing — you just pass --ref_images to switch modes.
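
A hypothetical sketch of what that kind of argument-driven routing could look like; the real inference.py may organize this differently.

def select_task(prompt, ref_images):
    # No adapter or LoRA swap: the task is implied by the inputs alone.
    if not ref_images:
        return "text_to_image"
    if len(ref_images) == 1:
        return "instruction_edit"        # e.g. "remove the earphones" applied to one input image
    return "subject_personalization"     # 2+ references of the same subject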

Benchmarks: where the 8B claim actually holds up

The technical report compares against the obvious open-weight peers (FLUX.2, Qwen-Image, SD 3.5 Large) and the strongest closed models on the human-preference benchmark. Five suites are reported:

| Benchmark | What it measures | HiDream-O1 (8B) | FLUX.2 Dev (56B) | Qwen-Image (27B) | SD 3.5 Large (13.6B) |
|---|---|---|---|---|---|
| GenEval | Compositional accuracy (objects, count, color, position) | 0.90 | 0.87 | 0.87 | 0.71 |
| DPG-Bench | Dense prompt alignment | 89.83 | 87.57 | 88.32 | 84.08 |
| HPSv3 | Human preference (12 categories) | 10.37 | 9.28 | 9.94 | not reported |
| CVTG-2K | Complex visual text (2–5 regions) | 0.9128 | 0.8926 | 0.8288 | 0.6548 |
| LongText-Bench | Multilingual long-text rendering | 0.979 EN / 0.978 ZH | not reported | not reported | not reported |

Two things stand out. First, HiDream-O1 wins every reported benchmark while being 7× smaller than FLUX.2 Dev and 3.4× smaller than Qwen-Image. Parameter count is no longer a clean proxy for quality once architecture and data composition diverge. Second, the text-rendering numbers are the most interesting — CVTG-2K and LongText-Bench specifically stress the failure mode where latent-space models historically collapse, and HiDream-O1’s pixel-native design is exactly the kind of change that should help there. The 0.979 / 0.978 EN/ZH split suggests the gain isn’t a quirk of English tokenization either.

The HPSv3 score of 10.37 puts it ahead of DALL-E 3 and GPT Image 2 in the report's tables — a closed-vs-open comparison that was unthinkable in this size class twelve months ago.

The Reasoning-Driven Prompt Agent

Bundled with the release is a separate prompt agent — not part of the diffusion model, but a wrapper that runs Gemma-4-31B-it (or any OpenAI-compatible API) over the user’s instruction before generation. The agent outputs JSON with three fields: reasoning trace, resolved implicit knowledge (e.g. “user said ‘a Tang Dynasty general’ — that means a specific armor style and weapons”), and a refined prompt with explicit layout/text-rendering specifications.

This is the same pattern as DALL-E 3’s GPT-4 prompt rewriter and Imagen 3’s Gemini integration, but shipped as a separate, swappable component you can run locally. For prompts where layout reasoning matters — multi-region text, specific spatial relationships, cultural specificity — running the agent first is what closes the gap to closed-source systems that have an LLM in the pipeline by default.
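
If you want to wire that pattern up yourself against any OpenAI-compatible endpoint, a minimal sketch looks like the following. The endpoint URL, served model name, and exact JSON keys are assumptions, not the repo's agent code.

import json
from openai import OpenAI

# Any OpenAI-compatible server works; URL and key here are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = (
    "Rewrite the user's image request. Respond with JSON containing "
    "'reasoning', 'implicit_knowledge', and 'refined_prompt'. Make layout and "
    "any text to be rendered explicit in the refined prompt."
)

def refine(user_prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gemma-4-31b-it",  # or whatever model the server exposes
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

refined = refine("a Tang Dynasty general guarding a city gate at dusk")
# refined["refined_prompt"] is what you then pass to inference.py as --prompt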

Running it locally

The repo is straightforward:

git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
pip install -r requirements.txt

Text-to-image with Dev:

python inference.py \
    --model_path /path/to/HiDream-O1-Image-Dev \
    --model_type dev \
    --prompt "A dog holds a sign that says 'HiDream-O1-Image release.'" \
    --output_image results/output.png

Editing with a reference image:

python inference.py \
    --model_path /path/to/HiDream-O1-Image-Dev \
    --model_type dev \
    --prompt "remove the earphones" \
    --ref_images input.jpg \
    --output_image results/edited.png

Subject-driven personalization works the same way — pass multiple reference images of the same subject:

python inference.py \
    --model_path /path/to/HiDream-O1-Image-Dev \
    --prompt "A young boy stands on steps wearing light blue jeans..." \
    --ref_images ref1.jpg ref2.jpg ref3.jpg \
    --output_image results/personalized.png

A web demo (python app.py --model_path ... --port 7860) is also included.

Flash attention is recommended but not required — there's a documented one-line change in models/pipeline.py if it's not available. VRAM scales with output resolution; 2K×2K generation is the model's headline capability, but expect substantial memory.
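
The memory pressure comes straight from the token count. Assuming a 16-pixel patch (an assumption; the real model's patching may differ):

patch = 16
tokens_1k = (1024 // patch) ** 2   # 4,096 image tokens at 1024x1024
tokens_2k = (2048 // patch) ** 2   # 16,384 image tokens at 2048x2048
print(tokens_2k / tokens_1k)       # 4x the tokens, and attention memory grows faster than that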

How it differs from HiDream-I1

The original HiDream-I1, released in early 2025, was a 17B sparse-MoE DiT operating in latent space — architecturally conventional, competing on quality. O1 is a reset: the parameter count goes down to 8B, the VAE and text encoder come out, and the architecture itself is the contribution. The naming convention is also a clear nod to OpenAI's reasoning-model rebrand — "O1" signals the integrated prompt-reasoning agent, even though the diffusion model itself is a standard one-shot sampler.

If you’re choosing between them today: I1 Dev is older, well-supported across inference platforms, and proven in production. O1 Dev is newer, smaller, scores higher on every benchmark the team reported, and renders text far more reliably — but the pixel-native architecture is novel enough that third-party tooling (ComfyUI nodes, quantizations, LoRA training scripts) will take time to catch up.

Where it fits

HiDream-O1-Image-Dev is the most architecturally interesting open-weight image model release of 2026 so far. The team made a contrarian bet — drop the latent space, drop the external encoders, do everything in one transformer — and the benchmarks back the bet, especially in the long-tail categories (text rendering, complex composition, multilingual) where latent models have historically struggled.

The Dev variant specifically is the one most people will actually run: 28 steps, no CFG, MIT license, single-checkpoint multi-task. If you’ve been waiting for an open model that matches GPT Image 2 or DALL-E 3 on text-in-image quality without the closed-API price, this is it.

The repo is at github.com/HiDream-ai/HiDream-O1-Image, the Dev weights are at huggingface.co/HiDream-ai/HiDream-O1-Image-Dev, and a hosted Space is up for trying it without the local install.