Jib-Mix-Qwen-Image (Text-to-Image)
Jib-Mix-Qwen-Image is a fine-tuned text-to-image model based on Qwen-Image 20B (MMDiT), optimized through the Jib-Mix portrait enhancement pipeline. It specializes in realistic human faces, cinematic lighting, and vivid artistic styles, and produces professional-grade visuals from simple text prompts, with no LoRA setup needed.
Why it looks great
- Jib-Mix fine-tuning – Enhances facial structure, skin texture, and lighting realism, especially for close-ups and half-body portraits.
- Cinematic diffusion engine – Captures lifelike depth, atmosphere, and tone with consistent color harmony.
- Exceptional text rendering – Handles both Chinese and English typography natively, blending text naturally into the image.
- Broad style coverage – Covers photorealism, anime, oil painting, 3D, and stylized artwork in a single model.
- Identity consistency – Generates characters with coherent facial details and stable expressions across prompts.
Limits and Performance
- Max resolution per job: 1536 × 1536 pixels
- Output formats: JPEG / PNG / WEBP
- Processing speed: ~5–8 seconds per image (depending on prompt complexity)
- Prompt input: supports detailed, multi-line bilingual descriptions
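If a desired size exceeds the resolution limit, one option is to scale it down proportionally before submitting. The following is a minimal sketch, assuming the 1536-pixel limit applies to each side independently; the function name is hypothetical, and any further constraints the service may place on dimensions (such as required multiples) are not handled here.

```python
MAX_SIDE = 1536  # per-job limit listed above; assumed to apply to each side


def fit_within_limit(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down proportionally so neither side exceeds max_side.

    Sizes already within the limit are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never scale up: cap the factor at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within_limit(3072, 1536))  # a 2:1 request that is too large
    print(fit_within_limit(1024, 768))   # already within the limit
```

Scaling both sides by the same factor preserves the aspect ratio, so the composition described in the prompt should not need to change.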
Pricing
- $0.02 per image
Each image is billed individually.
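Because billing is per image, the cost of a batch is simply the image count multiplied by the unit price. A small sketch of that arithmetic follows; the function name is illustrative, and it assumes the flat $0.02 rate applies regardless of size or format, which is all this page states.

```python
from decimal import Decimal

PRICE_PER_IMAGE = Decimal("0.02")  # USD per image, from the pricing above


def estimate_cost(num_images: int) -> Decimal:
    """Return the estimated cost in USD for generating num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Decimal avoids binary floating-point rounding in currency math.
    return PRICE_PER_IMAGE * num_images


if __name__ == "__main__":
    print(estimate_cost(50))  # 50 images at $0.02 each
```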
How to Use
- Enter a prompt describing your desired image (Chinese or English).
- Set image size (width × height, up to 1536×1536).
- (Optional) Set a seed for reproducibility (-1 = random).
- Choose output format (JPEG / PNG / WEBP).
- Generate → preview → iterate with refined prompts.
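For scripted use, the steps above can be collected into a single set of parameters and checked against the documented limits before a job is submitted. The sketch below is only an illustration: the function name and the dictionary keys (`prompt`, `width`, `height`, `seed`, `output_format`) are hypothetical and do not come from any documented API for this model. It builds and validates the parameters but does not send a request.

```python
ALLOWED_FORMATS = {"jpeg", "png", "webp"}  # output formats listed on this page
MAX_SIDE = 1536                            # maximum width and height
RANDOM_SEED = -1                           # -1 requests a random seed


def build_params(prompt: str, width: int, height: int,
                 output_format: str, seed: int = RANDOM_SEED) -> dict:
    """Validate inputs against the documented limits and return a parameter dict.

    The key names are hypothetical; adapt them to whatever interface you use.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not (1 <= width <= MAX_SIDE and 1 <= height <= MAX_SIDE):
        raise ValueError(f"width and height must be between 1 and {MAX_SIDE}")
    fmt = output_format.lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "seed": seed,
        "output_format": fmt,
    }


if __name__ == "__main__":
    params = build_params(
        "Half-body portrait of a violinist, cinematic lighting, soft focus",
        width=1024, height=1536, output_format="PNG", seed=1234,
    )
    print(params)
```

Validating locally keeps failed jobs out of the generate, preview, iterate loop, and reusing the same `seed` value between runs makes prompt changes easier to compare.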
Pro tips for best quality
- Be specific — describe lighting, pose, emotion, and background for more control.
- For portraits, include keywords like cinematic lighting, soft focus, 8K detail, professional photo.
- Fix the seed to keep the subject consistent across multiple outputs.
- Experiment with styles (e.g., realistic, anime, oil painting, CG render) to explore model versatility.
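One way to apply these tips consistently is to assemble prompts from named parts, so that lighting, pose, emotion, background, and style are each considered every time. The helper below is a hypothetical sketch; joining the parts with commas is an assumption about prompt style rather than a format the model requires.

```python
def compose_prompt(subject: str, *, pose: str = "", emotion: str = "",
                   lighting: str = "", background: str = "",
                   style: str = "", extras: tuple[str, ...] = ()) -> str:
    """Join the non-empty prompt parts into one comma-separated string."""
    parts = [subject, pose, emotion, lighting, background, style, *extras]
    # Drop empty or whitespace-only parts and trim the rest.
    return ", ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    prompt = compose_prompt(
        "portrait of an elderly fisherman",
        emotion="calm expression",
        lighting="cinematic lighting",
        background="harbor at dawn",
        style="professional photo",
        extras=("soft focus", "8K detail"),
    )
    print(prompt)
```

Keeping the subject fixed and varying only `style` (for example realistic, anime, or oil painting) is a simple way to explore the model's range with an otherwise identical prompt.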
Note
- For best realism, ensure prompts describe camera angle, lighting, and environment — the model responds strongly to cinematic cues.