BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Paper: [arXiv:2505.09568](https://arxiv.org/abs/2505.09568)
How to use fuhaddesmond/illuma with Sana:
```python
# Load the model and generate an image from a text prompt
import torch
from app.sana_pipeline import SanaPipeline
from torchvision.utils import save_image

sana = SanaPipeline("configs/sana_config/1024ms/Sana_1600M_img1024.yaml")
sana.from_pretrained("hf://fuhaddesmond/illuma")

image = sana(
    prompt='a cyberpunk cat with a neon sign that says "Sana"',
    height=1024,
    width=1024,
    guidance_scale=5.0,
    pag_guidance_scale=2.0,
    num_inference_steps=18,
)
save_image(image, "output.png", nrow=1, normalize=True, value_range=(-1, 1))
```
Illuma is an image generation model cloned from Salesforce/BLIP3o-NEXT-GRPO-TexT-3B, the first truly open-source image generation model combining an autoregressive backbone (3B Qwen2.5-VL) with a SANA 1.5 diffusion decoder.

Illuma uses a two-stage generation process: the autoregressive model first predicts image tokens from the prompt, and the SANA 1.5 diffusion decoder then renders those tokens into the final image.

GRPO (Group Relative Policy Optimization) RL training improves text rendering in generated images (GenEval 0.73 → 0.90).
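The core idea of GRPO is to score each rollout against the other rollouts for the same prompt, instead of against a learned value function. A minimal sketch of the group-relative advantage computation (the function name is illustrative, not from the BLIP3o codebase):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against the mean and standard
    deviation of its own group (the core of GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: no learning signal for this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Rewards for 4 rollouts of one prompt (e.g. 1.0 = text rendered correctly)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Rollouts that render the text correctly receive positive advantages and are reinforced; the rest are pushed down, which is what drives the GenEval gain reported above.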
This model includes a custom handler (handler.py) for deployment on Hugging Face Inference Endpoints. Create an endpoint with fuhaddesmond/illuma as the model repository, then query it:

```python
import base64
from io import BytesIO

import requests
from PIL import Image

API_URL = "https://YOUR_ENDPOINT_ID.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_YOUR_TOKEN"}

payload = {
    "inputs": "A neon sign that says 'ILLUMA' glowing in purple against a dark wall",
    "parameters": {
        "seq_len": 729,
        "top_p": 0.95,
        "top_k": 1200,
    },
}

response = requests.post(API_URL, headers=headers, json=payload)

# The handler returns the generated image as a base64-encoded PNG
image_data = base64.b64decode(response.json()["image"])
image = Image.open(BytesIO(image_data))
image.save("illuma_output.png")
```
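The handler follows the Inference Endpoints custom-handler interface: a class named `EndpointHandler` in handler.py exposing `__init__` and `__call__`. A minimal sketch of that shape, with model loading and generation stubbed out (the real handler.py runs the AR model and diffusion decoder instead of the placeholder below):

```python
import base64

class EndpointHandler:
    # Inference Endpoints imports this class from handler.py at startup.
    def __init__(self, path: str = ""):
        # The real handler loads the Illuma checkpoint from `path` here.
        self.model_path = path

    def __call__(self, data: dict) -> dict:
        prompt = data["inputs"]
        params = data.get("parameters", {})
        png_bytes = self._generate(prompt, **params)
        # Return the image as a base64-encoded PNG string
        return {"image": base64.b64encode(png_bytes).decode("utf-8")}

    def _generate(self, prompt, **params):
        # Stub: the real handler runs AR token generation + diffusion decoding
        return b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
```

Returning base64 in a JSON body keeps the endpoint compatible with the plain `requests` client shown above.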
To run inference locally with the BLIP3o codebase:

```bash
# Clone the BLIP3o repo (BLIP3o-NEXT branch)
git clone --branch BLIP3o-NEXT --single-branch https://github.com/JiuhaiChen/BLIP3o.git
cd BLIP3o
pip install -e .

# Download the model
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='fuhaddesmond/illuma', repo_type='model'))"

# Run inference
python inference.py /path/to/downloaded/model
```
Alternatively, download the checkpoint from Python:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fuhaddesmond/illuma",
    repo_type="model",
)
```
| Detail | Value |
|---|---|
| Base Model | BLIP3o-NEXT-GRPO-TexT-3B |
| Parameters | ~4B (3B AR + diffusion decoder) |
| Architecture | Qwen2.5-VL + SANA 1.5 |
| License | Apache 2.0 |
| GRPO Training | GenEval 0.73 → 0.90 |
| Specialty | Text rendering in images |
```bibtex
@article{chen2025blip3,
  title={BLIP3-o: A Family of Fully Open Unified Multimodal Models},
  author={Chen, Jiuhai and others},
  journal={arXiv preprint arXiv:2505.09568},
  year={2025}
}
```