YOFO-Qwen3-VL-2B

Highlights

  • Efficient Compositional Judging: Built on Qwen3-VL-2B-Instruct, YOFO accepts an image and a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement (see the conceptual sketch after this list).
  • Fine-Grained Cross-Modal Reranking: Existing approaches face a fundamental trade-off: adapting MLLMs to output a single score conflicts with their generative nature and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Leveraging the cross-modal understanding ability of Qwen3-VL and our template-conditioned method, our model judges all requirements decomposed from the query concurrently and accurately, improving both the accuracy and interpretability of reranking.
  • Universal Judging Capabilities: Trained on general-purpose data, YOFO learns judging capabilities that transfer effectively across domains. Crucially, it can be deployed directly in specialized subdomains, such as fashion, without any fine-tuning or domain adaptation. This zero-shot generalization underscores the model's practical utility in real-world scenarios where labeled data is scarce or domain shifts are common, positioning YOFO as a versatile solution for cross-domain recommendation systems.
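
To make the first highlight concrete, here is a minimal conceptual sketch, not the model's actual code: it assumes per-position logits from a single forward pass over the image plus the requirement template, known token ids for "yes"/"no", and known end positions for each requirement. All names below are hypothetical; the released compute_score API (shown later) handles this readout internally.

import torch

def read_requirement_scores(logits, requirement_end_positions, yes_id, no_id):
    # logits: [seq_len, vocab_size] tensor from a single forward pass over the
    # image plus the structured requirement template (hypothetical interface).
    scores = []
    for pos in requirement_end_positions:
        pair = logits[pos, [yes_id, no_id]]            # logits of the "yes" and "no" tokens
        p_yes = torch.softmax(pair, dim=-1)[0].item()  # probability the requirement is met
        scores.append(p_yes)
    return scores

Because every requirement is judged within the same forward pass, the cost stays at one inference step regardless of how many requirements the template contains.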

For more details, including the model architecture, implementation, and experimental results, please refer to our paper.

Usage

  • Requirements
transformers>=4.57.0
torch==2.5.1
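
These can be installed with pip, for example (note that attn_implementation="flash_attention_2" in the example below additionally requires the flash-attn package):

pip install "transformers>=4.57.0" torch==2.5.1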

Example

from transformers import AutoModel

# Load YOFO from the Hub. trust_remote_code=True is required because
# the compute_score interface is defined in the repository's custom code.
model_path = "Accio-Lab/yofo-Qwen3-VL-2B-Instruct"
yofo = AutoModel.from_pretrained(
    model_path,
    torch_dtype="bfloat16",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
yofo.eval()  # inference only
yofo.cuda()  # move the model to GPU

Now you can use the model's compute_score function to evaluate how well an image satisfies a given set of requirements. The function accepts a list of input pairs, where each pair consists of an image and a corresponding list of textual requirements. For each input pair, it returns a list of relevance scores, one per requirement, indicating the model's confidence that the requirement is met by the image.

data = [
    {
        "image": "../../datasets/laion-reranker/images/605257.jpg",
        "requirements": [
            "The item has a visible pattern.",
            "The item has long sleeves.",
            "The item has an A-line silhouette.",
            "The item's primary color is red."
        ],
    },
    {
        "image": "../../datasets/laion-reranker/images/780764.jpg",
        "requirements": [
            "The item is a dress.",
            "The item has long sleeves.",
            "The item features lace-up details.",
            "The item is black."
        ],
    },
]
scores = yofo.compute_score(data, batch_size=2, num_workers=2)
# [[0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06], [0.9921875, 0.90234375, 0.0091552734375, 1.0]]
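
For reranking, the per-requirement scores can be aggregated into a single relevance score per candidate. The mean used below is just one illustrative aggregation policy; it is not prescribed by the model:

candidate_scores = [sum(s) / len(s) for s in scores]  # mean requirement score per image
ranking = sorted(range(len(data)), key=lambda i: candidate_scores[i], reverse=True)
print(ranking)  # candidate indices, best match first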

Contact

  • If you have any questions about this model, please feel free to contact: tattoo.ysl@gmail.com.
  • We are actively seeking self-motivated researchers and research interns to join our team!