You Only Forward Once: An Efficient Compositional Judging Paradigm
Paper: arXiv 2511.16600
For more details, including the model architecture, implementation details, and experimental results, please refer to our paper.
Requirements:

```
transformers>=4.57.0
torch==2.5.1
```
```python
from transformers import AutoModel

model_path = "Accio-Lab/yofo-Qwen3-VL-2B-Instruct"

# trust_remote_code is required because the scoring logic lives in the
# model repository rather than in the transformers library itself.
yofo = AutoModel.from_pretrained(
    model_path,
    torch_dtype="bfloat16",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
yofo.eval()
yofo.cuda()
```
Now you can use the model's `compute_score` function to evaluate how well an image satisfies a given set of requirements. The function accepts a list of input pairs, where each pair consists of an image and a corresponding list of textual requirements. For each input pair, it returns a list of relevance scores, one per requirement, indicating the model's confidence that the requirement is met by the image.
```python
data = [
    {
        "image": "../../datasets/laion-reranker/images/605257.jpg",
        "requirements": [
            "The item has a visible pattern.",
            "The item has long sleeves.",
            "The item has an A-line silhouette.",
            "The item's primary color is red.",
        ],
    },
    {
        "image": "../../datasets/laion-reranker/images/780764.jpg",
        "requirements": [
            "The item is a dress.",
            "The item has long sleeves.",
            "The item features lace-up details.",
            "The item is black.",
        ],
    },
]

scores = yofo.compute_score(data, batch_size=2, num_workers=2)
# [[0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06], [0.9921875, 0.90234375, 0.0091552734375, 1.0]]