An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs
Paper: arXiv:2510.14660
Table 1. Results on the eval set
| Verifier Model | Rubric Precision | Rubric Recall | Rubric F1 | Sample Precision | Sample Recall | Sample F1 | Avg. F1 |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | 0.41 | 0.49 | 0.34 | 0.48 | 0.40 | 0.32 | 0.33 |
| Qwen2.5-3B | 0.42 | 0.47 | 0.43 | 0.49 | 0.46 | 0.43 | 0.43 |
| Qwen3-4B | 0.56 | 0.62 | 0.57 | 0.61 | 0.58 | 0.58 | 0.58 |
| Qwen3-8B | 0.54 | 0.66 | 0.55 | 0.62 | 0.61 | 0.57 | 0.56 |
| LLaMA-3.1-8B | 0.45 | 0.54 | 0.42 | 0.34 | 0.41 | 0.32 | 0.37 |
| Qwen3-30B-A3B | 0.56 | 0.66 | 0.56 | 0.63 | 0.62 | 0.62 | 0.58 |
| Qwen2.5-32B-Instruct | 0.60 | 0.67 | 0.60 | 0.67 | 0.68 | 0.64 | 0.62 |
| Search-Gen-V-1.7B (SFT) | 0.63 | 0.62 | 0.62 | 0.66 | 0.66 | 0.66 | 0.64 |
| Search-Gen-V-4B (SFT) | 0.70 | 0.66 | 0.68 | 0.72 | 0.72 | 0.71 | 0.70 |
| Search-Gen-V-4B (SFT+RL) | 0.71 | 0.68 | 0.70 | 0.74 | 0.74 | 0.73 | 0.72 |
| Qwen3-235B-A22B-Instruct-2507 | 0.72 | 0.73 | 0.73 | 0.76 | 0.76 | 0.76 | 0.74 |
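
The Avg. F1 column in Table 1 appears to be the arithmetic mean of Rubric F1 and Sample F1. The table does not state this, so treat it as an assumption. The sketch below checks that reading against the reported values; rows are copied from Table 1, and the helper names `mean_f1` and `max_deviation` are illustrative only.

```python
# Rows copied from Table 1: (model, Rubric F1, Sample F1, reported Avg. F1).
ROWS = [
    ("Qwen3-1.7B", 0.34, 0.32, 0.33),
    ("Qwen2.5-3B", 0.43, 0.43, 0.43),
    ("Qwen3-4B", 0.57, 0.58, 0.58),
    ("Qwen3-8B", 0.55, 0.57, 0.56),
    ("LLaMA-3.1-8B", 0.42, 0.32, 0.37),
    ("Qwen3-30B-A3B", 0.56, 0.62, 0.58),
    ("Qwen2.5-32B-Instruct", 0.60, 0.64, 0.62),
    ("Search-Gen-V-1.7B (SFT)", 0.62, 0.66, 0.64),
    ("Search-Gen-V-4B (SFT)", 0.68, 0.71, 0.70),
    ("Search-Gen-V-4B (SFT+RL)", 0.70, 0.73, 0.72),
    ("Qwen3-235B-A22B-Instruct-2507", 0.73, 0.76, 0.74),
]


def mean_f1(rubric_f1: float, sample_f1: float) -> float:
    """Assumed definition of Avg. F1: the plain mean of the two F1 scores."""
    return (rubric_f1 + sample_f1) / 2


def max_deviation(rows) -> float:
    """Largest gap between the assumed mean and the reported Avg. F1."""
    return max(abs(mean_f1(r, s) - avg) for _, r, s, avg in rows)


if __name__ == "__main__":
    for name, r, s, avg in ROWS:
        print(f"{name}: mean={mean_f1(r, s):.3f} reported={avg:.2f}")
    print(f"max deviation: {max_deviation(ROWS):.3f}")
```

Because the table entries are already rounded to two decimals, small gaps between the recomputed mean and the reported value are expected and do not by themselves contradict the assumption.
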
Table 2. Accuracy of rubric verification on long-form answers from DeepResearch Bench
| Verifier Model | Precision | Recall | F1 |
|---|---|---|---|
| Qwen3-4B | 0.42 | 0.56 | 0.42 |
| Search-Gen-V-4B | 0.59 | 0.57 | 0.57 |
| Qwen3-235B-A22B | 0.57 | 0.67 | 0.61 |
Table 3. Results on the short-form workload (HotpotQA)
| Verifier Model | Precision | Recall | F1 |
|---|---|---|---|
| EM | 0.84 | 0.80 | 0.82 |
| Qwen3-4B | 0.83 | 0.70 | 0.71 |
| Search-Gen-V-4B | 0.86 | 0.76 | 0.77 |
| Qwen3-235B-A22B | 0.87 | 0.78 | 0.80 |
| EM + Qwen3-4B | 0.94 | 0.92 | 0.93 |
| EM + Search-Gen-V-4B | 0.95 | 0.93 | 0.94 |
| EM + Qwen3-235B-A22B | 0.96 | 0.94 | 0.95 |
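
Table 3 does not describe how EM and a verifier are combined in the "EM + ..." rows. One plausible reading, assumed here and not confirmed by this page, is a fallback scheme: accept an answer when it exactly matches the gold answer after normalization, and otherwise defer to the generative verifier. In this minimal sketch the `verifier` callable is a hypothetical stand-in for a model call, and the normalization follows the common SQuAD-style convention (lowercase, strip punctuation and articles).

```python
import re
import string
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Exact match (EM) after normalization."""
    return normalize(prediction) == normalize(gold)


def em_then_verifier(
    prediction: str,
    gold: str,
    verifier: Callable[[str, str], bool],
) -> bool:
    """Assumed combination: trust EM hits, ask the verifier only on EM misses."""
    if exact_match(prediction, gold):
        return True
    # Hypothetical verifier: returns True if it judges the prediction correct.
    return verifier(prediction, gold)


if __name__ == "__main__":
    # Toy verifier for demonstration only: substring containment.
    def toy_verifier(pred: str, gold: str) -> bool:
        return normalize(gold) in normalize(pred)

    print(em_then_verifier("The Eiffel Tower.", "eiffel tower", toy_verifier))
    print(em_then_verifier("Paris, France", "Paris", toy_verifier))
    print(em_then_verifier("London", "Paris", toy_verifier))
```

Under this scheme the verifier only sees answers that EM rejected, so it can recover paraphrased correct answers that EM misses. This would fit the higher recall of the combined rows, though the actual procedure may differ.
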
@article{ma2025searchgenv,
title={An Efficient Rubric-based Generative Verifier for Search-Augmented {LLMs}},
author={Ma, Linyue and Xu, Yilong and Long, Xiang and Zheng, Zhi},
journal={arXiv preprint arXiv:2510.14660},
year={2025},
url={https://arxiv.org/abs/2510.14660}
}