arxiv:2606.00660

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

Published on May 30

Submitted by James X. Zhao on Jun 2


Authors:

Hui Chen ,

Abstract

FineVerify is a self-verification framework for agentic search that improves accuracy through decomposed sub-question checking and trajectory selection.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
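The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Verifier` signature, the `Verdict` class, the mean aggregation, and the keyword-based `toy_verify` are all assumptions made for the example. In the paper the sub-questions and the per-check judgments come from the model itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical verifier signature: (sub_question, candidate) -> score in [0, 1].
Verifier = Callable[[str, str], float]


@dataclass
class Verdict:
    candidate: str
    scores: List[float]  # one score per sub-question, in input order

    @property
    def total(self) -> float:
        # Assumption: aggregate by a plain mean; the paper may aggregate differently.
        return sum(self.scores) / len(self.scores)


def select_candidate(
    sub_questions: Sequence[str],
    candidates: Sequence[str],
    verify: Verifier,
) -> Verdict:
    """Score every candidate on every sub-question and return the best one."""
    if not sub_questions or not candidates:
        raise ValueError("need at least one sub-question and one candidate")
    verdicts = [
        Verdict(c, [verify(q, c) for q in sub_questions]) for c in candidates
    ]
    # max() keeps the first candidate on ties.
    return max(verdicts, key=lambda v: v.total)


# Toy stand-in for a model-based check: a sub-question passes when the
# candidate text contains the expected keyword (purely illustrative).
REQUIRED = {
    "Was the author born in 1900?": "1900",
    "Did the author write novels?": "novel",
}


def toy_verify(sub_question: str, candidate: str) -> float:
    return 1.0 if REQUIRED[sub_question] in candidate else 0.0


best = select_candidate(
    list(REQUIRED),
    ["poet born 1900", "novelist born 1900", "novelist born 1950"],
    toy_verify,
)
print(best.candidate, best.scores)
```

The per-sub-question `scores` list doubles as the verification trace: it records which check each candidate passed or failed, which is what makes the selection auditable.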


Community

JamesXZ

Paper submitter about 21 hours ago

Code and data are available at: https://github.com/XuZhao0/fineverify

anp2

about 3 hours ago

The decomposition is the strong move — turning a holistic "is this good?" into local sub-question checks does what it should, because small checks are easier to score honestly than a global one. Worth being precise about what it buys, though: it attacks the calibration problem (score-based selection failing on miscalibrated holistic judgments), not the independence one. The verifier and the generator are still the same model reading the same knowledge.

That matters because the two failure modes come apart. Self-verification, fine-grained or not, catches incoherence — a candidate that contradicts itself across sub-checks — and it catches calibration noise. What it structurally can't catch is shared error: when the model is confidently wrong about a sub-fact, it generates the wrong answer and then verifies it as correct, because both halves consult the same internal belief, not the world. The sub-question check passes precisely because the model "knows" the wrong thing.

Which is, I think, why the interpretable traces are the most durable contribution here — not as the agent's own verdict, but as the surface an external check (a human, a tool that actually runs the lookup, a second independent model) can audit to find the shared-error cases the self-verifier is blind to by construction. Decomposition raises the ceiling on calibration; the traces are what let something outside the loop raise the ceiling on independence.
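A toy sketch of the shared-error point made above, under loud assumptions: the "model" is reduced to a dictionary of beliefs, the external check is a second dictionary standing in for a real lookup, and the function names are hypothetical. It only illustrates why a self-check passes on a confidently wrong sub-fact while an audit of the same trace against an outside source flags it.

```python
from typing import Dict, List, Tuple

# (sub_question, answer, self_check_passed)
TraceRow = Tuple[str, str, bool]


def self_verified_trace(belief: Dict[str, str], sub_questions: List[str]) -> List[TraceRow]:
    """Generate and self-verify each sub-answer from the same belief store."""
    trace = []
    for q in sub_questions:
        answer = belief[q]            # generator reads the belief
        passed = answer == belief[q]  # verifier reads the same belief, so this always passes
        trace.append((q, answer, passed))
    return trace


def audit(trace: List[TraceRow], reference: Dict[str, str]) -> List[str]:
    """Return sub-questions the self-check passed but the outside source contradicts."""
    return [q for q, answer, passed in trace if passed and reference[q] != answer]


# Assumed toy data: one belief is confidently wrong.
belief = {"capital of Australia": "Sydney", "capital of France": "Paris"}
reference = {"capital of Australia": "Canberra", "capital of France": "Paris"}

trace = self_verified_trace(belief, list(belief))
flagged = audit(trace, reference)
print(flagged)
```

The self-check cannot separate the two sub-facts because both halves read `belief`; only the comparison against `reference`, which sits outside the loop, does.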

