How Far Can Unsupervised RLVR Scale LLM Training?
Abstract
Unsupervised reinforcement learning with verifiable rewards faces fundamental limitations in scaling large language model training due to inherent convergence properties and confidence-correctness misalignment, though external reward methods show promise for overcoming these barriers.
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals and show promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods into intrinsic versus external based on their reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when it is misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model's prior rather than by engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable for test-time training on small datasets, and we propose the Model Collapse Step to measure the model's prior, serving as a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.
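The paper's exact reward definitions and training setup are not reproduced on this page, but the sharpening mechanism can be illustrated with a minimal toy sketch. Assuming a self-consistency-style intrinsic reward (agreement with the majority-vote answer, one common label-free signal), policy-gradient updates reinforce whatever the initial distribution already favors, sharpening it regardless of correctness; an entropy threshold stands in as a toy analogue of the proposed Model Collapse Step. All names and constants below are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Toy illustration (not the paper's implementation): a label-free,
# self-consistency-style reward sharpens a categorical "policy" over
# candidate answers toward its initial mode; an entropy threshold
# serves as a toy analogue of the Model Collapse Step indicator.

rng = np.random.default_rng(0)

# Logits encode the model's prior over 5 candidate answers; index 0
# is its most confident answer, which may or may not be correct.
logits = np.array([1.5, 1.0, 0.5, 0.0, -0.5])
lr, n_samples, entropy_floor = 0.5, 64, 0.1
collapse_step = None

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1, 501):
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    if entropy < entropy_floor:
        collapse_step = step  # distribution has (nearly) collapsed
        break
    samples = rng.choice(len(p), size=n_samples, p=p)
    # Intrinsic reward: agreement with the majority answer among the
    # samples (self-consistency; no ground truth is used anywhere).
    majority = np.bincount(samples, minlength=len(p)).argmax()
    rewards = (samples == majority).astype(float)
    # REINFORCE with a mean baseline: grad log p(a) = one_hot(a) - p.
    adv = rewards - rewards.mean()
    one_hots = np.eye(len(p))[samples]
    logits += lr * (adv[:, None] * (one_hots - p)).mean(axis=0)

print("final distribution:", softmax(logits).round(3))
print("toy Model Collapse Step:", collapse_step)
```

Because the reward here is defined entirely by the model's own samples, the update direction is fixed by the prior: the run collapses onto the initial mode whether or not it is correct, which mirrors the abstract's claim that collapse timing is a property of the model's prior rather than of engineering choices.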
Community
We investigate the mechanisms and potential applications of unsupervised RLVR (URLVR), ask how far it can scale LLM training, and find that it is particularly well suited to test-time training and to quantifying model priors.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction (2026)
- Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering (2026)
- Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty (2026)
- Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning (2026)
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning (2026)