Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning Paper • 2510.03259 • Published Sep 26, 2025 • 57
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense Paper • 2510.07242 • Published Oct 8, 2025 • 30
First Try Matters: Revisiting the Role of Reflection in Reasoning Models Paper • 2510.08308 • Published Oct 9, 2025 • 24
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward Paper • 2510.03222 • Published Oct 3, 2025 • 76
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States Paper • 2510.11052 • Published Oct 13, 2025 • 53
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment Paper • 2510.10201 • Published Oct 11, 2025 • 36
Demystifying Reinforcement Learning in Agentic Reasoning Paper • 2510.11701 • Published Oct 13, 2025 • 33
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs Paper • 2510.11696 • Published Oct 13, 2025 • 182
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning Paper • 2510.27044 • Published Oct 30, 2025 • 6
Reverse-Engineered Reasoning for Open-Ended Generation Paper • 2509.06160 • Published Sep 7, 2025 • 151
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26, 2025 • 164
FlowRL: Matching Reward Distributions for LLM Reasoning Paper • 2509.15207 • Published Sep 18, 2025 • 119
Towards a Unified View of Large Language Model Post-Training Paper • 2509.04419 • Published Sep 4, 2025 • 76
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models Paper • 2509.06949 • Published Sep 8, 2025 • 57
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR Paper • 2509.23808 • Published Sep 28, 2025 • 47
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models Paper • 2511.23319 • Published Nov 28, 2025 • 24
Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making Paper • 2602.06570 • Published Feb 6 • 61
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability Paper • 2604.06628 • Published Apr 8 • 326
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook Paper • 2604.02029 • Published Apr 2 • 151
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression Paper • 2604.04921 • Published Apr 6 • 114
Why Fine-Tuning Encourages Hallucinations and How to Fix It Paper • 2604.15574 • Published Apr 16 • 25
Hallucinations Undermine Trust; Metacognition is a Way Forward Paper • 2605.01428 • Published May 2 • 24
Efficient Training on Multiple Consumer GPUs with RoundPipe Paper • 2604.27085 • Published Apr 29 • 46
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization Paper • 2604.24952 • Published Apr 27 • 6
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key Paper • 2605.06638 • Published 27 days ago • 15
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing Paper • 2605.00180 • Published Apr 30 • 30
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance Paper • 2605.15012 • Published 20 days ago • 4
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE Paper • 2605.14438 • Published 20 days ago • 5
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs Paper • 2605.20258 • Published 16 days ago • 30
The Unlearnability Phenomenon in RLVR for Language Models Paper • 2605.16787 • Published 18 days ago • 6
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention Paper • 2605.23081 • Published 13 days ago • 41
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth Paper • 2605.25052 • Published 10 days ago • 14
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models Paper • 2605.25189 • Published 10 days ago • 4
How Far Will They Go? Red-Teaming Online Influence with Large Language Models Paper • 2605.22880 • Published 14 days ago • 6
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning Paper • 2605.25604 • Published 9 days ago • 134
Self-Improving Language Models with Bidirectional Evolutionary Search Paper • 2605.28814 • Published 7 days ago • 58
Less is More: Early Stopping Rollout for On-Policy Distillation Paper • 2605.27028 • Published 8 days ago • 13
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention Paper • 2605.29548 • Published 6 days ago • 8
On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters Paper • 2606.02437 • Published 2 days ago • 119