Papers
arxiv:2605.00488
Trading off rewards and errors in multi-armed bandits
Published on May 1
Abstract
Multi-armed bandit algorithms must trade off estimating arm means accurately against maximizing cumulative reward; this paper presents an algorithm whose regret guarantees interpolate between the two objectives.
AI-generated summary
In multi-armed bandits, the most informative arms are the most-explored ones, while a reward-maximizing policy eventually concentrates its pulls on the single best arm. We study the tradeoff between estimating arm means accurately and accumulating reward, and present an algorithm whose regret guarantees interpolate between the two objectives. We provide both upper and lower bounds and validate the approach empirically.
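The paper's actual algorithm is not shown on this page, but a minimal sketch of one natural interpolation, mixing forced uniform exploration (which shrinks estimation error across all arms) with a UCB index (which maximizes reward) via a hypothetical mixing weight w in [0, 1], might look like the following. All names here (interpolating_bandit, w, means, horizon) are illustrative assumptions, not from the paper.

import numpy as np

def interpolating_bandit(means, horizon, w, rng=None):
    """Illustrative sketch (not the paper's algorithm): trade off
    estimation accuracy against reward via a mixing weight w in [0, 1].
    w=1 pulls arms uniformly (best estimates); w=0 runs plain UCB."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    counts = np.zeros(k)       # pulls per arm
    est = np.zeros(k)          # empirical mean per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                      # pull each arm once to initialize
            arm = t - 1
        elif rng.random() < w:          # explore: uniform pull reduces estimation error
            arm = int(rng.integers(k))
        else:                           # exploit: UCB index targets reward
            arm = int(np.argmax(est + np.sqrt(2 * np.log(t) / counts)))
        r = rng.normal(means[arm], 1.0)             # Gaussian reward, unit variance
        counts[arm] += 1
        est[arm] += (r - est[arm]) / counts[arm]    # incremental mean update
        total_reward += r
    regret = horizon * max(means) - total_reward
    mse = float(np.mean((est - np.asarray(means)) ** 2))
    return regret, mse

Sweeping w from 0 to 1 traces out the reward/error tradeoff the abstract describes, e.g. interpolating_bandit([0.9, 0.5, 0.3], horizon=10_000, w=0.2) returns a (regret, estimation error) pair for one point on that curve.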
Get this paper in your agent:
hf papers read 2605.00488
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash