arxiv:2605.00488

Trading off rewards and errors in multi-armed bandits

Published on May 1

Abstract

Multi-armed bandit algorithms can balance accurate estimation of arm means against reward maximization, with regret guarantees that interpolate between the exploration and exploitation objectives.
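
For reference, the two quantities being traded off are usually formalized as follows. These are the standard textbook definitions; the paper's exact error measure is not stated on this page and may differ:

    R_T = T\,\mu^\star - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{A_t}\Big], \qquad \mu^\star = \max_i \mu_i    % cumulative pseudo-regret
    e_T = \max_i \big|\hat{\mu}_{i,T} - \mu_i\big|    % worst-case estimation error of the empirical means at time T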

AI-generated summary

In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that interpolate between the two objectives. We provide both upper and lower bounds and validate the algorithm empirically.
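
As a concrete (if simplistic) illustration of such an interpolation, below is a minimal Python sketch of a decaying epsilon-greedy policy whose single parameter alpha moves it between uniform exploration (best estimation) and near-greedy play (best reward). This is a generic stand-in, not the paper's algorithm, and the arm means and horizon are made up for the demo:

    import numpy as np

    def interpolating_bandit(means, horizon, alpha, rng=None):
        """Decaying epsilon-greedy on Bernoulli arms.

        alpha in [0, 1] trades off estimation and reward:
        alpha = 0 explores uniformly forever (best estimation),
        alpha = 1 decays exploration as 1/t (best reward).
        Illustrative only; not the paper's algorithm.
        """
        rng = np.random.default_rng() if rng is None else rng
        k = len(means)
        pulls = np.zeros(k, dtype=int)   # times each arm was pulled
        sums = np.zeros(k)               # total reward per arm
        total_reward = 0.0
        for t in range(1, horizon + 1):
            eps = min(1.0, t ** (-alpha))            # exploration probability
            if rng.random() < eps:
                arm = int(rng.integers(k))           # uniform exploration
            else:
                arm = int(np.argmax(sums / np.maximum(pulls, 1)))  # greedy
            r = float(rng.random() < means[arm])     # Bernoulli reward
            pulls[arm] += 1
            sums[arm] += r
            total_reward += r
        est = sums / np.maximum(pulls, 1)
        regret = horizon * max(means) - total_reward          # realized regret
        est_error = float(np.max(np.abs(est - np.asarray(means))))
        return regret, est_error

    # Sweeping alpha traces an empirical reward/error tradeoff curve
    # of the kind studied in the paper.
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, interpolating_bandit([0.3, 0.5, 0.7], 10_000, alpha))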

Get this paper in your agent:

hf papers read 2605.00488
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
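
If you would rather fetch the paper's metadata from Python, recent versions of huggingface_hub expose a papers API; the paper_info method below is assumed to be available in your installed version:

    from huggingface_hub import HfApi

    api = HfApi()
    # Look up the paper by its arXiv id; assumes a recent huggingface_hub
    # release that includes the papers endpoint.
    paper = api.paper_info("2605.00488")
    print(paper.title)
    print(paper.summary)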
