PAC Subset Selection in Stochastic Multi-armed Bandits
We consider the problem of selecting, from among n real-valued random
variables, a subset of size m of those with the highest means, based
on efficiently sampling the random variables. This problem, which we
denote Explore-m, finds application in a variety of areas, such as
stochastic optimization, simulation and industrial engineering, and
on-line advertising. The theoretical basis of our work is an extension
of a previous multi-armed bandit formulation devoted to
identifying just the single best of the n random variables
(Explore-1). Under a PAC setting, we provide algorithms for Explore-m
and bound their sample complexity.
Our main contribution is the LUCB algorithm, which, interestingly,
bears a close resemblance to the well-known UCB algorithm for regret
minimization. We derive an expected-sample-complexity bound for LUCB
that is novel even for single-arm selection. We then improve the
problem-dependent constant in this bound through a novel algorithmic
variant called KL-LUCB. Experiments affirm the relative efficiency of
KL-LUCB over other algorithms for Explore-m. Our contributions also
include a lower bound on the worst-case sample complexity of such
algorithms.