Non-parametric algorithms for Multi-Armed Bandits

Abstract

A Multi-Armed Bandit (MAB) is a learning problem where an agent sequentially chooses an action among a given set of candidates, collects a reward, and implements a strategy to maximize her sum of rewards. Motivated by a case study in agriculture, in this thesis we tackle several problems relevant to real-world applications of MAB. The first central question we consider concerns the assumptions made on the reward distributions. While in theory it is usually convenient to make simple parametric assumptions (e.g., Gaussian distributions), practitioners may have difficulty finding the right model for their problem. For this reason, we analyze two families of non-parametric algorithms, in the sense that their implementation does not require strong parametric assumptions on the distributions. We show that these two approaches achieve strong theoretical guarantees in the standard bandit setting while requiring less prior knowledge from the learner than previous algorithms. We then analyze extensions of these algorithms that make them more suitable for real-world applications. A second focus of our work is alternative performance metrics that may suit the practitioner better than the expected sum of rewards. We propose a risk-aware algorithm for a bandit problem where the learner wants to find a safe arm according to a risk metric: the Conditional Value-at-Risk (CVaR). We also propose efficient algorithms for a problem analogous to the limit case of this setting, known as Extreme Bandits. Finally, we adapt some of our approaches to standard variants of MAB, including one with non-stationary rewards and one with feedback grouped into batches of observations.
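To make the setting concrete, here is a minimal sketch of the MAB interaction loop described above. It uses an epsilon-greedy learner and Gaussian rewards purely as placeholder assumptions; it is not one of the non-parametric algorithms studied in the thesis, and all names and parameters are illustrative.

```python
import random

def run_bandit(means, horizon=1000, epsilon=0.1, seed=0):
    """Illustrative MAB loop: pick an arm, collect a reward, update estimates.

    Assumptions (not from the thesis): Gaussian rewards with unit variance
    and a simple epsilon-greedy strategy.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms       # number of pulls per arm
    estimates = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Explore a random arm with probability epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(means[arm], 1.0)  # collect a noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, estimates

if __name__ == "__main__":
    total, est = run_bandit([0.2, 0.5, 0.8])
    print(f"total reward: {total:.1f}, estimates: {[round(e, 2) for e in est]}")
```

The non-parametric algorithms analyzed in the thesis replace the epsilon-greedy arm-selection rule with strategies that avoid parametric assumptions on the reward distributions.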

Type
PhD Thesis