Source Themes | Dorian Baudry

Improved learning rates in multi-unit uniform price auctions

Motivated by the strategic participation of electricity producers in electricity day-ahead market, we study the problem of online learning in repeated multi-unit uniform price auctions. The main contribution of this paper is the introduction of a new …

The Value of Reward Lookahead in Reinforcement Learning

In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. …

Lookback Prophet Inequalities

Prophet inequalities are fundamental optimal stopping problems, where a decision-maker observes sequentially items with values sampled independently from known distributions, and must decide at each new observation to either stop and gain the current …

Optimizing the coalition gain in Online Auctions with Greedy Structured Bandits

Motivated by online display advertising, this work considers repeated second-price auctions, where agents sample their value from an unknown distribution with cumulative distribution function F. In each auction t, a decision-maker bound by limited …

A new adaptive identification strategy of best crop management with farmers

We consider a Multi-Armed Bandit problem with covering constraints, where the primary goal is to ensure that each arm receives a minimum expected reward while maximizing the total cumulative reward. In this scenario, the optimal policy then belongs …

Multi-Armed Bandits with Guaranteed Revenue per Arm

A General Recipe for the Analysis of Randomized Multi-Armed Bandit Algorithms

In this paper we propose a general methodology to derive regret bounds for randomized multi-armed bandit algorithms. It consists in checking a set of sufficient conditions on the sampling probability of each arm and on the family of distributions to …

Fast asymptotically optimal algorithms for non-parametric stochastic bandits

We consider the problem of regret minimization in non-parametric stochastic bandits. When the rewards are known to be bounded from above, there exists asymptotically optimal algorithms, with asymptotic regret depending on an infimum of …

Non-parametric algorithms for Multi-Armed Bandits

A Multi-Armed Bandits (MAB) is a learning problem where an agent sequentially chooses an action among a given set of candidates, collects a reward, and implements a strategy in order to maximize her sum of reward. Motivated by a case study in …

Top Two Algorithms Revisited

Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models (Russo, 2016), for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a …