LECTURES

10:00 – 12:00 Introduction to multi-armed bandits 

Nicolò Cesa-Bianchi, Università degli Studi di Milano

We will cover the basic results for the simple K-armed bandit model in the stochastic and nonstochastic cases. We will also present some of the key results for the linear case and the convex (Lipschitz) case.
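
As a concrete point of reference for the stochastic K-armed setting covered in this lecture, the sketch below shows the standard UCB1 index policy in Python. It is an illustrative sketch only (not part of the lecture materials); the arm interface and the Bernoulli test arms are assumptions made here for the example.

```python
import math
import random

def ucb1(arms, horizon):
    """Illustrative UCB1 sketch: `arms` is a list of functions, each
    returning a reward in [0, 1] when called; plays for `horizon` rounds."""
    k = len(arms)
    counts = [0] * k          # number of pulls per arm
    means = [0.0] * k         # empirical mean reward per arm
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # play every arm once to initialise the estimates
        else:
            # optimistic index: empirical mean + confidence bonus
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = arms[arm]()
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total

# usage: three Bernoulli arms with unknown means (hypothetical example)
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(ucb1(arms, horizon=10_000))
```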


TALKS

14:00 Bandits for Exploration: Best Arm Identification and Discovery with Probabilistic Experts

Aurélien Garivier, Université Toulouse III - Paul Sabatier

Whereas the achievable limits in terms of regret minimization in simple bandit models are now well known, it is often meaningful to consider slightly different goals and/or slightly different models. In this talk, I first present recent contributions to a better understanding of the complexity of identifying the m-best arms. Generic notions of complexity are introduced and investigated for the two dominant frameworks considered in the literature: fixed-budget and fixed-confidence settings. Then, I present optimal discovery with probabilistic expert advice, a problem arising from security analysis of power systems. While not involving a bandit model, I will show how some basic ideas of the bandit literature still lead to optimal solutions. 
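
To make the fixed-confidence setting mentioned above concrete (for the special case m = 1), here is a minimal successive-elimination sketch in Python. It is a generic textbook illustration under Hoeffding confidence intervals, not the algorithms of the talk; the arm interface and test arms are assumptions made for the example.

```python
import math
import random

def successive_elimination(arms, delta):
    """Fixed-confidence best-arm identification sketch (m = 1).
    `arms` is a list of reward functions with values in [0, 1];
    returns the index of the arm declared best, correct with
    probability at least 1 - delta under Hoeffding-based elimination."""
    k = len(arms)
    active = list(range(k))
    counts = [0] * k
    means = [0.0] * k
    t = 0
    while len(active) > 1:
        t += 1
        for i in active:                       # sample every surviving arm once
            r = arms[i]()
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
        # confidence radius from a union bound over arms and rounds
        radius = math.sqrt(math.log(4 * k * t * t / delta) / (2 * t))
        best = max(active, key=lambda i: means[i])
        # eliminate arms whose upper bound falls below the leader's lower bound
        active = [i for i in active
                  if means[i] + radius >= means[best] - radius]
    return active[0]

# usage: identify the best of three Bernoulli arms with confidence 0.95
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(successive_elimination(arms, delta=0.05))
```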

15:00 Model-free Algorithms for Misspecified and Complex Bandit Models 

Shie Mannor, Technion

The UCB algorithm and its many variants essentially build models for the reward distributions of the arms. These models typically rely on rather strict probabilistic assumptions. However, the actual rewards do not always satisfy such assumptions. We consider two different algorithmic approaches for the multi-armed bandit problem. The first approach, known as Thompson sampling, employs a pseudo prior over the arms. At each round the algorithm samples from the pseudo prior, finds the optimal action for the sampled parameters, and then updates the pseudo prior accordingly. By “pretending” to be Bayesian, Thompson sampling can harness some heavy machinery from Bayesian inference such as particle filters. We show how Thompson sampling can be used for complex bandit problems where the decision maker has to select a complex action (such as a permutation or a subset) and receives rewards that reveal only partial information about the arms' values. The second approach considers the empirical rewards an agent observes rather than building a complete model. The algorithm boils down to resampling from the data and choosing the arm with the highest resampled reward. With this approach, the model may be severely misspecified with little effect on the efficiency of the algorithm.
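
For the simplest case of independent Bernoulli arms, the sample-then-act loop described above can be written as the textbook sketch below, with a Beta pseudo prior per arm. This is an assumption-laden illustration for the basic setting only; it does not use the particle-filter machinery or the complex-action extensions discussed in the talk.

```python
import random

def thompson_sampling(arms, horizon):
    """Thompson sampling sketch for Bernoulli arms with a Beta(1, 1)
    pseudo prior per arm; `arms` is a list of functions returning 0 or 1."""
    k = len(arms)
    alpha = [1.0] * k   # 1 + observed successes for each arm
    beta = [1.0] * k    # 1 + observed failures for each arm
    total = 0.0
    for _ in range(horizon):
        # sample a mean for every arm from its posterior, play the argmax
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        r = arms[arm]()
        alpha[arm] += r          # update the pseudo prior with the observation
        beta[arm] += 1 - r
        total += r
    return total

# usage: three Bernoulli arms with unknown means (hypothetical example)
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(thompson_sampling(arms, horizon=10_000))
```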

16:00 Structured Stochastic Bandits

Alexandre Proutiere, KTH Royal Institute of Technology

This talk surveys recent results for structured stochastic bandit problems. In these problems, the structure captures inherent dependencies between the rewards of the different arms, and should be optimally exploited to speed up the exploration process. We present problem-specific regret lower bounds for problems with various structures (unimodal, Lipschitz and combinatorial), and propose online algorithms that reach or approach these fundamental limits. The results are illustrated using applications ranging from online services to quantum computing.
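
The problem-specific lower bounds mentioned above are typically stated as a semi-infinite optimization in the spirit of the Graves-Lai / change-of-measure arguments. The display below is an illustrative statement of that generic form, with notation chosen here for the example (it is not quoted from the talk).

```latex
% Generic problem-specific regret lower bound for structured bandits
% (Graves--Lai style); notation is illustrative.
\[
  \liminf_{T \to \infty} \frac{R(T)}{\log T} \;\ge\; c(\theta),
  \qquad
  c(\theta) \;=\; \min_{\eta \ge 0}
  \sum_{a \ne a^\star(\theta)} \eta_a \,\bigl(\mu^\star(\theta) - \mu_a(\theta)\bigr)
\]
\[
  \text{subject to}\quad
  \sum_{a} \eta_a \, \mathrm{KL}\bigl(\theta_a, \lambda_a\bigr) \;\ge\; 1
  \quad \text{for all } \lambda \in \Lambda(\theta),
\]
% Here \Lambda(\theta) is the set of structure-consistent parameters whose
% optimal arm differs from a^\star(\theta); the structure (unimodal,
% Lipschitz, combinatorial) enters the bound only through \Lambda(\theta).
```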