Sutton and Barto (2018) present bandit problems and the popular UCB policies in chapter 2, before introducing their notation for dynamic programs in chapter 3. But a bandit problem *is* a dynamic program in which the state variable is the belief about the performance of each arm.
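To make that concrete, here is a minimal sketch (my own illustration, assuming Bernoulli arms with Beta priors, not notation from either book) of a bandit as a belief-state dynamic program: the physical system never changes, but the belief state is updated with every pull.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arms, unknown to the agent

# Belief state: Beta(alpha, beta) posterior parameters for each arm,
# starting from uniform Beta(1, 1) priors. This pair of vectors (2K numbers)
# is the state variable of the dynamic program.
alpha = np.ones(len(true_means))
beta = np.ones(len(true_means))

def pull(arm):
    """Pull one arm, observe a Bernoulli reward, and update the belief state."""
    reward = float(rng.random() < true_means[arm])   # simulated observation
    alpha[arm] += reward                             # Bayesian (Beta) update
    beta[arm] += 1.0 - reward
    return reward
```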
Bandit problems can be formulated using the classical notation of MDPs, and optimal policies can be characterized using Bellman’s equation (see DeGroot (1970)). This formulation, however, is intractable: a K-armed bandit would have 2K continuous state variables. It was Gittins, in 1974, who introduced the idea of an index policy that could be solved one arm at a time (for special cases), although the indices remained computationally difficult.
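Under the same (assumed) Bernoulli/Beta model sketched above, Bellman’s equation can be written directly over the belief state. A sketch for the discounted problem:

$$
V(\alpha,\beta) \;=\; \max_{x=1,\dots,K}\Big[\, \bar p_x\,\big(1 + \gamma\, V(\alpha + e_x,\, \beta)\big) \;+\; (1-\bar p_x)\,\gamma\, V(\alpha,\, \beta + e_x) \,\Big],
\qquad \bar p_x = \frac{\alpha_x}{\alpha_x + \beta_x},
$$

where $e_x$ is the unit vector for arm $x$ and $\gamma$ is a discount factor. The recursion lives on all 2K posterior parameters, which is why the exact formulation blows up so quickly.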
Then, in 1985, Lai and Robbins used the concept of an index policy (from Gittins) to motivate index policies that were easy to compute and still enjoyed theoretical guarantees. These policies were based on the idea of “upper confidence bounds.”
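For readers who have only seen the formula, here is a minimal sketch of a UCB-style index rule in its familiar “sample mean plus uncertainty bonus” form (the bonus constant is my own choice for illustration, not anything prescribed by Lai and Robbins):

```python
import numpy as np

def ucb_choose(counts, sums, t, bonus=2.0):
    """Return the arm with the largest upper confidence bound.

    counts[x] is the number of times arm x has been pulled, sums[x] is the
    total reward it has produced, and t is the total number of pulls so far.
    """
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    if np.any(counts == 0):      # pull every arm once before trusting the index
        return int(np.argmin(counts))
    means = sums / counts                                  # estimated value of each arm
    index = means + np.sqrt(bonus * np.log(t) / counts)    # add an uncertainty bonus
    return int(np.argmax(index))
```

The index is just the estimated mean plus a bonus that shrinks as an arm is pulled more often, which is why it costs almost nothing to compute compared with a Gittins index.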
UCB policies and Gittins indices are examples from two of the four classes of policies. I describe all four classes, in the context of learning problems, in chapter 7 (https://tinyurl.com/RLandSO/):
o PFAs (e.g. excitation policies) in section 7.4.
o CFAs (e.g. UCB policies) in section 7.5.
o VFAs (e.g. Gittins indices) in section 7.6. Also included is an illustration of approximate dynamic programming for a bandit problem.
o DLAs (such as knowledge gradient) in section 7.7.
I routinely meet people from CS who have never heard of Gittins indices. I give an easy introduction to them in section 7.6; readers will see that the indices are not easy to compute, but I then illustrate a version of backward approximate dynamic programming, which might be the first application of ADP to a bandit problem.