Sutton and Barto (2018) present bandit problems and the popular UCB policies in chapter 2, before introducing their notation for dynamic programs in chapter 3. But a bandit problem *is* a dynamic program in which the state variable is the belief about the performance of each arm.
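To make that concrete, here is a minimal sketch (my own illustration, assuming Bernoulli arms with Beta priors, not notation from either book) of a bandit as a belief-state dynamic program: the physical system never changes, but the belief state is updated with every pull.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arms, unknown to the agent

# Belief state: Beta(alpha, beta) posterior parameters for each arm,
# starting from uniform Beta(1, 1) priors. This pair of vectors (2K numbers)
# is the state variable of the dynamic program.
alpha = np.ones(len(true_means))
beta = np.ones(len(true_means))

def pull(arm):
    """Pull one arm, observe a Bernoulli reward, and update the belief state."""
    reward = float(rng.random() < true_means[arm])   # simulated observation
    alpha[arm] += reward                             # Bayesian (Beta) update
    beta[arm] += 1.0 - reward
    return reward
```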
Bandit problems can be formulated using the classical notation of MDPs, and optimal policies can be characterized using Bellman’s equation (see DeGroot (1970)). This formulation, however, is intractable: a K-armed bandit would have 2K continuous state variables. It was Gittins, in 1974, who introduced the idea of an index policy that could be solved one arm at a time (for special cases), although the indices remained computationally difficult.
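Under the same (assumed) Bernoulli/Beta model sketched above, Bellman’s equation can be written directly over the belief state. A sketch for the discounted problem:

$$
V(\alpha,\beta) \;=\; \max_{x=1,\dots,K}\Big[\, \bar p_x\,\big(1 + \gamma\, V(\alpha + e_x,\, \beta)\big) \;+\; (1-\bar p_x)\,\gamma\, V(\alpha,\, \beta + e_x) \,\Big],
\qquad \bar p_x = \frac{\alpha_x}{\alpha_x + \beta_x},
$$

where $e_x$ is the unit vector for arm $x$ and $\gamma$ is a discount factor. The recursion lives on all 2K posterior parameters, which is why the exact formulation blows up so quickly.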
Then, in 1985, Lai and Robbins used the concept of an index policy (from Gittins) to motivate index policies that were easy to compute and still enjoyed theoretical guarantees. These policies were based on the idea of “upper confidence bounds.”
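For readers who have only seen the formula, here is a minimal sketch of a UCB-style index rule in its familiar “sample mean plus uncertainty bonus” form (the bonus constant is my own choice for illustration, not anything prescribed by Lai and Robbins):

```python
import numpy as np

def ucb_choose(counts, sums, t, bonus=2.0):
    """Return the arm with the largest upper confidence bound.

    counts[x] is the number of times arm x has been pulled, sums[x] is the
    total reward it has produced, and t is the total number of pulls so far.
    """
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    if np.any(counts == 0):      # pull every arm once before trusting the index
        return int(np.argmin(counts))
    means = sums / counts                                  # estimated value of each arm
    index = means + np.sqrt(bonus * np.log(t) / counts)    # add an uncertainty bonus
    return int(np.argmax(index))
```

The index is just the estimated mean plus a bonus that shrinks as an arm is pulled more often, which is why it costs almost nothing to compute compared with a Gittins index.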
UCB policies and Gittins indices are examples from two of the four classes of policies. I describe all four classes, in the context of learning problems, in chapter 7 (https://tinyurl.com/RLandSO/):
o PFAs (e.g. excitation policies) in section 7.4.
o CFAs (e.g. UCB policies) in section 7.5.
o VFAs (e.g. Gittins indices) in section 7.6. Also included is an illustration of approximate dynamic programming for a bandit problem.
o DLAs (such as knowledge gradient) in section 7.7.
I routinely meet people from CS who have never heard of Gittins indices. I give an easy introduction to them in section 7.6; readers will see that the indices are not easy to compute, but I then illustrate a version of backward approximate dynamic programming, which might be the first application of ADP to a bandit problem.