Re: [rl-list] Digest for rl-list@googlegroups.com - 2 updates in 2 topics


Marcus Hutter

Jun 12, 2022, 8:40:54 AM
to wbpow...@gmail.com, rl-...@googlegroups.com

Hi Warren,

Here is a book that perfectly fits your paradigm "every reinforcement learning problem is a sequential decision problem" but is missing from your resource page:

"Universal artificial intelligence:
Sequential decisions based on algorithmic probability"
https://link.springer.com/book/10.1007/b138233

Kind regards,

Marcus
______________________
Marcus Hutter,
Australian National University (Honorary Professor)
DeepMind, London (Senior Researcher)
http://www.hutter1.net/


On 6/12/22 11:58, rl-...@googlegroups.com wrote:
Warren Powell <wbpow...@gmail.com>: Jun 11 08:30PM -0400

I have created a resource page of my “greatest hits” of information on
sequential decision problems at:
 
https://tinyurl.com/sdalinks
 
It includes links with information on the following topics:
 
o Introduction to the fields of sequential decision problems
 
o Books
 
o Courses and teaching materials
 
o Short notes about sequential decision analytics
 
o Some video introductions
 
I provide both the tinyurl links (please use these when you can, because it helps me know what is popular) and the original links. If you cannot use tinyurl links, you can access the page at
 
https://castlelab.princeton.edu/sdalinks
 
Please bookmark this page. I will continue to post additional material
here (and drop less popular material).
 
 
Remember - every reinforcement learning problem is a sequential decision problem, and every RL method falls into one of the four classes of policies.
 
Warren
 
------------------------------
Warren B. Powell
Chief Analytics Officer, Optimal Dynamics
Professor Emeritus, Princeton University
http://www.castlelab.princeton.edu
Warren Powell <wbpow...@gmail.com>: Jun 11 03:32PM -0400

I continue to be surprised at the handling of problems that involve learning, captured in the form of belief states. The simplest of these problems are known as multiarmed bandit problems, where a decision combines making a choice (pulling an arm) with learning from that choice.
 
Sutton and Barto (2018) present bandit problems and the popular UCB
policies in chapter 2, before they present their notation for a dynamic
program (in chapter 3). But a bandit problem *is* a dynamic program where
the state variable is the belief about the performance of each arm.
 
Bandit problems can be formulated using the classical notation of MDPs, and
optimal policies can be characterized using Bellman’s equation (see DeGroot
(1970)). The original formulation of Bellman’s equation was intractable (a K-armed bandit would have 2K continuous state variables, two per arm to describe the belief about that arm). It was Gittins in 1974 who introduced the idea of an index policy that could actually be computed (for special cases), although it was computationally difficult.
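
To make the "2K continuous state variables" point concrete, here is a minimal sketch (my own illustration, assuming independent normal beliefs with known observation noise; the class name is just for this example) of the belief state as the state of the dynamic program:

import numpy as np

# Belief state for a K-armed bandit with independent normal beliefs and known
# observation noise.  The dynamic program lives on this 2K-dimensional state
# (a posterior mean and precision for each arm), not on a physical state.

class NormalBeliefState:
    def __init__(self, K, prior_mean=0.0, prior_precision=1e-3, noise_precision=1.0):
        self.mu = np.full(K, prior_mean)         # posterior mean for each arm
        self.beta = np.full(K, prior_precision)  # posterior precision for each arm
        self.beta_W = noise_precision            # precision of a single observation

    def update(self, arm, reward):
        # Standard normal-normal Bayesian update: choosing an arm and observing
        # its reward transitions the belief state - this is the learning.
        self.mu[arm] = (self.beta[arm] * self.mu[arm] + self.beta_W * reward) / \
                       (self.beta[arm] + self.beta_W)
        self.beta[arm] = self.beta[arm] + self.beta_W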
 
Then it was in 1985 that Lai and Robbins used the concept of an index
policy (from Gittins) to motivate index policies that were easy to compute,
and still enjoyed theoretical guarantees. These index policies were based
on the idea of “upper confidence bounds.”
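
As a minimal sketch of an index policy of this type (this is the textbook UCB1 form rather than the original Lai-Robbins construction, and the function names are just for illustration):

import numpy as np

# UCB1-style index policy: the index of each arm is its sample mean plus an
# "upper confidence" bonus that shrinks as the arm is sampled more often.

def ucb_index(mean_hat, counts, t, theta=np.sqrt(2.0)):
    return mean_hat + theta * np.sqrt(np.log(t) / counts)

def choose_arm(mean_hat, counts, t):
    untried = np.where(counts == 0)[0]
    if untried.size > 0:      # sample every arm once before trusting the index
        return int(untried[0])
    return int(np.argmax(ucb_index(mean_hat, counts, t)))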
 
UCB policies and Gittins indices are examples of two of the four classes of
policies. I describe each of the four classes of policies in the context
of learning problems in chapter 7 (https://tinyurl.com/RLandSO/):
 
o PFAs (e.g. excitation policies) in section 7.4.
 
o CFAs (e.g. UCB policies) in section 7.5.
 
o VFAs (e.g. Gittins indices) in section 7.6. Also included is an
illustration of approximate dynamic programming for a bandit problem.
 
o DLAs (such as knowledge gradient) in section 7.7.
 
I routinely meet people from CS who have never heard of Gittins indices. I
have an easy introduction to Gittins indices in section 7.6. Readers will
see that Gittins indices are not easy to compute, but I then illustrate a
version of backward ADP, which might be the first application of ADP to a
bandit problem.
 
There is a wide range of problems that require active learning. The CS/RL
community tends to focus almost exclusively on UCB-based policies, but
there is a reason that other examples of policies (especially PFAs and lookaheads such as knowledge gradient) have been developed. I think we should be
introducing students to all four classes, with special emphasis on CFAs
(such as UCB), DLAs (such as knowledge gradient), and PFAs. There are
reasons to motivate VFA-based policies, but this class of strategies for
learning problems requires more research.
 
------------------------------
Warren B. Powell
Chief Analytics Officer, Optimal Dynamics
Professor Emeritus, Princeton University
http://www.castlelab.princeton.edu

Warren Powell

Jun 12, 2022, 8:01:37 PM
to Marcus Hutter, Reinforcement Learning Mailing List
Marcus,

Thanks for sharing this - it looks like a lot of careful thinking went into your book!

I think your framework is quite different from mine.  My framework does not compete with the vast literature from RL (or stochastic programming, optimal control, stochastic search, simulation-optimization, decision trees, bandit problems, etc etc etc) - it brings everything in these fields together.  I show a single way of modeling any sequential decision problem, and claim (this is not a proof) that *any* method that might be used to design a policy for making decisions falls into one of four classes of policies (or possibly a hybrid of two or more).  It includes whatever is currently being used in practice.

You note that "The major drawback of the AIXI framework is that it is incomputable". My framework is inherently computable, since a starting point is whatever someone (or some company) might be doing now in the field. However, it also formalizes many ad hoc approaches, such as a class of policies I call "cost function approximations" (CFAs), which are parameterized optimization problems. UCB policies are a very simple class of CFA; CFAs are also used to schedule airlines and plan energy generation for power grids. I can often take an ad hoc approach that is being used in the field, formalize it, and describe an implementable path to making the ad hoc approach better (typically by tuning parameters that were previously set at some fixed value).
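
To illustrate what I mean by tuning (a sketch under simple assumptions - a simulated Bernoulli bandit and a grid search; all names here are illustrative, not code from the book): treat the UCB bonus coefficient as a tunable parameter theta and choose it by simulating the policy.

import numpy as np

# A UCB policy viewed as a cost function approximation (CFA): the bonus
# coefficient theta is a tunable parameter, chosen by simulating cumulative
# reward rather than fixed at a theoretical value.

def run_ucb(true_means, theta, horizon, rng):
    K = len(true_means)
    counts = np.zeros(K)
    totals = np.zeros(K)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if np.any(counts == 0):                  # try every arm once
            arm = int(np.argmin(counts))
        else:
            index = totals / counts + theta * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(index))
        r = rng.binomial(1, true_means[arm])     # simulated reward
        counts[arm] += 1
        totals[arm] += r
        total_reward += r
    return total_reward

def tune_theta(true_means, thetas, horizon=1000, n_reps=50, seed=0):
    rng = np.random.default_rng(seed)
    avg = [np.mean([run_ucb(true_means, th, horizon, rng) for _ in range(n_reps)])
           for th in thetas]
    return thetas[int(np.argmax(avg))]

# Example: tune_theta([0.3, 0.5, 0.6], thetas=[0.1, 0.5, 1.0, 2.0])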

I offer a number of other insights, such as differentiating pure learning problems from general dynamic problems (which may or may not include a belief state); different objectives such as final and cumulative reward; and the use of different uncertainty operators (such as expectations versus risk measures). I think my treatment of lookahead policies offers some fresh ideas, and I show a new way of thinking about learning problems using a multiagent framework, which fixes what I believe is a major error being made by the POMDP community.
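
To make the final-versus-cumulative distinction concrete (generic notation, a sketch rather than a quote from the book): if $x^{\pi,n}$ is the decision produced by policy $\pi$ after $n$ observations and $F(x,W)$ is the reward, then

$$\text{cumulative reward: } \max_\pi \mathbb{E}\Big[\sum_{n=0}^{N-1} F(x^{\pi,n}, W^{n+1})\Big], \qquad \text{final reward: } \max_\pi \mathbb{E}\big[F(x^{\pi,N}, \widehat{W})\big],$$

where cumulative reward pays for the rewards earned while learning, while final reward only cares about the quality of the design $x^{\pi,N}$ chosen at the end, evaluated on a fresh realization $\widehat{W}$.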

I think that anyone working on some form of sequential decision problem will find a place in my framework to represent their methods (and that should apply to your framework as well).  

Note that in my book, Bellman's equation is the basis of just one of the four classes of policies.  And while I treat policies based on value functions in considerable depth, I actually think it is likely to be the method that is least used in practice.  I do not present Bellman in any depth until chapter 14 (but then I spend 5 chapters on it, so it is not as if I have overlooked it).

Final note - I have taught this material to undergraduates, as well as graduate students from very applied domains.  I think sequential decision problems should be taught to everyone!  

Warren



------------------------------
Warren B. Powell
Chief Analytics Officer, Optimal Dynamics
Professor Emeritus, Princeton University
