Teaching sequential decision problems

32 views
Skip to first unread message

Warren Powell

unread,
May 5, 2022, 6:10:26 PMMay 5
to Reinforcement Learning Mailing List
I have been sending a series of posts on the theme of "sequential decision problems" (all RL problems are sequential decision problems, as are all RL algorithms).  One seemed to strike a chord (it hit 86,000 views after one day) so I thought I would share it with the RL list.

I will send a few followup emails related to this topic.  But if you would like to see my various posts on sequential decision problems (I have sent out close to 50 since January) please follow me on LinkedIn at linkedin.com/in/warrenbpowell

Warren


===============

I am still seeing people teaching “dynamic programming” (aka reinforcement learning, stochastic optimization, approximate dynamic programming) where they *start* with Bellman’s equation.


PLEASE STOP!  STOP! STOP! STOP!


If you want to teach sequential decision problems, then do what we have done for 70 years in deterministic optimization and machine learning:


“Model first, then solve.”


Any sequential decision problem can be written:


State, decision, new information, state, decision, new information, …


Or in notation:


S_0, x_0, W_1, S_1, x_1, W_2, …, S_t, x_t, W_{t+1}, …


When we make a decision, we receive a contribution/reward (or incur a cost) C(S_t,x_t). Decisions are made using methods called “policies” that we represent by X^\pi(S_t) *which we will design later!!!*.  Our goal is to find a policy that works the best, which means our objective function might be written


\max_\pi E {\sum_{t=0}^T C(S_t,X^\pi(S_t))|S_0}


This is our model.  Next we need to model uncertainties that can be contained in the initial state S_0 (e.g. initial beliefs about quantities or parameters), or the exogenous information process W_1, W_2, …, W_t, … which is information that arrives over time.


To search over policies, it is important to introduce students to *all four classes*.  Needless to say, you want to start with the simplest class, which is NOT Bellman’s equation!  You need to pick a simple application.  For example, you can pick an inventory problem, and start with a simple policy that can be simulated (and optimized) in a spreadsheet. An example that I used when I was teaching can be downloaded from


https://tinyurl.com/energystorageoptimization/
================


------------------------------
Warren B. Powell
Chief Analytics Officer, Optimal Dynamics
Professor Emeritus, Princeton University
Reply all
Reply to author
Forward
0 new messages