I am still seeing people teaching “dynamic programming” (aka reinforcement learning, stochastic optimization, approximate dynamic programming) where they *start* with Bellman’s equation.
PLEASE STOP! STOP! STOP! STOP!
If you want to teach sequential decision problems, then do what we have done for 70 years in deterministic optimization and machine learning:
“Model first, then solve.”
Any sequential decision problem can be written:
State, decision, new information, state, decision, new information, …
Or in notation:
S_0, x_0, W_1, S_1, x_1, W_2, …, S_t, x_t, W_{t+1}, …
When we make a decision, we receive a contribution/reward (or incur a cost) C(S_t, x_t). Decisions are made using methods called "policies," which we represent by X^\pi(S_t) *and which we will design later!!!*. Our goal is to find the policy that performs best, which means our objective function might be written
\max_\pi E\left\{ \sum_{t=0}^T C(S_t, X^\pi(S_t)) \,\middle|\, S_0 \right\}
This is our model. Next we need to model the uncertainties, which can enter through the initial state S_0 (e.g. initial beliefs about quantities or parameters) or through the exogenous information process W_1, W_2, …, W_t, …, which is the information that arrives over time.
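Once the model is written down this way, estimating the objective for any candidate policy is just simulation. Here is a minimal Monte Carlo sketch of that idea; all the function names (`transition`, `contribution`, `sample_W`) are my own placeholders for the components of the model, not notation from the post:

```python
import random

def simulate(policy, S0, T, transition, contribution, sample_W,
             n_reps=1000, seed=0):
    """Estimate E[ sum_{t=0}^T C(S_t, X^pi(S_t)) | S_0 ] by Monte Carlo.

    policy(S)            -> decision x  (this is X^pi)
    transition(S, x, W)  -> next state S_{t+1}
    contribution(S, x)   -> one-period reward C(S, x)
    sample_W(rng)        -> one draw of the exogenous information W
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        S = S0
        for t in range(T + 1):
            x = policy(S)                # decision from the policy
            total += contribution(S, x)  # accumulate C(S_t, x_t)
            if t < T:
                W = sample_W(rng)        # new information W_{t+1}
                S = transition(S, x, W)  # state transition
    return total / n_reps
```

Searching over policies then means searching over the (typically parameterized) family of `policy` functions passed to this simulator.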
To search over policies, it is important to introduce students to *all four classes of policies*. Needless to say, you want to start with the simplest class, which is NOT based on Bellman's equation! Pick a simple application. For example, you can pick an inventory problem and start with a simple policy that can be simulated (and optimized) in a spreadsheet. An example that I used when I was teaching can be downloaded from
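A spreadsheet-style exercise of this kind is easy to reproduce in code. Below is a sketch of one such simple policy for an inventory problem: an order-up-to rule X^\pi(S) = max(0, theta - S), optimized by trying each value of theta. The prices, costs, and uniform demand are my own illustrative assumptions, not the author's downloadable example:

```python
import random

def simulate_inventory(theta, T=50, n_reps=200,
                       price=10.0, cost=6.0, hold=0.5, seed=0):
    """Average total profit of the order-up-to policy with parameter theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        S = 0  # starting inventory
        for t in range(T):
            x = max(0, theta - S)       # the policy: order up to theta
            S += x
            D = rng.randint(0, 10)      # exogenous demand W_{t+1}
            sales = min(S, D)
            total += price * sales - cost * x - hold * (S - sales)
            S -= sales                  # state transition
    return total / n_reps

# "Optimize" the policy by a simple grid search over theta.
best_theta = max(range(0, 21), key=simulate_inventory)
```

This is exactly the "model first, then solve" workflow: the policy class (order-up-to) is chosen first, and tuning theta by simulation is the search over policies.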