We need to fix the notation in reinforcement learning

Warren Powell

May 6, 2022, 7:33:44 PM
to Reinforcement Learning Mailing List

After my previous post on modeling sequential decision problems (aka RL problems), one follower asked: “Are you using "x" for control/decision here? For RL or DP it is more common to use x/s for state, u for control, and r/c for reward/cost. Furthermore, what do these W variables mean?”


It is time to address the problem of notation in the RL literature. I review some of the issues below; a more detailed discussion is at https://tinyurl.com/SDAnotation/.


Some highlights include:


o We are currently facing a conflict between the standard notation of the RL/MDP community, long used by Sutton and Barto, and the notation of the control theory community adopted by Bertsekas.  Both have serious problems (but the style in optimal control is better).


o While the optimal control community uses “x” for state, the math programming community has used “x” for decisions since the 1950s, and that notation is used universally. This is not going away (but it does not mean we have to completely abandon “a” for action).


o “s” is the most natural notation for state, and has a long history in the dynamic programming literature that was adopted by the RL community.


o The RL/MDP community likes to use the one-step transition matrix p(s’|s,a), but this is never computable, and it completely hides the exogenous information process and the transition equations, which are what any RL algorithm actually uses. The style used in optimal control is better, but has imperfections.
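
To make the last point concrete, here is a minimal sketch of a simulation driven by a transition equation and an exogenous information process rather than by p(s’|s,a). It assumes a toy inventory problem that is not part of the post: the state s is inventory on hand, the decision x is an order quantity, and the exogenous information w is random demand. All function names, the order-up-to policy, the demand range, and the price and cost values are hypothetical choices for illustration only.

```python
import random


def transition(s, x, w):
    """Transition equation: next inventory after ordering x and seeing demand w."""
    return max(0, s + x - w)


def contribution(s, x, w, price=5.0, cost=3.0):
    """One-period contribution: revenue from satisfied demand minus ordering cost."""
    sales = min(s + x, w)
    return price * sales - cost * x


def order_up_to(s, theta=10):
    """Hypothetical policy: order enough to bring inventory up to theta."""
    return max(0, theta - s)


def simulate(policy, s0=0, horizon=20, seed=0):
    """Step forward in time by sampling w; no transition matrix is ever built."""
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(horizon):
        x = policy(s)              # decision made knowing only the state
        w = rng.randint(0, 8)      # exogenous information arrives afterwards
        total += contribution(s, x, w)
        s = transition(s, x, w)    # state update via the transition equation
    return s, total


if __name__ == "__main__":
    final_state, total_contribution = simulate(order_up_to)
    print(final_state, total_contribution)
```

The loop only ever needs a sample of w and the function `transition`; the one-step transition matrix would have to be derived from these two ingredients, not the other way around.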


I have developed a notation system (an effort that spans decades and numerous discussions with people from different fields) that blends the most popular choices of dynamic programming, math programming, optimal control, and simulation, and which follows modeling conventions popular in applied probability. See https://tinyurl.com/SDAnotation/


Warren

------------------------------
Warren B. Powell
Chief Analytics Officer, Optimal Dynamics
Professor Emeritus, Princeton University