It is time to address the problem of notation in the RL literature. I review some of the issues below, and give a more detailed discussion at https://tinyurl.com/SDAnotation/.
Some highlights include:
o We are currently facing a conflict between the standard notation of the RL/MDP community, long used by Sutton and Barto, and the notation of the control theory community adopted by Bertsekas. Both have serious problems (but the style in optimal control is better).
o While the optimal control community uses “x” for state, the math programming community has used “x” for decisions since the 1950s, and that notation is used universally. This is not going away (but it does not mean we have to completely abandon “a” for action).
o “s” is the most natural notation for state, and it has a long history in the dynamic programming literature, from which the RL community adopted it.
o The RL/MDP community likes to use the one-step transition matrix p(s’|s,a), but this is never computable, and it completely hides the exogenous information process and the transition equations, which are what any RL algorithm actually uses. The style used in optimal control is better, but has imperfections.
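To make the last point concrete, here is a minimal Python sketch. It assumes a hypothetical inventory problem that is not taken from any particular source: the state is the inventory level, the action is an order quantity, and the exogenous information is a random demand. The function names, the demand distribution, and the order-up-to policy are all illustrative assumptions. The point is only that the simulation loop samples the exogenous information and applies a transition equation, and never builds or looks up p(s’|s,a).

```python
import random


def transition(state, action, demand):
    """Transition equation: next inventory from current inventory, order, and demand."""
    return max(0, state + action - demand)


def sample_demand(rng):
    """Exogenous information, revealed after the decision (hypothetical: uniform on 0..4)."""
    return rng.randint(0, 4)


def order_up_to_five(state):
    """Hypothetical policy: order enough to bring inventory up to 5."""
    return max(0, 5 - state)


def simulate(state, policy, horizon, rng):
    """Step forward in time: decide, observe exogenous information, apply the transition."""
    path = [state]
    for _ in range(horizon):
        action = policy(state)
        demand = sample_demand(rng)
        state = transition(state, action, demand)
        path.append(state)
    return path


rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate(state=2, policy=order_up_to_five, horizon=10, rng=rng)
print(path)
```

In this sketch the one-step transition matrix is only implied by the demand distribution and the transition equation; nothing in the loop needs it in explicit form.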
I have developed a notation system (an effort that spans decades and numerous discussions with people from different fields) that blends the most popular choices from dynamic programming, math programming, optimal control, and simulation, and that follows modeling conventions popular in applied probability. See https://tinyurl.com/SDAnotation/