The main difference between Q-learning and SARSA is that the former
learns off-policy, while the latter learns on-policy. This basically
means that, at every iteration, Q-learning aims to estimate the
state-action value function (i.e., Q-factor) for the optimal policy;
while SARSA learns the value function for the behavior policy that
it is actually following.
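Just to make the distinction concrete, here is a minimal sketch (in Python, not taken from any particular source) of the two tabular updates, assuming Q is a numpy array of shape (n_states, n_actions): Q-learning bootstraps from the greedy action at the next state, while SARSA bootstraps from the action the behavior policy actually takes.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy target: bootstrap from the greedy (max) action in s_next,
        # no matter which action the behavior policy will actually take there.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy target: bootstrap from the action a_next that the behavior
        # policy actually selected in s_next.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])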
Since Q-learning is an off-policy algorithm, in principle, it can
learn from whatever behavior policy we use, even if it is very
different from the optimal policy. From that point of view,
Q-learning is better equipped for exploration.
However, suppose the Q-learning agent follows a random behavior policy
with a uniform distribution over the actions all the time, and suppose
that, in order to reach the states with higher rewards, the agent must
follow a specific long sequence of actions. In this case, it is
unlikely that the agent will find that long sequence just by chance.
Hence, it is also unlikely that the agent will visit the states with
the highest rewards and, therefore, the estimate of the optimal
Q-factor will have very high variance.
In order for the agent to visit those states, it has to follow a
near-optimal behavior policy while still being allowed to visit every
other state. That is the reason why epsilon-greedy policies are
widely used with Q-learning.
Now, as you pointed out, SARSA learns the value function for the
actual behavior the agent is following. However, the SARSA agent also
needs to visit every state-action pair infinitely often (and become
greedy in the limit) in order to guarantee convergence to the optimal
value function. That is the reason why epsilon-greedy is also a very
popular choice for SARSA.
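Just as an illustration (a generic sketch, not from any specific implementation), an epsilon-greedy behavior policy over a tabular Q, usable by both algorithms, can look like this:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
        # With probability epsilon pick a uniformly random action (exploration),
        # otherwise pick the current greedy action (exploitation).
        if rng is None:
            rng = np.random.default_rng()
        n_actions = Q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

Setting epsilon close to 1 makes the agent behave like the uniform policy discussed above, while decaying it towards 0 makes the behavior policy approach the greedy one.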
My point is that: i) both algorithms need to handle the exploration
vs. exploitation tradeoff, and ii) they do so in a similar manner. On
the other hand, it is true that if exploration is costly, then,
depending on the epsilon parameter, SARSA will probably learn to be
much more careful than Q-learning. This is easily seen in the
cliff-walking domain (see, e.g., the animations at
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
).
Regarding your questions:
As you already remarked, in the paper you mentioned, since the agent
uses a soft policy, it has some exploratory capability by default.
However, I would say the use of a soft policy is more a mathematical
convenience than something motivated by enhancing exploration.
My point is that the authors use a linear approximation of the value
function, and the features and parameters are tied to a specific
parametric form of the policy. With this combination of value and
policy features, they can obtain an explicit form for the gradient,
so that the actor can perform gradient ascent in policy space.
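To illustrate the kind of structure I mean (this is a generic sketch of a softmax policy with linear preferences, not the exact construction in the paper), the score function the actor needs for gradient ascent comes out in closed form:

    import numpy as np

    def softmax_policy(theta, phi):
        # phi: (n_actions, d) feature matrix for the current state,
        # theta: (d,) policy parameters. Returns the action probabilities.
        prefs = phi @ theta
        prefs -= prefs.max()            # for numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def score(theta, phi, a):
        # Gradient of log pi(a | s; theta) for the softmax policy:
        # phi(s, a) - sum_b pi(b | s) * phi(s, b).
        pi = softmax_policy(theta, phi)
        return phi[a] - pi @ phi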
In short, the soft policy will induce some randomness and works well
in practice, but I think it is up to you to increase the exploratory
capability of the agent if your application requires it.
You are completely right. As a gradient-based method, policy gradient
will converge to a local optimum. You can repeat the algorithm
multiple times from different initial conditions and keep the best
result (see the sketch below). You can also do cooler things, like
combining the gradient with a bandit-based global optimization
algorithm, as in this paper:
Or you can implement other kinds of black-box policy search methods,
like:
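Going back to the simple restart idea above, a minimal sketch could look like this, where train_policy_gradient and evaluate_return are hypothetical placeholders for your own training loop and return estimator:

    import numpy as np

    def random_restarts(train_policy_gradient, evaluate_return, dim,
                        n_restarts=10, seed=0):
        # train_policy_gradient and evaluate_return are placeholders for your
        # own policy-gradient routine and (Monte Carlo) return estimator.
        rng = np.random.default_rng(seed)
        best_theta, best_return = None, -np.inf
        for _ in range(n_restarts):
            theta0 = rng.normal(size=dim)           # random initial parameters
            theta = train_policy_gradient(theta0)   # converges to a local optimum
            ret = evaluate_return(theta)            # estimate the expected return
            if ret > best_return:
                best_theta, best_return = theta, ret
        return best_theta, best_return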
By the way, the paper by Konda and Tsitsiklis that you mentioned is
great. I would just mention that when I studied actor-critic methods,
I also found these other papers useful:
Hope this helps :-)
Best!
Sergio