The main difference between Q-learning and SARSA is that the former
learns off-policy, while the latter learns on-policy. This basically
means that, at every iteration, Q-learning aims to estimate the
state-action value function (i.e., Q-factor) for the optimal policy;
while SARSA learns the value function for the behavior policy that
it is actually following.
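Just to make the distinction concrete, here is a minimal sketch (in Python, not taken from any particular source) of the two tabular updates, assuming Q is a numpy array of shape (n_states, n_actions): Q-learning bootstraps from the greedy action at the next state, while SARSA bootstraps from the action the behavior policy actually takes.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy target: bootstrap from the greedy (max) action in s_next,
        # no matter which action the behavior policy will actually take there.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy target: bootstrap from the action a_next that the behavior
        # policy actually selected in s_next.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])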
Since Q-learning is an off-policy algorithm, in principle, it can
learn from whatever behavior policy we use, even if it is very
different from the optimal policy. From that point of view,
Q-learning is better equipped for exploration.
However, suppose the Q-learning agent follows a random behavior policy
with a uniform distribution over the actions all the time, and suppose
that, in order to reach the states with higher rewards, the agent must
follow a specific long sequence of actions. In this case, it is
unlikely that the agent will find that long sequence just by chance.
Hence, it is also unlikely that the agent will visit the states with
the highest rewards and, therefore, the estimate of the optimal
Q-factor will have very high variance.
In order for the agent to visit those states, it has to follow a
near-optimal behavior policy while still being allowed to visit every
other state. That is the reason why epsilon-greedy policies are
widely used with Q-learning.
Now, as you pointed out, SARSA learns the value function for the
actual behavior the agent is following. However, the SARSA agent also
needs to visit every state-action pair infinitely often (and become
greedy in the limit) in order to guarantee convergence to the optimal
value function. That is the reason why epsilon-greedy is also a very
popular choice for SARSA.
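Just as an illustration (a generic sketch, not from any specific implementation), an epsilon-greedy behavior policy over a tabular Q, usable by both algorithms, can look like this:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
        # With probability epsilon pick a uniformly random action (exploration),
        # otherwise pick the current greedy action (exploitation).
        if rng is None:
            rng = np.random.default_rng()
        n_actions = Q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

Setting epsilon close to 1 makes the agent behave like the uniform policy discussed above, while decaying it towards 0 makes the behavior policy approach the greedy one.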
My point is that: i) both algorithms need to handle the exploration
vs. exploitation tradeoff, and ii) they do so in a similar manner. On
the other hand, it is true that if exploration is costly, then,
depending on the epsilon parameter, SARSA will probably learn to be
much more careful than Q-learning. This is easily seen in the
cliff-walking domain (see, e.g., the animations at
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
).
Regarding your questions:
As you already remarked, in the paper you mentioned, since the agent
uses a soft policy, it has some exploratory capability by default.
However, I would say the use of a soft policy is more a mathematical
convenience than something motivated by enhancing exploration.
My point is that the authors use a linear approximation of the value
function, and the features and parameters are tied to a specific
parametric form of the policy. With this combination of value and
policy features, they can obtain an explicit form for the gradient,
so that the actor can perform gradient ascent in policy space.
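To illustrate the kind of structure I mean (this is a generic sketch of a softmax policy with linear preferences, not the exact construction in the paper), the score function the actor needs for gradient ascent comes out in closed form:

    import numpy as np

    def softmax_policy(theta, phi):
        # phi: (n_actions, d) feature matrix for the current state,
        # theta: (d,) policy parameters. Returns the action probabilities.
        prefs = phi @ theta
        prefs -= prefs.max()            # for numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def score(theta, phi, a):
        # Gradient of log pi(a | s; theta) for the softmax policy:
        # phi(s, a) - sum_b pi(b | s) * phi(s, b).
        pi = softmax_policy(theta, phi)
        return phi[a] - pi @ phi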
In short, the soft policy will induce some randomness and works well
in practice, but I think it is up to you to increase the exploratory
capability of the agent if your application requires it.
You are completely right. As a gradient-based method, policy gradient
will converge to a local optimum. You can repeat the algorithm
multiple times from different initial conditions and keep the best
result (see the sketch below). You can also do cooler things, like
combining the gradient with a bandit-based global optimization
algorithm, as in this paper:
Or you can implement other kinds of black-box policy search methods,
like:
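Going back to the simple restart idea above, a minimal sketch could look like this, where train_policy_gradient and evaluate_return are hypothetical placeholders for your own training loop and return estimator:

    import numpy as np

    def random_restarts(train_policy_gradient, evaluate_return, dim,
                        n_restarts=10, seed=0):
        # train_policy_gradient and evaluate_return are placeholders for your
        # own policy-gradient routine and (Monte Carlo) return estimator.
        rng = np.random.default_rng(seed)
        best_theta, best_return = None, -np.inf
        for _ in range(n_restarts):
            theta0 = rng.normal(size=dim)           # random initial parameters
            theta = train_policy_gradient(theta0)   # converges to a local optimum
            ret = evaluate_return(theta)            # estimate the expected return
            if ret > best_return:
                best_theta, best_return = theta, ret
        return best_theta, best_return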
By the way, the paper by Konda and Tsitsiklis that you mentioned is
great. I would just mention that when I studied actor-critic methods,
I also found these other papers useful:
Hope this helps :-)
Best!
Sergio