[rl-list] Problem with Local Minima in Policy-Gradient Methods


Mohammad Gheshlaghi Azar

May 10, 2010, 10:17:54 AM5/10/10
to rl-...@googlegroups.com
Hi,

Recently, I tried to implement both NAC-LSTD (Peters et al., 2008) and CONJ-GPOMDP (Baxter and Bartlett, 2001) on the well-known mountain-car problem, using a soft-max distribution for the policy, i.e. \pi(s,a) \propto \exp(\theta^T \phi(s,a)), where \pi is the policy distribution and \theta is a vector of adjustable parameters (a small sketch of this parametrization is included after my questions below). I observed that the asymptotic behavior of both methods depends strongly on the initialization of the parametrized policy and on the choice of basis functions. My guess is that this happens because the cost function is non-convex in \theta, so the above methods get stuck in local minima, but I am not sure about that. Can anybody help me with the following questions?

1. What is the cause of the sensitivity of policy-gradient methods to initialization and to the choice of basis functions? Is it a fundamental problem of policy-gradient methods, or does it only happen for some specific policy structures and policy-gradient algorithms?

2. If policy-gradient methods do indeed suffer from local minima, what is the state-of-the-art solution to this problem?
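
For concreteness, here is a minimal Python sketch of the soft-max parametrization I use (only an illustration: phi stands for whatever routine computes the feature vector \phi(s,a) as a list, and theta is a list of the same length):

import math

def softmax_policy(theta, phi, s, actions):
    # pi(s,a) proportional to exp(theta^T phi(s,a))
    prefs = [ sum( t*f for t, f in zip(theta, phi(s, a)) ) for a in actions ]
    m = max(prefs)                              # subtract the max for numerical stability
    exps = [ math.exp(p - m) for p in prefs ]
    z = sum(exps)
    return [ e/z for e in exps ]                # probabilities, in the order of `actions'

def log_policy_gradient(theta, phi, s, a, actions):
    # grad_theta log pi(s,a) = phi(s,a) - sum_b pi(s,b) phi(s,b);
    # this is the score function that policy-gradient methods build their updates from
    pi = softmax_policy(theta, phi, s, actions)
    avg = [ sum( pi[i]*phi(s, b)[k] for i, b in enumerate(actions) ) for k in range(len(theta)) ]
    return [ phi(s, a)[k] - avg[k] for k in range(len(theta)) ]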

Thanks,
Mohammad Gheshlaghi Azar,
PhD student,
University of Nijmegen, Netherlands.
web: www.mbfys.ru.nl/~mazar



--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en

"José Antonio Martín H."

May 10, 2010, 10:59:02 AM5/10/10
to rl-...@googlegroups.com
Hi all.

Does anybody know whether there are any public pseudocode versions of the new family of gradient-descent TD methods introduced in the following papers?


Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D.,
Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for
temporal-difference learning with linear function approximation. In
Proceedings of the 26th International Conference on Machine Learning,
Montreal, Canada.

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D.,
Sutton, R. S. (2009). Convergent Temporal-Difference Learning with
Arbitrary Smooth Function Approximation. In Advances in Neural
Information Processing Systems 22, Vancouver, BC. December 2009. MIT Press.

Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm
for temporal-difference prediction learning with eligibility traces. In
Proceedings of the Third Conference on Artificial General Intelligence,
Lugano, Switzerland.


Thanks...


--
======================================================================
José Antonio Martín H. PhD.
Departamento de Sistemas Informáticos y Computación
Facultad de Informática
Universidad Complutense de Madrid
C/ Profesor José García Santesmases s/n. 28040 Madrid
http://www.dia.fi.upm.es/~jamartin/
mailto://jama...@fdi.ucm.es
"El orden es el recurso no renovable más importante"
"Order is the truly nonrenewable resource."
======================================================================

Hado van Hasselt

May 10, 2010, 11:37:09 AM5/10/10
to rl-...@googlegroups.com
Dear Jose,

I have made Python implementations (which almost read as pseudocode) of the TD algorithms TDC, GTD and GTD2 that you are welcome to use. Of course, these come with no guarantee of correctness, so if any of the original authors reply, I would go with their version :) Ordinary TD(0) is also included, which can serve as a reference. I hope this is helpful to you.

The code assumes state-feature vectors (st is the feature vector of the present state, st_ of the next state), (possibly state-dependent, possibly decreasing) learning rates (alpha and beta), and of course some encapsulating program that calls these methods. The `getAction' method is assumed to also handle exploration and the like, if needed. If you have any questions, I'd be glad to help.
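
For concreteness, the linear TDC update in this notation comes down to something like the following (rt is the reward, P the number of features, theta the value-function weights and w the auxiliary weights; this is only a rough sketch, the attached file is the version to use):

delta = rt + gamma*sum([ theta[p]*st_[p] for p in range( P ) ]) - sum([ theta[p]*st[p] for p in range( P ) ])   # TD error
sw = sum([ st[p]*w[p] for p in range( P ) ])                                        # st^T w
theta = [ theta[p] + alpha*( delta*st[p] - gamma*st_[p]*sw ) for p in range( P ) ]  # value weights
w = [ w[p] + beta*( delta - sw )*st[p] for p in range( P ) ]                        # auxiliary weights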

As a side note, I could not completely reproduce the results from the papers. The results seemed off by a constant factor or something similar, so I assume this was a flaw in my implementation of the MDPs, not in the implementation of the algorithms. However, if someone spots a flaw in my code, I'd be very glad to hear about it.

Regards,
Hado

2010/5/10 "José Antonio Martín H." <jama...@fdi.ucm.es>:
TD.py

Shivaram Kalyanakrishnan

May 10, 2010, 3:00:11 PM5/10/10
to rl-...@googlegroups.com
Mohammad,

A policy gradient method essentially starts at a point in the space of representable policies and traces a trajectory by ascending the gradient of the objective function. Typically such a method stops when a local maximum is reached.

In general the surface ascended can have multiple maxima, and even the
most favourable of these -- a global maximum in the space searched --
need not correspond to an optimal policy. This is simply because the
chosen policy space might not (and in practice predominantly does not)
contain an optimal policy. And even if an optimal policy is present in
the space, it need not be reached, because the search might not have
started within its basin of attraction.
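
As a toy illustration of this dependence on the starting point (nothing to do with RL, just a one-dimensional function with two maxima), plain gradient ascent ends up in a different maximum depending on where it starts:

def grad(x):
    # derivative of f(x) = -x**4 + 2*x**2, which has maxima at x = -1 and x = +1
    return -4*x**3 + 4*x

for x0 in (-0.1, 0.1):
    x = x0
    for _ in range(1000):
        x += 0.01 * grad(x)       # gradient ascent with a fixed step size
    print(x0, "->", round(x, 3))  # -0.1 -> -1.0, 0.1 -> 1.0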

I think you can call local optima a fundamental characteristic of
policy gradient methods. But I wouldn't think of them as a problem. We
do have convergence guarantees to the optimal policy in finite MDPs,
which rely on visiting every state multiple times and learning the
utilities/values of taking actions. This approach is simply not possible
in large/continuous state spaces, which are typically where policy
gradient methods are employed.
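
(For completeness, the kind of value learning those guarantees rest on is, in its simplest tabular form, something like the Q-learning backup below; Q is just a dictionary here, and the code is only a sketch.)

def q_update(Q, s, a, r, s_, actions, alpha, gamma):
    # One tabular Q-learning backup. With every (state, action) pair visited
    # infinitely often and suitable step sizes, Q converges to the optimal
    # action values in a finite MDP.
    target = r + gamma * max( Q[(s_, b)] for b in actions )
    Q[(s, a)] += alpha * ( target - Q[(s, a)] )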

Perhaps the relevant question to consider is: for a given class of
tasks, what policy representation and search strategy perform well?
Note that in addition to policy gradient methods, one could employ global search schemes: GAs, the cross-entropy method, CMA-ES, etc. None of these are global in the sense of finding global optima; they merely
spread their search across a wider expanse of the policy space. Note
that some might not even find local optima in a provable
sense. Nevertheless, given a finite amount of time, one algorithm's
bias might be more effective than another's on a given problem class.
For example,

* Evolutionary algorithms can work with bit strings and trees as
representations [1],

* NEAT can incrementally complexify a neural network-based
representation for the policy [2],

* The cross-entropy method and CMA-ES are (I think) state-of-the-art
when optimising over real vectors [3,4], and

* Multiple forms of policy gradient and actor-critic methods have been
quite successful for robotic motor control [5].

* Policy gradient methods such as the ones you describe rely on the
analytical computation of the gradient (which is why they use
stochastic policies), but under several representations this might
not be computable (although in some it could be approximated using
finite differences).

In short, there are several candidate algorithms to consider,
depending on the policy representation and the task. And all this
apart from value function-based methods (Sarsa, LSPI, etc.), which
still perform well on several continuous tasks.

Best,
Shivaram

1. http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/jair/pub/volume11/moriarty99a.pdf
2. http://staff.science.uva.nl/~whiteson/pubs/whitesonthesis07.pdf
3. http://web.eotvos.elte.hu/szityu/papers/SzitaLorincz05Learning.pdf
4. http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/ESfDPS.pdf
5. http://www-clmc.usc.edu/publications/P/peters-NN2008.pdf

Mohammad Gheshlaghi Azar

May 11, 2010, 5:17:06 AM5/11/10
to rl-...@googlegroups.com
Hi Shivaram,

Thank you for your comprehensive response. I agree with almost everything you mentioned in your mail. However, I have some reservations about your point that having multiple local maxima is a fundamental characteristic of policy-gradient methods and that convergence to the optimal policy is therefore not guaranteed. We know from supervised learning that if the cost function is convex and differentiable, then gradient-descent methods guarantee convergence to the global optimum in the limit. RL problems are no exception: if the cost function is convex with respect to the parameters of the policy, then policy-gradient methods guarantee convergence to the globally optimal solution, provided we have an unbiased estimate of the gradient. So from a theoretical point of view, global optimality can be achieved in the limit. Of course, in practice I am not aware of any policy structure that yields a convex cost function. But if one came up with a policy structure that made the cost function convex, I do not see why we could not achieve global optimality.


Regards,
Mohammad

"José Antonio Martín H."

May 11, 2010, 2:49:30 PM5/11/10
to rl-...@googlegroups.com
Hi Hado, thanks so much! Do you also have the non-linear TDC version?

Bests,
--jose

whiti...@gmail.com

May 11, 2010, 5:46:26 PM5/11/10
to Reinforcement Learning Mailing List
Hi,

You are right, it is a good time to release some code ...

We (Maei, Sutton, White) will be releasing a reference implementation
of GQ(lambda) shortly. Stay tuned :)

Adam



Hado van Hasselt

May 12, 2010, 3:12:07 AM5/12/10
to rl-...@googlegroups.com
Hi Jose,

I had not implemented the non-linear version, partly because it depends on the function approximator used. The idea is much the same though:

Vs is the current state value, Vs_ the next state value. dVs is a list of derivatives of the value function with respect to its parameters, with (the features of) the current state as input. dVs_ is the same, but with (the features of) the next state as input. For instance, in a neural network setting, this would contain the derivatives of the output with respect to the weights of the network. (Of course, it is then more natural to store these derivatives as a matrix per layer, but I abstract over topological issues here.) I assume P is the number of parameters. Finally, dsw is the list of derivatives of sw, where sw is defined as below. This can (inefficiently! see below) also be found by calculating:

dsw = [ sum([ ddVs[p][q]*w[q] for q in range( P ) ]) for p in range( P ) ]

where ddVs is the Hessian matrix of the value function with respect to the parameters (with the current state as input).

Then, ignoring the projection (see the paper for details), we get
something like:

delta = rt + gamma*Vs_ - Vs                                                                  # TD error
sw = sum([ dVs[p]*w[p] for p in range( P ) ])                                                # dVs^T w
h = [ (delta - sw)*dsw[p] for p in range( P ) ]                                              # second-order correction term
theta = [ theta[p] + alpha*( delta*dVs[p] - gamma*dVs_[p]*sw - h[p] ) for p in range( P ) ]  # value parameters
w = [ w[p] + beta*( delta - sw )*dVs[p] for p in range( P ) ]                                # auxiliary weights

However, please note that this is untested and may contain any number of bugs (so maybe I should not have sent this mail :) ), and it is probably best to wait for the implementation by Maei, Sutton and White.

As a side note, I have the intuition that the elements of h might often be quite small (depending also, of course, on the smoothness of your function approximators), since w tries to minimize the difference between delta and sw. So in practice, one might not even need the value of dsw (although, as the paper notes, it can be computed in O(n), rather than via the Hessian as I did above).
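
(If one does want dsw without forming the Hessian, one cheap possibility is a finite-difference directional derivative of the gradient; gradV below is just a stand-in for whatever routine computes dVs, so this is an illustration rather than the paper's exact O(n) computation:)

eps = 1e-6
theta_eps = [ theta[p] + eps*w[p] for p in range( P ) ]
dVs_eps = gradV( st, theta_eps )                             # gradV: hypothetical helper giving the gradient of V at theta + eps*w
dsw = [ ( dVs_eps[p] - dVs[p] ) / eps for p in range( P ) ]  # approximates (Hessian of V at st) * w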

Anyway, like I said, it is probably better to wait for the implementations by the people who actually invented the algorithms, as those will probably be much more useful and more likely to be bug-free :)

Regards,
Hado


2010/5/11 "José Antonio Martín H." <jama...@fdi.ucm.es>: