Mohammad,
A policy gradient method essentially starts at a point in the space of
parameterised policies and traces a trajectory by ascending the
gradient of the objective function (the expected return) with respect
to the policy parameters. Typically such a method stops when a local
maximum is reached.
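To make the update concrete, here is a rough sketch in Python (my own
toy illustration, not taken from any of the papers below):
REINFORCE-style gradient ascent for a softmax policy over the two
arms of a made-up bandit.

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.8])   # made-up expected rewards of the two arms

    def softmax(theta):
        z = np.exp(theta - theta.max())
        return z / z.sum()

    theta = np.zeros(2)   # policy parameters
    alpha = 0.1           # step size
    for episode in range(2000):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)          # sample an action from the stochastic policy
        r = rng.normal(true_means[a], 1.0)  # observe a noisy reward
        grad_log_pi = -probs                # for a softmax policy,
        grad_log_pi[a] += 1.0               # grad log pi(a) = one_hot(a) - probs
        theta += alpha * r * grad_log_pi    # ascend a sampled estimate of the gradient
    print(softmax(theta))                   # should put most probability on the better arm

The only point here is that the update direction is a sampled estimate
of the gradient of the expected reward; everything else depends on how
the policy is represented and how that estimate is obtained.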
In general the surface ascended can have multiple maxima, and even the
most favourable of these -- a global maximum in the space searched --
need not correspond to an optimal policy. This is simply because the
chosen policy space might not (and in practice predominantly does not)
contain an optimal policy. And even if an optimal policy is present in
the space, it need not be reached, because the search might not have
started within its basin of attraction.
I think you can call local optima a fundamental characteristic of
policy gradient methods, but I wouldn't think of them as a problem. We
do have convergence guarantees to the optimal policy in finite MDPs,
but these rely on visiting every state-action pair repeatedly and
learning the utilities/values of taking actions. That approach is
simply not feasible in large/continuous state spaces, which is where
policy gradient methods are typically employed.
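For contrast, this is the kind of guaranteed, value-based procedure I
mean, in its simplest form: a toy tabular Q-learning loop on a made-up
three-state chain (the MDP and the constants are invented purely for
illustration).

    import numpy as np

    # transition[s][a] = (next_state, reward); action 1 moves right,
    # action 0 moves left, and reaching state 2 (absorbing) yields reward 1
    transition = {0: {0: (0, 0.0), 1: (1, 0.0)},
                  1: {0: (0, 0.0), 1: (2, 1.0)},
                  2: {0: (2, 0.0), 1: (2, 0.0)}}

    Q = np.zeros((3, 2))                  # one value per state-action pair
    gamma, alpha, epsilon = 0.9, 0.1, 0.2
    rng = np.random.default_rng(0)

    for episode in range(500):
        s = 0
        for t in range(20):
            # epsilon-greedy exploration keeps every state-action pair visited
            a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
            s2, r = transition[s][a]
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # one-step backup
            s = s2

    print(Q)   # the greedy policy w.r.t. Q is the optimal "move right" policy

Maintaining one entry per state-action pair, and visiting each of them
repeatedly, is precisely what fails to scale to large or continuous
state spaces.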
Perhaps the relevant question to consider is: for a given class of
tasks, what policy representation and search strategy perform well?
Note that in addition to policy gradient methods, one could employ
global search schemes: genetic algorithms (GAs), the cross-entropy
method, CMA-ES, etc. None of these are global in the sense of finding
global optima; they merely spread their search across a wider expanse
of the policy space. Indeed, some might not even find local optima in
a provable sense. Nevertheless, given a finite amount of time, one
algorithm's search bias might be more effective than another's on a
given problem class.
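To give a feel for such direct search in policy space, here is a
bare-bones cross-entropy method loop (a sketch of my own, not code
from [3]); the quadratic "evaluate" merely stands in for running a
policy with parameters theta and measuring its average return.

    import numpy as np

    def evaluate(theta):
        # placeholder for: run the policy parameterised by theta, return its average reward
        return -np.sum((theta - 3.0) ** 2)

    dim, pop, n_elite = 5, 50, 10
    mu, sigma = np.zeros(dim), 5.0 * np.ones(dim)
    rng = np.random.default_rng(0)

    for iteration in range(30):
        samples = rng.normal(mu, sigma, size=(pop, dim))          # sample candidate parameter vectors
        scores = np.array([evaluate(th) for th in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]            # keep the best-scoring fraction
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit the sampling distribution

    print(mu)   # concentrates near the maximiser of the stand-in objective

No gradient is computed anywhere; the search is driven purely by
sampling and selection, which is why such methods place few demands on
the policy representation.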
For example,
* Evolutionary algorithms can work with bit strings and trees as
representations [1],
* NEAT can incrementally complexify a neural network-based
representation for the policy [2],
* The cross-entropy method and CMA-ES are (I think) state-of-the-art
when optimising over real vectors [3,4], and
* Multiple forms of policy gradient and actor-critic methods have been
quite successful for robotic motor control [5].
* Finally, note that policy gradient methods such as the ones you
describe rely on computing the gradient analytically (which is why
they use stochastic policies), but under several representations the
gradient might not be computable in this way (although in some cases
it can be approximated using finite differences, as sketched just
below).
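Here is roughly what that finite-difference approximation looks like
(again my own sketch; "rollout_return" is a hypothetical stand-in for
executing the policy with parameters theta and measuring its return).

    import numpy as np

    def rollout_return(theta):
        # placeholder for: run the policy parameterised by theta, return its empirical return
        return -np.sum((theta - 1.0) ** 2)

    def fd_gradient(theta, eps=1e-2):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            d = np.zeros_like(theta)
            d[i] = eps
            # central difference along coordinate i
            grad[i] = (rollout_return(theta + d) - rollout_return(theta - d)) / (2 * eps)
        return grad

    theta = np.zeros(4)
    for step in range(200):
        theta += 0.05 * fd_gradient(theta)   # plain gradient ascent on the estimate
    print(theta)                             # approaches the maximiser of the stand-in objective

Each estimate costs 2 * dim rollouts, and noisy returns mean the
estimates need averaging, which is why the analytical route is
preferred whenever it is available.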
In short, there are several candidate algorithms to consider,
depending on the policy representation and the task. And all of this
is apart from value function-based methods (Sarsa, LSPI, etc.), which
still perform well on several continuous tasks.
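To round off that value-based alternative, here is a compact Sarsa(0)
sketch using a crude state-aggregation value function over a
one-dimensional continuous state; the "walk to the right end" task is
invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_bins, n_actions = 10, 2            # features: one-hot bin of the position in [0, 1]
    w = np.zeros((n_bins, n_actions))    # linear weights, so Q(s, a) = w[bin(s), a]
    gamma, alpha, epsilon = 0.95, 0.1, 0.1

    def bin_of(x):
        return min(int(x * n_bins), n_bins - 1)

    def step(x, a):
        # action 1 drifts right, action 0 drifts left; reward 1 for reaching the right end
        x = float(np.clip(x + (0.1 if a == 1 else -0.1) + rng.normal(0, 0.01), 0.0, 1.0))
        done = x >= 0.99
        return x, (1.0 if done else 0.0), done

    def choose(x):
        return int(rng.integers(n_actions)) if rng.random() < epsilon else int(w[bin_of(x)].argmax())

    for episode in range(500):
        x, a = 0.0, choose(0.0)
        for t in range(200):
            x2, r, done = step(x, a)
            a2 = choose(x2)
            target = r if done else r + gamma * w[bin_of(x2), a2]
            w[bin_of(x), a] += alpha * (target - w[bin_of(x), a])   # Sarsa update
            x, a = x2, a2
            if done:
                break

    print(w.argmax(axis=1))   # greedy action per bin; learns to keep moving right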
Best,
Shivaram
1. http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/jair/pub/volume11/moriarty99a.pdf
2. http://staff.science.uva.nl/~whiteson/pubs/whitesonthesis07.pdf
3. http://web.eotvos.elte.hu/szityu/papers/SzitaLorincz05Learning.pdf
4. http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/ESfDPS.pdf
5. http://www-clmc.usc.edu/publications/P/peters-NN2008.pdf