Monte Carlo On- and Off-Policy methods


Ruben

Sep 29, 2009, 11:24:31 AM
to Reinforcement Learning Mailing List
Hi everyone,

I am currently writing the section about MC methods in my thesis, and
while reading about them in the Sutton and Barto RL introduction book
[1][2], I have come up with two questions about on- and off-policy
Monte Carlo methods:

- When performing On-Policy MC methods, in the end we get the policy
that is optimal among the epsilon-soft policies. Now, if I take that
policy and set epsilon to zero, will it be optimal? In other words,
does On-Policy MC generate the optimal policy we are looking for?
- It's said that Off-Policy MC methods only learn from the tails of
episodes, and at the time the book was written it was unclear whether
that was a disadvantage, as Off-Policy MC methods had received too
little attention. Has there been any progress in this field? Are
Off-Policy MC methods even worth mentioning?


In case any of you has papers about MC research at hand, do not
hesitate to link them.

Thanks in advance,
Ruben

[1] http://www.cs.wmich.edu/~trenary/files/cs5300/RLBook/node54.html
[2] http://www.cs.wmich.edu/~trenary/files/cs5300/RLBook/node56.html

Brian Tanner

Sep 29, 2009, 12:29:46 PM
to rl-...@googlegroups.com
To your first question, the answer is no, I think. It will not learn
the policy that is optimal when there is no exploration. Consider the
cliff world in Sutton and Barto's book. The agent will learn a policy
that avoids the cliff, because walking close to the cliff is dangerous
under e-soft policies. When you turn off exploration, the policy of
course still avoids the cliff, even though it is now safe to walk
along its edge.
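In code, e-soft action selection looks roughly like this (a minimal sketch; the function name and values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-valued) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the agent is purely greedy and never deviates from
# its learned (cliff-avoiding) path:
# epsilon_greedy([1.0, 5.0, 3.0], 0.0) -> 1
```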
--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com

Ruben

Oct 1, 2009, 2:06:49 AM
to Reinforcement Learning Mailing List
Alright, so that would require some sort of epsilon that decays over
time in order to move toward the optimal policy?

Richard Cubek

Oct 1, 2009, 8:25:08 AM
to rl-...@googlegroups.com
I think an on-policy method with epsilon-greedy exploration and an
epsilon decaying towards 0 will converge to the optimal policy (for
the greedy actions), that is, the same policy that would be learned
with an off-policy method.
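A common way to realize such a decaying schedule is epsilon_k = 1/k, which satisfies the GLIE conditions (a sketch; the function name is mine):

```python
def glie_epsilon(episode):
    """1/k schedule: epsilon decays to 0, yet the sum over episodes
    diverges, so every action is still tried infinitely often."""
    return 1.0 / (episode + 1)

# glie_epsilon(0) -> 1.0 (fully exploratory at the start)
# glie_epsilon(99) -> 0.01 (nearly greedy later on)
```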
--
Richard Cubek, Dipl.-Ing.(FH)
University of Applied Sciences Ravensburg-Weingarten
Intelligent Mobile Robotics Laboratory
Phone: (0049) (0)751 501 9838
Mobile: (0049) (0)163 88 39 529

Michael Littman

Oct 1, 2009, 8:47:07 AM
to rl-...@googlegroups.com
GLIE (greedy in the limit with infinite exploration) policies generally converge to the off-policy optimum:

http://www.springerlink.com/content/x25nv4t955681168/

Matthieu Geist

Oct 1, 2009, 9:09:16 AM
to rl-...@googlegroups.com
Hello Ruben,

I think that something closely related to off-policy MC is off-policy policy evaluation with eligibility traces:

Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, 2001.

To my mind, what you call off-policy MC is more generally (I mean outside the RL field) known as importance sampling, which roughly consists in estimating something about one distribution from samples generated from another distribution.

http://en.wikipedia.org/wiki/Importance_sampling
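The basic idea can be sketched in a few lines (a toy illustration; the function name and the distributions are mine):

```python
def importance_sampling_estimate(samples, target_pdf, behavior_pdf):
    """Estimate E_target[f(x)] from (x, f(x)) pairs drawn under the
    behavior distribution, reweighting each by target(x)/behavior(x)."""
    total = 0.0
    for x, fx in samples:
        total += (target_pdf(x) / behavior_pdf(x)) * fx
    return total / len(samples)

# Toy check: behavior is uniform on {0, 1}; target puts 0.8 mass on 1.
behavior = {0: 0.5, 1: 0.5}
target = {0: 0.2, 1: 0.8}
samples = [(0, 0.0), (1, 1.0), (0, 0.0), (1, 1.0)]
est = importance_sampling_estimate(samples, target.get, behavior.get)
# est -> 0.8, the true mean of x under the target distribution
```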

Best regards,

Matthieu Geist


Rob Zumwalt

Oct 1, 2009, 8:48:14 AM
to rl-...@googlegroups.com
What about optimistic initial values for the states (or state-action
pairs)? This should allow setting epsilon to zero while still staying
on-policy. Perhaps it is not optimal?
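A sketch of what this looks like for a tabular learner (R_MAX and GAMMA are assumed problem constants; names are mine):

```python
from collections import defaultdict

R_MAX = 1.0   # assumed upper bound on the per-step reward
GAMMA = 0.9   # assumed discount factor

# Initialize every Q(s, a) to an upper bound on the optimal value, so
# untried actions look at least as good as any tried one and get
# visited even without explicit epsilon-exploration.
optimistic_value = R_MAX / (1.0 - GAMMA)
Q = defaultdict(lambda: optimistic_value)

# An untried pair starts at the optimistic bound:
# Q[("s0", "left")] -> ~10.0
```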

Rob


Warren Powell

Oct 1, 2009, 10:12:55 AM
to rl-...@googlegroups.com
Rob,

This only works if you can compute the expectation exactly (as in
real-time dynamic programming). If you depend on Monte Carlo estimates,
then you have to handle the case where initial estimates are optimistic,
but as a result of Monte Carlo sampling, you lose this property. So,
even with initially optimistic estimates, you still have to explore.

Optimistic estimates are one way of handling the exploration problem,
but the approach is very sensitive to the choice of initial estimates.
If they are too high, you may be forcing an exploration of the entire
state space (at least once), which is problematic if your state space
is large.

Epsilon-greedy exploration can also suffer from very slow convergence.
It is well known in ADP/RL that some algorithms are provably convergent,
but the rate of convergence is so slow that the algorithm is
impractical. We have very few good stopping rules, and it is easy for
an algorithm to reach "apparent convergence" when in fact it is quite
far from anything resembling an optimal solution. An algorithm can
produce very good results in one setting, and poor results in another,
but you do not really know when you are getting poor results unless you
have some sort of benchmark.

Warren Powell

Michael Littman

Oct 5, 2009, 3:59:51 PM
to rl-...@googlegroups.com
For what it's worth, the "delayed Q-learning" algorithm of Strehl et al. shows how to combine optimistic initialization with conservative updates to obtain provable convergence to a near-optimal policy:


www.cs.rutgers.edu/~strehl/papers/ICML06PACModelFreeRL.pdf

Ruben

Oct 6, 2009, 5:23:56 PM
to Reinforcement Learning Mailing List
Thanks for your answers.
I did implement the Delayed Q-learning algorithm proposed by Strehl et
al. What would be good values for epsilon and delta? I tried different
values for both between 0 and 1, but the problem is that even if I
discretize my continuous state space down to 1900 states, I hardly get
any updates, because the condition Q(s,a) - U(s,a)/m >= 2 * epsilon is
almost never satisfied, even after thousands of episodes. Only if I
use relatively high epsilon values do I get a smaller m (~5000 against
~1e6 for a lower epsilon) and updates every hundred episodes, but then
the epsilon value is so high that it has a significant influence on my
updates (Q(s,a) <- U(s,a)/m + epsilon), which leads to wrong
estimates.

However, the fact that it only updates a single state estimate in some
thousand episodes also indicates that something strange is going on.
Maybe it's because my rewards are 1 or 0.99 most of the time and only
0 if something fails (normally I give rewards from -1 to 0, but the
paper states that rewards between 0 and 1 are necessary for this
algorithm)? Or is the algorithm only applicable to small state spaces?
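For reference, the update logic I am describing, as I understand it, is roughly the following (variable names are mine; see Strehl et al. for the exact algorithm):

```python
def delayed_q_update(q_sa, u_sa, m, eps1):
    """After m samples have been accumulated into u_sa, update Q(s,a)
    only if the averaged target is at least 2*eps1 below the current
    estimate; otherwise leave it unchanged."""
    if q_sa - u_sa / m >= 2 * eps1:
        return u_sa / m + eps1  # the conservative, eps1-padded update
    return q_sa

# A successful update: 1.0 - 4.0/10 = 0.6 >= 0.2, so Q becomes ~0.5.
# A rejected update: 0.5 - 4.5/10 = 0.05 < 0.2, so Q stays at 0.5.
```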

Ruben

Nikos Vlassis

Oct 7, 2009, 11:45:44 AM
to rl-...@googlegroups.com
Hi Ruben,
 
- When performing On-Policy MC methods, in the end we get the policy
that is optimal among the epsilon-soft policies.

I think this is not true in general, and it depends on things like how we choose trajectories for simulation, how much we bootstrap, what our policy class is, how (and how often) we update the policy, etc. For example, the paper already pointed to (http://www.springerlink.com/content/x25nv4t955681168/) establishes the convergence of single-step on-policy MC (= Sarsa(0)) for GLIE policies. The on-policy MC algorithms in the book of Sutton & Barto (1998) are only shown to converge when either the policy is fully evaluated before each update (see the analysis in the book), or when every state-action pair is used to initialize the observed trajectories with the same frequency (see www.mit.edu/~jnt/Papers/J089-02-jnt-optimistic.pdf). Some other papers demonstrate convergence of on-policy MC for "smoother" policy updates (google "natural policy gradient"; check http://www.dpem.tuc.gr/users/vlassis/publications/Vlassis09icml-webversion.pdf). Chapter 5.4 (Optimistic Policy Iteration) of the book of Bertsekas & Tsitsiklis (1996) contains several details on on-policy MC.

So, AFAIK, the general conditions for convergence of on-policy MC are still an open problem, and presumably this involves the class of epsilon-soft policies as well.
 
Has there been any progress in this field? Are Off-Policy
MC methods even worth mentioning?

Off-policy MC is indeed related to importance sampling; a recent paper is www.aaai.org/Papers/AAAI/2008/AAAI08-214.pdf

Best regards,
Nikos

Lihong Li

Oct 7, 2009, 2:55:19 PM
to Reinforcement Learning Mailing List
Hi Ruben,

It seems you're using Equation 7 in the delayed Q-learning paper for
m's value. While this value is critical to obtain theoretically sound
results, it is undoubtedly too conservative in many problems, since
the algorithm is designed for worst-case scenarios. With more prior
information about the reward/transition functions, you can tighten the
analysis significantly with a smaller m value. In practice, you might
want to tune the value of m.

The almost-uniform reward function you use can potentially lead to
challenges, as you said. There are some papers in the literature that
study how to achieve faster convergence with rare events (such as the
Frank-Mannor-Precup ICML paper: www.cs.mcgill.ca/~jfrank8/pubs/icml08rareevents.pdf).
They may be helpful for your problem.

Lihong