What are the advantages and disadvantages of off-policy learning compared to on-policy learning?

Law Roboss

unread,
Sep 26, 2013, 9:46:23 AM
to rl-...@googlegroups.com
Dear all,
   Recently, I have been reading some papers about reinforcement learning, and one point is still not clear to me: what are the advantages and disadvantages of off-policy learning compared to on-policy learning?
Thanks.

Sergio Valcarcel

unread,
Sep 26, 2013, 2:08:09 PM
to rl-...@googlegroups.com
Hi Law,

Off-policy learning is a very cool idea.

When an agent interacts with the environment, the experience it gathers
depends on its behavior policy. For instance, if a robot goes towards a
ramp, it is likely that the sensor reading given by an accelerometer
will be high; at least, it is more likely than if the robot moves
towards a flat area or does not move at all.

There are cases in which some behavior policies are dangerous. For
instance, if the robot moves very close to the stairs, it may fall from
a great height and break on impact with the floor.
As another example, imagine a factory manager who wants to optimize the
manufacturing process. He could change some parameters that control the
machines; however, if he makes a mistake, the production line could
collapse (and lose a lot of money!). The manager could think about
learning the optimal parameters using RL, but he wants to be conservative
when trying new parameters. The question then is: "can the manager
learn about policies less conservative than his actual behavior policy?"

Another example of off-policy learning comes up naturally with
general value functions. Imagine that you want to learn a predictive
representation of the environment. In other words, you are not
interested in a classic grammar of concepts, but rather you want to be
able to predict the response of the environment to your actions. For
instance, you may want to predict that when moving north, the
accelerometer reading will rise very fast (note that you do not really
care whether there is a change in the slope of the terrain called a
ramp, but rather that your reading will shoot up).
In this case, off-policy learning becomes very relevant. Note that
although you behave with one policy, you may want to predict what would
have happened if you had followed another policy, or one thousand
different policies. This way, you could learn a lot about the environment
from just a single stream of data! Even if you usually move west, you
could predict something like "what will the accelerometer reading be if
I move north, east or south?"
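To make the "one stream, many predictions" idea concrete, here is a minimal sketch of ordinary importance sampling (all numbers are made up for illustration): a uniform-random behavior policy generates a single stream of (action, accelerometer-reading) samples, and four deterministic target policies ("always north", etc.) are evaluated from that one stream at once.

```python
import random

random.seed(0)

ACTIONS = ["north", "east", "south", "west"]

# Hypothetical ground truth: expected accelerometer reading after each
# move (e.g., "north" leads up the ramp). These numbers are made up.
TRUE_MEAN = {"north": 5.0, "east": 1.0, "south": 0.5, "west": 1.0}

def reading(action):
    """Noisy accelerometer sample for the chosen action."""
    return TRUE_MEAN[action] + random.gauss(0.0, 0.1)

# One behavior policy (uniform random, mu(a) = 1/4) generates a single
# stream of data; each deterministic target policy pi is evaluated from
# it via the importance weight rho = pi(a)/mu(a).
estimates = {a: 0.0 for a in ACTIONS}
n = 20000
for t in range(1, n + 1):
    a = random.choice(ACTIONS)
    r = reading(a)
    for target in ACTIONS:
        rho = 4.0 if a == target else 0.0  # pi(a) is 1 for its own action
        # incremental average of rho * r over the whole stream
        estimates[target] += (rho * r - estimates[target]) / t

for a in ACTIONS:
    print(a, round(estimates[a], 2))  # each estimate approaches TRUE_MEAN[a]
```

Because E[rho * r] under the behavior policy equals the expected reading under the target policy, all four predictions converge using only the single behavior stream, which is exactly the appeal described above.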

Hope this helps!
Best,
Sergio
> --
> --
> You received this message because you are subscribed to the
> "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en
> ---
> You received this message because you are subscribed to the Google
> Groups "Reinforcement Learning Mailing List" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to rl-list+u...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

--
You are the master of the moments of your life.
--Paramahansa Yogananda

Make your day awesome:
http://www.youtube.com/embed/-aEhSgyWZe0

Michael Kaisers

unread,
Sep 27, 2013, 7:54:46 AM
to rl-...@googlegroups.com
Hi Law,

in RL, you need to explore to gain knowledge about actions you haven't tried or are uncertain about. In single-agent settings, you may want to estimate not the value of the policy you are executing (that would be on-policy, and it includes the explorative actions), but the value of (the best estimate of) the optimal policy, which excludes the explorative actions and is therefore not the policy being executed; hence "off-policy".

The point I want to make is that in multi-agent settings, e.g., when learning to coordinate on a collaborative task or to compete in a market, off-policy learning has a diminished role to play. In order to learn about each other, each agent needs to execute its current best policy estimate and learn the quality of that policy. Otherwise, if there were a discrete switch from some explorative behavior to the estimate of the optimal behavior, the environment would change drastically for the other adaptive agents, and so might their responses, which in turn may change the optimal behavior. In brief, samples from multi-agent interactions are most informative about the executed policy.
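For readers less familiar with the terminology, the distinction above is the classic difference between the Q-learning and SARSA update rules. A minimal sketch, assuming a tabular Q stored as a dict of dicts (the state/action names and constants are purely illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy action in s_next, i.e.,
    evaluate the (estimated) optimal policy rather than the policy
    that actually generates the data."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually taken next,
    so the value estimate includes the effect of exploration."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Toy illustration: in s1 the greedy action is worth 2.0, but suppose
# exploration actually picks the action worth 1.0.
Q = {"s0": {"go": 0.0}, "s1": {"greedy": 2.0, "explore": 1.0}}
q_learning_update(Q, "s0", "go", 1.0, "s1")        # bootstraps from max
ql = Q["s0"]["go"]

Q = {"s0": {"go": 0.0}, "s1": {"greedy": 2.0, "explore": 1.0}}
sarsa_update(Q, "s0", "go", 1.0, "s1", "explore")  # bootstraps from the explorative action
sa = Q["s0"]["go"]

print(ql, sa)  # the off-policy target is the larger of the two
```

The same transition produces two different updates: Q-learning evaluates the greedy policy that is not being executed, while SARSA evaluates the explorative policy that is, which is exactly the on/off-policy distinction described above.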

Best regards, Michael

Sergio Valcarcel

unread,
Sep 28, 2013, 12:26:58 PM
to rl-...@googlegroups.com
Hi Michael,

Your point on multi-agent systems sounds very interesting, but I do not quite understand it.
I would say that, depending on the game and on the difference between the target and behavior distributions, off-policy learning could help in a multi-agent system just as in the single-agent case. I mean, performance bounds for off-policy learning are already difficult to obtain even in the single-agent case [1]!

What kind of algorithms do you have in mind (TD, DP, actor-critic, evolutionary...)?
Are you assuming complete or partial observability?
What kind of game (cooperative, competitive... potential...)?

My understanding is that the main challenge in a Markov game is the non-stationarity of the problem: other agents may behave very differently in response to the very same action of yours.

[1] see "The Fixed Points of Off-Policy TD" by Zico Kolter, NIPS 2011, available at http://www.cs.cmu.edu/~zkolter/pubs/kolter-nips11.pdf.
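The kind of difficulty [1] analyzes can be reproduced in a few lines with the classic two-state counterexample (in the style of Tsitsiklis and Van Roy): off-policy TD(0) with linear function approximation diverges even though the sampled transition is deterministic and noise-free. This is only an illustrative sketch; the constants are arbitrary.

```python
# Linear values V(s1) = theta * 1.0 and V(s2) = theta * 2.0, reward 0,
# and an off-policy sampling distribution that only updates the
# s1 -> s2 transition. Under these conditions theta diverges.
gamma = 0.99   # discount close to 1
alpha = 0.1    # step size (divergence occurs for any alpha > 0 here)
theta = 1.0

for _ in range(100):
    # TD(0) error on the s1 -> s2 transition:
    # delta = r + gamma * V(s2) - V(s1) = 0 + gamma * 2 * theta - theta
    delta = gamma * 2.0 * theta - theta
    theta += alpha * delta * 1.0   # semi-gradient update, phi(s1) = 1.0

print(theta)  # grows without bound instead of converging
```

Each update multiplies theta by (1 + alpha * (2 * gamma - 1)) > 1, so the estimate explodes geometrically; this is why off-policy fixed points and bounds are delicate even in the single-agent case.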

Best!
Sergio

Michael Kaisers

unread,
Oct 10, 2013, 10:28:20 AM
to rl-...@googlegroups.com, ser...@gaps.ssr.upm.es
Dear Sergio,

You are right, performance bounds in single-agent learning can already be tricky. All I am saying is that they may be even harder to obtain in multi-agent learning.

It is easy to show that the switch from epsilon-greedy exploration to a pure strategy (e.g., greedy exploitation) can change the best response from one action to another, with an arbitrarily large impact on the payoff (e.g., occasionally checking bus tickets with high fines may make buying a ticket the best response, while never checking makes free-riding optimal). This means that, especially if opponent actions are not observable and the game structure is not known a priori, the quality of policies that are not executed is hard to estimate, and the stability of such estimates is questionable.

Gradual decay of exploration and learning rates is used to limit the change that a single update can induce on the opponent's observed state transitions etc., which leads to the non-stationarity of observable variables that you mentioned as a key challenge. That being said, you may very well find specific settings where off-policy learning does have its niche.
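The bus-ticket example can be sketched in a few lines (the costs and probabilities are made up for illustration): the best response flips as a function of the inspection probability, so the value of a non-executed policy depends sharply on the opponent's behavior.

```python
# Made-up numbers: a ticket costs 2, the fine for free-riding is 10.
TICKET_COST = 2.0
FINE = 10.0

def best_response(p_check):
    """Cheapest option for the passenger, given the probability that
    the inspector checks tickets on any given ride."""
    expected_fine = p_check * FINE
    return "buy_ticket" if TICKET_COST < expected_fine else "free_ride"

print(best_response(0.5))  # frequent inspections: buying is the best response
print(best_response(0.0))  # no inspections at all: free-riding is optimal
```

With these numbers the best response flips at p_check = 0.2, so a small change in the inspector's (behavior) policy discretely changes the passenger's optimal action, with a correspondingly large jump in payoff.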

Best regards, Michael