> --
> You received this message because you are subscribed to the
> "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en
dongqing
Basically, Sarsa learns about a policy that sometimes takes optimal actions (as estimated) and sometimes explores other actions, while Q-learning learns about the policy that doesn't explore and only takes optimal (as estimated) actions. Sarsa will learn to be careful in an environment where exploration is costly; Q-learning will not.
In the cliff world, walking near the cliff is faster, but if you choose a wrong action you will deterministically fall off the cliff. Alternatively, if you choose the path away from the cliff, it takes longer, but a wrong action will not hurt you as much. Sarsa learns to take the slow path because it accounts for the fact that it takes exploratory actions and that their results can be costly. Q-learning takes the cliff path because it doesn't care about the cost incurred by exploratory actions.
Finally, consider the extreme case: these two algorithms are still well-defined if you were to take a completely random action on each step (if epsilon greedy exploration is used, set epsilon to 1). In this case, Sarsa is literally learning the value of the random policy while acting randomly, and Q-learning is learning the value of the optimal policy, but is *acting* randomly.
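This extreme case can be checked with a small experiment. Below is a sketch (the toy MDP, the variable names, and the parameters are mine, not from the thread): a hypothetical two-step chain s0 -> s1 -> terminal with gamma = 1, where at s1 one action pays 1 and the other pays 0, and the agent acts uniformly at random (epsilon = 1). Sarsa's estimate of Q(s0, .) should approach the random policy's value (0.5), while Q-learning's should approach the optimal value (1.0), even though both act identically.

```python
import random

def run(algo, episodes=20000, seed=0):
    """Tabular SARSA / Q-learning on a toy chain, acting purely at random."""
    rng = random.Random(seed)
    # Two-step chain: s0 --go--> s1 --{good, bad}--> terminal (gamma = 1).
    Q = {("s0", "go"): 0.0, ("s1", "good"): 0.0, ("s1", "bad"): 0.0}
    n = dict.fromkeys(Q, 0)  # visit counts, for 1/n step sizes

    def update(key, target):
        n[key] += 1
        Q[key] += (target - Q[key]) / n[key]

    for _ in range(episodes):
        a1 = rng.choice(["good", "bad"])  # epsilon = 1: behavior is random
        if algo == "sarsa":
            boot = Q[("s1", a1)]  # bootstrap on the action actually taken
        else:
            boot = max(Q[("s1", "good")], Q[("s1", "bad")])  # greedy bootstrap
        update(("s0", "go"), 0.0 + boot)  # reward 0 on the first step
        update(("s1", a1), 1.0 if a1 == "good" else 0.0)  # terminal step
    return Q

q_sarsa = run("sarsa")
q_qlearn = run("qlearning")
# q_sarsa[("s0", "go")] ends up near 0.5 (random-policy value),
# q_qlearn[("s0", "go")] ends up near 1.0 (optimal value).
```

Both runs see the same random behavior; only the bootstrap term differs, which is exactly the point of the extreme case above.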
--
Brian Tanner
PhD Student
University of Alberta
br...@tannerpages.com
On 29/03/2011 17:41, Dongqing Shi wrote:
--
======================================================================
José Antonio Martín H. PhD.
Departamento de Sistemas Informáticos y Computación
Facultad de Informática
Universidad Complutense de Madrid
C/ Profesor José García Santesmases s/n. 28040 Madrid
http://www.dia.fi.upm.es/~jamartin/
mailto://jama...@fdi.ucm.es
"El orden es el recurso no renovable más importante"
"Order is the truly nonrenewable resource."
======================================================================
So is that the definition of on-policy, or are you just trying to
differentiate these two algorithms?
In RL, if your learning is not on-policy, then it is called off-policy learning: learning about a policy that is different from your behavior policy. So, if the samples used in the TD update are not generated according to your behavior policy (the policy that the agent is following), then it is called off-policy learning; you can also say learning from off-policy data. So in your example:
> What about if the update of Q involves the weighted average between the Max(Q) and Q(s'), is it a half online half offline?
Yes, it is off-policy learning (I didn't understand what you meant by "offline", though).
Cheers,
Hamid
On 11-03-29 9:41 AM, Dongqing Shi wrote:
> so is the definition saying that the learned policy is used on-line for
> decision-making?
> If so, it seems similar to learning the model and planning with it.
Nope.
Cs.
dongqing
This is the definition: if the algorithm estimates the value function of
the policy generating the data, the method is called on-policy.
Otherwise it is called off-policy.
Q-learning learns an optimal policy, but its behavior policy is not
exactly the same as the optimal policy, since the behavior policy will
take actions with smaller Q values for exploration.
In Sarsa, however, these two policies match exactly. Sarsa learns a
non-optimal policy, although most of the time it takes optimal actions.
The cliff example shows why such a non-optimal policy can be very
useful: to avoid falling off the cliff if you make a mistake.
opinions?
dongqing
> Basically, Sarsa learns about a policy that sometimes takes optimal actions (as estimated) and sometimes explores other actions, while Q-learning learns about the policy that doesn't explore and only takes optimal (as estimated) actions. Sarsa will learn to be careful in an environment where exploration is costly, Q-learning will not.
In short, in Q-learning the goal is to learn the optimal action-value
function, irrespective of the sampling (aka behavior) policy. So you
learn about the action-value function of the optimal policy, even when
the behavior policy is not the optimal policy.
A small correction to what you said above:
Notice that you can use Q-learning while *not* following a policy
derived from the learned action-values. For example, you can sample from
a fixed policy.
Important general observation: there are value-function estimation
methods, which can be, but are not necessarily, used to change the
policy that generates the samples. Separate these two things
to avoid confusion.
Q-learning is not a method to change the policy currently followed. It
is a method which can be used for this purpose, but that's all.
- Csaba
On 11-03-29 9:49 AM, "José Antonio Martín H." wrote:
> What about if the update of Q involves the weighted average between the
> Max(Q) and Q(s'), is it a half online half offline?
>
This method probably converges to halfway between the two value
functions: the optimal one, and the value function underlying the
behavior policy (if that is your intent here). Then, it is off-policy.
Note that one potential confusion is that I am talking about value
function estimation algorithms, which can be used with any fixed behavior
policy.
These algorithms are building blocks of more complete algorithms which
change the policy on the fly in a feedback loop.
What about SARSA? SARSA is usually presented as a closed-loop learning
algorithm which also changes the sampling policy. That is a mistake: you
can (and should) separate the value-estimation part from the part which
reasons about what actions to take. Then you can talk about the SARSA
value-estimation algorithm.
Bests,
- Csaba
Can you say "the method" is off-policy? Perhaps on-policy?
In which part does the distinction reside: in the method that generates
the samples? No.
In the part that estimates the action-value function? Yes, but this time
the type of the method is not deterministic.
So, suppose the method runs 100 episodes; if 99 episodes perform
Q-learning but just 1 episode runs Sarsa, is it off-policy?
Or is the distinction defined for just one episode?
But the episode is also non-deterministic, so?
Bests.
Jose.
On 29/03/2011 18:20, Csaba Szepesvari wrote:
--
Almost. To estimate the optimal action-value function, the behavior
policy needs to explore sufficiently. Then, indeed, Q-learning
estimates the optimal action-value function. Clearly, if some actions
are not selected enough (e.g., never), no sampling-based algorithm can
learn about the optimal value of these actions.
This observation may seem trivial, but it implies that certain
exploration methods should be used with care, as they may induce
too little exploration to learn about the optimal policy. If I
remember correctly, this is also mentioned in the book by Sutton and
Barto when they talk about optimistic value initialization.
Best,
Hado
Don't mix them (to answer question 1, question 2 may be taken into
account, but needn't necessarily be; you can also roll dice).
Now compare the update formulas, which form the outcome of the answer to
the *second* question:
SARSA:
Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]
Q-Learning:
Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]
In SARSA, the answer to the *first* question is taken into account;
therefore it influences the answer to the *second* question.
In Q-Learning, nothing that has to do with the *first* question is
taken into account; thus, the *first* question here has no influence on
the outcome of the correct answer to the *second* question. You can roll
dice for each action to take; the correct answer to 2 stays the same.
Now someone could argue: "but the first question does affect the second
formula! It decides which states I reach and thus which updates are
done!" :-) Yes, that's right. That is why the *first* question affects
the *time* that you need to form the *correct* answer to the *second*
question, but again, it does NOT affect the *correct* answer.
(Btw, by the *correct* answer to 2, I mean Q*.)
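Translating the two update formulas above into code makes the difference in their inputs visible: SARSA needs the next action a' actually taken (the answer to the first question), while Q-learning does not. A minimal tabular sketch, with function and parameter names of my own choosing:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstraps on Q(s', a') for the action a' actually taken."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: bootstraps on max_a' Q(s', a'), whatever is taken next."""
    best = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

Note the signatures: the SARSA update consumes the tuple (s, a, r, s', a') that gives the algorithm its name, while the Q-learning update consumes only (s, a, r, s') plus the action set.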
Best
Richard
--
Richard Cubek, Dipl.-Ing.(FH)
University of Applied Sciences Ravensburg-Weingarten
Intelligent Mobile Robotics Laboratory
Phone: (0049) (0)751 501 9838
Mobile: (0049) (0)163 88 39 529
On 11-03-29 11:05 AM, "José Antonio Martín H." wrote:
> Dear Csaba,
> So what about a method that at each iteration select randomly the kind
> of update based on max(Q(s,.)) or Q(s',a') ?
>
> Can you say "the method" is off-policy? perhaps on-policy?
Well, it depends. If the mixing probability is fixed (1/2, 1/2), the
effect is the same as actually mixing the two targets with these fixed
probabilities. Then, as long as the mixing probability associated with
max Q(..) is positive, the method will be off-policy.
You can also let the mixing probabilities change over time, decay, etc.
In fact, the definition says that everything that does not learn the
value function of the policy generating the data is off-policy. So
on-policy is a small class; off-policy is a big one.
So, if at step 1 you flip a coin and go with Q-learning if it comes up
heads, and go with SARSA otherwise, you have an off-policy method.
However, the usefulness of a method like this is less than clear. In
fact, the usefulness of the off-policy/on-policy distinction in cases
like this becomes less than clear, too.
>
> In which part resides the distinction, in the method that generates the
> samples? NO.
>
> In the part that estimates the action-value function? Yes, but this time
> the type of the method is not deterministic.
Correct.
Well, this should not matter.
>
> So, suppose the method runs 100 episodes, if 99 episodes performs
> q-learning but just 1 episode runs sarsa, is it off-policy?
>
> Or, is the distinction defined for just one episode?
> But the episode is also non deterministic, so?
We are looking at what the method converges to in the face of infinitely
many samples (and some conditions apply).
- Csaba
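The mixed target discussed in this exchange can be written down directly. A sketch (the parameter name beta is mine, not from the thread): with weight beta the update bootstraps on the greedy value, otherwise on the value of the action actually taken, and any beta > 0 gives an off-policy method, since it no longer evaluates the behavior policy.

```python
def mixed_update(Q, s, a, r, s_next, a_next, actions,
                 beta=0.5, alpha=0.1, gamma=0.9):
    """Blend of the Q-learning and SARSA targets with a fixed weight beta.

    beta = 1 recovers the Q-learning target, beta = 0 recovers SARSA's;
    for any beta > 0 the method is off-policy.
    """
    greedy = max(Q[(s_next, b)] for b in actions)
    target = r + gamma * (beta * greedy + (1.0 - beta) * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

With beta = 1/2 this realizes the fixed (1/2, 1/2) mixing case above; replacing the fixed beta with a schedule gives the time-varying variant.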
On 11-03-29 11:00 AM, Alejandro wrote:
> Thanks! I think this makes things much clearer. In summary, and to check
> that I understood correctly:
>
> Q-learning _always_ estimates the optimal action-value function,
The goal of Q-learning is to estimate the optimal action-value function,
correct.
> and we
> can say that the estimation policy is the one induced by this
> action-value function. The algorithm lets you choose _any_ behavior
> policy. It can be the epsilon-greedy policy derived
> from Q(s,a), but it can be something else.
Yep.
The only thing is that, in order to be able to estimate the values for
all state-action pairs, the behavior policy had better explore all
state-action pairs indefinitely.
- Csaba
On 11-03-29 11:35 AM, Hado van Hasselt wrote:
>> The algorithm let you choose _any_ behavior policy. It can be the
>> one derived by the epsilon-greedy policy derived from Q(s,a), but it can be
>> something else.
>
> Almost. To estimate the optimal action-value function, the behavior
> policy needs to explore sufficiently. Then, indeed, Q-learning
> estimates the optimal action-value function. Clearly, if some actions
> are not selected enough (e.g., never) no sampling-based algorithm can
> learn about the optimal value of these actions.
Hado, you are absolutely right again:)
- Csaba