Why is Q-learning considered an off-policy method?


Alejandro

unread,
Mar 29, 2011, 11:27:06 AM3/29/11
to rl-...@googlegroups.com
Hi:

Why is Q-learning considered an off-policy method? (This is actually Exercise 6.9 of Sutton and Barto's book.)

My understanding is that an off-policy method uses two different policies: the behavior policy, which is fixed and used for exploration, and the estimation policy, which is evaluated and improved. In the off-policy Monte-Carlo method, for instance, it is very clear that these two policies are different entities. However, this distinction is not clear to me in the Q-learning case.

The attached PDF has the Sarsa algorithm (considered on-policy) and Q-learning, as described in the book. The main difference is in the estimation of Q(s,a) (lines 8 and 9 of the algorithms, respectively). However, both algorithms use a policy derived from the action-value function. So it seems to me that Q-learning is using a policy derived from Q(s,a) at the same time as it is updating Q(s,a). This doesn't seem to match the previous definition of an off-policy method.

Any hints?

Alejandro.
sarsa_vs_Qlearning.pdf

Csaba Szepesvari

unread,
Mar 29, 2011, 11:35:40 AM3/29/11
to rl-...@googlegroups.com, Alejandro
This is the definition: If the algorithm estimates the value function of
the policy generating the data, the method is called on-policy.
Otherwise it is called off-policy.
- Csaba

> --
> You received this message because you are subscribed to the
> "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en

Dongqing Shi

unread,
Mar 29, 2011, 11:41:09 AM3/29/11
to rl-...@googlegroups.com
So is the definition saying that the learned policy is used on-line for
decision-making?
If so, it seems similar to learning the model and planning with it.

dongqing

Brian Tanner

unread,
Mar 29, 2011, 11:46:50 AM3/29/11
to rl-...@googlegroups.com
I think this difference is clearly illustrated by the Cliff example from the book. I found that example (once I understood it) to be quite enlightening.

Basically, Sarsa learns about a policy that sometimes takes optimal actions (as estimated) and sometimes explores other actions, while Q-learning learns about the policy that doesn't explore and only takes optimal (as estimated) actions. Sarsa will learn to be careful in an environment where exploration is costly, Q-learning will not.

In the cliff world, if you walk near the cliff it's faster, but if you choose a wrong action you will deterministically fall off the cliff. Alternatively, if you choose the path away from the cliff it takes longer, but a wrong action will not hurt you as much. Sarsa learns to take the slow path because it accounts for the fact that it takes exploratory actions and their results can be costly. Q-learning takes the cliff path because it doesn't care about the cost incurred by exploratory actions.

Finally, consider the extreme case: these two algorithms are still well-defined if you were to take a completely random action on each step (if epsilon greedy exploration is used, set epsilon to 1). In this case, Sarsa is literally learning the value of the random policy while acting randomly, and Q-learning is learning the value of the optimal policy, but is *acting* randomly.
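That extreme case is easy to check numerically. Below is a minimal sketch (editor's illustration on a hypothetical two-state chain MDP, not from the thread): with a fully random behavior policy, Sarsa's estimates settle near the random policy's values while Q-learning's settle near the optimal ones.

```python
import random

GAMMA, ALPHA, EPISODES = 1.0, 0.1, 20000

# Hypothetical two-state chain: s0 -> s1 -> terminal.
# In s1, action 0 pays +1 and action 1 pays -1; in s0 both actions pay 0.
# Optimal value of s0 is +1; the uniformly random policy's value is 0.
def step(state, action):
    if state == 0:
        return 1, 0.0, False                            # s0 -> s1, reward 0
    return None, (1.0 if action == 0 else -1.0), True   # s1 -> terminal

def run(algo, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(EPISODES):
        state, action = 0, rng.choice((0, 1))   # behavior: epsilon = 1, fully random
        done = False
        while not done:
            nxt, reward, done = step(state, action)
            if done:
                target = reward
            else:
                nxt_action = rng.choice((0, 1))  # next action, also random
                if algo == "sarsa":              # bootstrap on the action taken
                    target = reward + GAMMA * Q[(nxt, nxt_action)]
                else:                            # Q-learning: bootstrap on the argmax
                    target = reward + GAMMA * max(Q[(nxt, 0)], Q[(nxt, 1)])
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            if not done:
                state, action = nxt, nxt_action
    return Q

print(run("q")[(0, 0)])      # close to +1: value under the greedy policy
print(run("sarsa")[(0, 0)])  # hovers near 0: value of the random policy
```

Both agents see exactly the same kind of random data; only the bootstrap term in the target differs.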

--
Brian Tanner
PhD Student
University of Alberta
br...@tannerpages.com

Amit B

unread,
Mar 29, 2011, 11:47:30 AM3/29/11
to Reinforcement Learning Mailing List
Hello

The difference is in the Q(s,a) update rule of the algorithms:

Q-learning updates are done regardless of the actual action chosen for
the next state; it just assumes that we always choose the argmax
one. So if we are using an epsilon-greedy policy and chose a random
action in some turn instead of the argmax action, Q-learning will still
update its estimate according to the argmax action. SARSA, on the
other hand, will consider this random action to be the next one and won't
update Q(s,a) the same way Q-learning does in that turn.
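This one-line difference can be written out explicitly. A small sketch (editor's illustration, not from the original message; the function names and the tabular dict representation of Q are assumptions):

```python
import random

# The behavior policy may pick a random (exploratory) action...
def epsilon_greedy(Q, state, actions, eps=0.1, rng=random):
    if rng.random() < eps:
        return rng.choice(actions)                    # exploratory action
    return max(actions, key=lambda a: Q[(state, a)])  # greedy (argmax) action

# ...but Q-learning's target ignores which action is actually taken next:
def q_learning_target(Q, reward, nxt, actions, gamma=0.99):
    return reward + gamma * max(Q[(nxt, a)] for a in actions)

# Sarsa's target depends on nxt_action, the action the behavior policy
# really takes in the next state (possibly a random one):
def sarsa_target(Q, reward, nxt, nxt_action, gamma=0.99):
    return reward + gamma * Q[(nxt, nxt_action)]
```

Whenever an exploratory action is taken in the next state, the two targets disagree; that is the whole on-policy/off-policy difference here.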

To illustrate the difference in behavior of the policies resulting from
these algorithms, refer to example 6.6 in:

http://www.cse.iitm.ac.in/~cs670/book/node65.html

Hope this makes it more clear,
Amit.

"José Antonio Martín H."

unread,
Mar 29, 2011, 11:49:17 AM3/29/11
to rl-...@googlegroups.com
What about if the update of Q involves the weighted average between the
Max(Q) and Q(s'), is it a half online half offline?

On 29/03/2011 17:41, Dongqing Shi wrote:

--
======================================================================
José Antonio Martín H. PhD.
Departamento de Sistemas Informáticos y Computación
Facultad de Informática
Universidad Complutense de Madrid
C/ Profesor José García Santesmases s/n. 28040 Madrid
http://www.dia.fi.upm.es/~jamartin/
mailto://jama...@fdi.ucm.es
"El orden es el recurso no renovable más importante"
"Order is the truly nonrenewable resource."
======================================================================

Dongqing Shi

unread,
Mar 29, 2011, 11:55:03 AM3/29/11
to rl-...@googlegroups.com
On 3/29/2011 11:46 AM, Brian Tanner wrote:
> I think this difference is clearly illustrated by the Cliff example from the book. I found that example (once I understood it) to be quite enlightening.
>
> Basically, Sarsa learns about a policy that sometimes takes optimal actions (as estimated) and sometimes explores other actions, while Q-learning learns about the policy that doesn't explore and only takes optimal (as estimated) actions. Sarsa will learn to be careful in an environment where exploration is costly, Q-learning will not.

So is that the definition of on-policy, or are you just trying to
differentiate these two algorithms?

Hamid Reza Maei

unread,
Mar 29, 2011, 11:58:38 AM3/29/11
to rl-...@googlegroups.com
Hi,

In RL, if your learning is not on-policy then it is called off-policy learning: learning about a policy that is different from your behavior policy. So, if the samples used in the TD update are not generated according to your behavior policy (the policy that the agent is following), then it is called off-policy learning; you can also say learning from off-policy data. So in your example:

> What about if the update of Q involves the weighted average between the Max(Q) and Q(s'), is it a half online half offline?


yes, it is off-policy learning (I didn't understand what you meant by offline though).

Cheers,
Hamid


On 2011-03-29, at 9:49 AM, José Antonio Martín H. wrote:

> What about if the update of Q involves the weighted average between the Max(Q) and Q(s'), is it a half online half offline?
>


Csaba Szepesvari

unread,
Mar 29, 2011, 12:03:45 PM3/29/11
to rl-...@googlegroups.com, Dongqing Shi

On 11-03-29 9:41 AM, Dongqing Shi wrote:
> so is the definition saying that the learned policy is used on-line for
> decision-making?
> if so, it seems similar to learn the model and plan with it.

Nope.

Cs.

"José Antonio Martín H."

unread,
Mar 29, 2011, 12:11:01 PM3/29/11
to rl-...@googlegroups.com
* half on-policy, half off-policy
My mistake!

On 29/03/2011 17:58, Hamid Reza Maei wrote:



--
======================================================================
José Antonio Martín H. PhD.            E-Mail: jama...@fdi.ucm.es
Computer Science Faculty               Phone: (+34) 91 3947650
Complutense University of Madrid       Fax: (+34) 91 3947527
C/ Prof. José García Santesmases, s/n  http://www.dacya.ucm.es/jam/
28040 Madrid, Spain
"El orden es el recurso no renovable más importante"
"Order is the truly nonrenewable resource."
======================================================================

Dongqing Shi

unread,
Mar 29, 2011, 12:12:04 PM3/29/11
to rl-...@googlegroups.com
Thanks, that makes a lot of sense to me.

dongqing



Alejandro

unread,
Mar 29, 2011, 12:16:07 PM3/29/11
to rl-...@googlegroups.com, Alejandro, Csaba Szepesvari
On Tuesday, March 29, 2011 9:35:40 AM UTC-6, Csaba Szepesvari wrote:
This is the definition: If the algorithm estimates the value function of
the policy generating the data, the method is called on-policy.
Otherwise it is called off-policy.

I don't fully understand this definition. In your book, section 3.1.1, page 21, you give the following definition:

"When learning about one policy, while following another is called off-policy learning"

In Q-learning, we are following the policy derived from Q. But which policy are we learning about?

Alejandro.

Dongqing Shi

unread,
Mar 29, 2011, 12:18:30 PM3/29/11
to rl-...@googlegroups.com
So it becomes clear to me as follows.

Q-learning learns an optimal policy, but its behavior policy is not
exactly the same as the optimal policy, since the behavior policy will
take actions with smaller Q values for exploration.
In Sarsa, however, these two policies match exactly. Sarsa learns
a non-optimal policy, although most of the time it takes optimal actions.
The cliff example shows why such a non-optimal policy can be very
useful: to avoid falling off the cliff if you make a mistake.

opinions?

dongqing



Alejandro

unread,
Mar 29, 2011, 12:28:35 PM3/29/11
to rl-...@googlegroups.com, Brian Tanner
On Tuesday, March 29, 2011 9:46:50 AM UTC-6, Brian Tanner wrote:
Basically, Sarsa learns about a policy that sometimes takes optimal actions (as estimated) and sometimes explores other actions, while Q-learning learns about the policy that doesn't explore and only takes optimal (as estimated) actions.  Sarsa will learn to be careful in an environment where exploration is costly, Q-learning will not.

Thanks.  I can see the difference between Sarsa and Q-Learning. My problem is that to me it looks like Q-learning is still using _the same_ Q(s,a) that it is updating to find the actions to be taken. I might be misunderstanding the definition of off-policy.

Alejandro.

Dongqing Shi

unread,
Mar 29, 2011, 12:31:11 PM3/29/11
to rl-...@googlegroups.com
Right; however, exploration encourages the learner to take an action that may not have the max Q value. Such a policy is the behavior policy, which is not the optimal policy derived from Q.

Csaba Szepesvari

unread,
Mar 29, 2011, 12:31:36 PM3/29/11
to rl-...@googlegroups.com, Alejandro, Csaba Szepesvari

In short, in Q-learning the goal is to learn the optimal action-value
function, irrespective of the sampling (aka behavior) policy. So you
learn about the action-value function of the optimal policy. Even when
the behavior policy is not the optimal policy.

A small correction to what you said above:
Notice that you can use Q-learning while *not* following a policy
derived from the learned action-values. For example, you can sample from
a fixed policy.
Important general observation: there are the value-function estimation
methods, which can be used, but are not necessarily used to change the
policy that is used to generate the samples. Separate these two things
to avoid confusion.
Q-learning is not a method to change the policy currently followed. It
is a method which can be used for this purpose, but that's all.

- Csaba

Csaba Szepesvari

unread,
Mar 29, 2011, 12:20:15 PM3/29/11
to rl-...@googlegroups.com, "José Antonio Martín H."
Hi,

On 11-03-29 9:49 AM, "José Antonio Martín H." wrote:
> What about if the update of Q involves the weighted average between the
> Max(Q) and Q(s'), is it a half online half offline?
>

This method probably converges to somewhere in between the two value
functions: the optimal one and the value function underlying the
behavior policy (if that is your intent here). Then, it is off-policy.

Note that one potential confusion is that I am talking about value
function estimation algorithms which can be used with any fixed behavior
policy.

These algorithms are building blocks of more complete algorithms which
change the policy on-the-fly in a feedback loop.

What about SARSA? SARSA is usually presented as a closed-loop learning
algorithm, which also changes the sampling policy. Mistake. You
can/should separate the value-estimation part from the part which
reasons about what actions to take. Then you can talk about the SARSA
value-estimation algorithm.

Bests,
- Csaba

Alejandro

unread,
Mar 29, 2011, 1:00:43 PM3/29/11
to rl-...@googlegroups.com, Alejandro, Csaba Szepesvari
On Tuesday, March 29, 2011 10:31:36 AM UTC-6, Csaba Szepesvari wrote:
In short, in Q-learning the goal is to learn the optimal action-value

function, irrespective of the sampling (aka behavior) policy. So you
learn about the action-value function of the optimal policy. Even when
the behavior policy is not the optimal policy.

A small correction to what you said above:
Notice that you can use Q-learning while *not* following a policy
derived from the learned action-values. For example, you can sample from
a fixed policy.
Important general observation: there are the value-function estimation
methods, which can be used, but are not necessarily used to change the
policy that is used to generate the samples. Separate these two things
to avoid confusion.
Q-learning is not a method to change the policy currently followed. It
is a method which can be used for this purpose, but that's all.


Thanks! I think this makes things much clearer. In summary, and to check that I understood correctly:

Q-learning _always_ estimates the optimal action-value function, and we can say that the estimation policy is the one induced by this action-value function. The algorithm lets you choose _any_ behavior policy. It can be the epsilon-greedy policy derived from Q(s,a), but it can be something else.

Alejandro.
 

"José Antonio Martín H."

unread,
Mar 29, 2011, 1:05:15 PM3/29/11
to rl-...@googlegroups.com
Dear Csaba,
So what about a method that at each iteration randomly selects the kind
of update, based on max(Q(s,.)) or Q(s',a')?

Can you say "the method" is off-policy? Perhaps on-policy?

Where does the distinction reside? In the method that generates the
samples? No.

In the part that estimates the action-value function? Yes, but this time
the type of the method is not deterministic.

So, suppose the method runs 100 episodes: if 99 episodes perform
Q-learning but just 1 episode runs Sarsa, is it off-policy?

Or is the distinction defined for just one episode?
But the episode is also non-deterministic, so?

Bests.
Jose.


On 29/03/2011 18:20, Csaba Szepesvari wrote:

--

Hado van Hasselt

unread,
Mar 29, 2011, 1:35:12 PM3/29/11
to rl-...@googlegroups.com
> The algorithm let you choose _any_ behavior policy. It can be the
> one derived by the epsilon-greedy policy derived from Q(s,a), but it can be
> something else.

Almost. To estimate the optimal action-value function, the behavior
policy needs to explore sufficiently. Then, indeed, Q-learning
estimates the optimal action-value function. Clearly, if some actions
are not selected enough (e.g., never) no sampling-based algorithm can
learn about the optimal value of these actions.

This observation may seem trivial, but it implies that certain
exploration methods should be used with care, as they may stimulate
too little exploration to learn about the optimal policy. If I
remember correctly, this is also mentioned in the book by Sutton and
Barto when they talk about optimistic value initialization.

Best,
Hado

Richard Cubek

unread,
Mar 29, 2011, 1:58:06 PM3/29/11
to rl-...@googlegroups.com
1) Think of the policy just as the question of: "what action (a) to take
in a state (s)?"
2) Think of the Q-function just as the question of: "how good is it to
take action (a) in a state (s)?"

Don't mix them up (to answer 1, question 2 may be taken into account, but
needn't necessarily be; you could also roll a die).

Now compare the update formulas, which form the outcome of the answer to
the *second* question:

SARSA:
Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]

Q-Learning:
Q(s,a) <- Q(s,a) + alpha*[r + gamma * max_a' Q(s',a') - Q(s,a)]


In SARSA, the answer to the *first* question is taken into account;
therefore it influences the answer to the *second* question.

In Q-Learning, nothing that has to do with the *first* question is
taken into account; thus, the *first* question has no influence on
the outcome of the correct answer to the *second* question. You can roll
a die for each action to take; the correct answer to 2 stays the same.

Now someone could argue: "but the first question does affect the second
formula! It decides which states I reach and thus which updates are
done!" :-) Yes, that's right. That is why the *first* question affects
the *time* that you need to form the *correct* answer to the *second*
question, but again, it does NOT affect the *correct* answer.


(Btw, with *correct* answer to 2, I mean Q*)

Best
Richard


--
Richard Cubek, Dipl.-Ing.(FH)
University of Applied Sciences Ravensburg-Weingarten
Intelligent Mobile Robotics Laboratory
Phone: (0049) (0)751 501 9838
Mobile: (0049) (0)163 88 39 529

Rafael Pinto

unread,
Mar 29, 2011, 6:21:58 PM3/29/11
to rl-...@googlegroups.com
This algorithm exists and is called Q-Sarsa. See it on page 62 here: http://leenissen.dk/rl/Steffen_Nissen_Thesis2007_Print.pdf
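As a sketch of that mixed update (editor's illustration; `beta` is a hypothetical mixing weight, with `beta=1` recovering the Q-learning target and `beta=0` the Sarsa target):

```python
def mixed_update(Q, s, a, r, s2, a2, actions, alpha=0.1, gamma=0.99, beta=0.5):
    """One tabular update mixing the Q-learning and Sarsa bootstrap terms."""
    greedy = max(Q[(s2, b)] for b in actions)  # Q-learning bootstrap (argmax)
    onpol = Q[(s2, a2)]                        # Sarsa bootstrap (action taken)
    target = r + gamma * (beta * greedy + (1.0 - beta) * onpol)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Any fixed beta > 0 puts weight on the greedy bootstrap, which is what makes the method off-policy by the definition discussed above.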

Rafael C.P.

2011/3/29 "José Antonio Martín H." <jama...@fdi.ucm.es>
What about if the update of Q involves the weighted average between the Max(Q) and Q(s'), is it a half online half offline?


Csaba Szepesvari

unread,
Mar 29, 2011, 8:34:41 PM3/29/11
to rl-...@googlegroups.com, "José Antonio Martín H."
Hi Jose,

On 11-03-29 11:05 AM, "José Antonio Martín H." wrote:
> Dear Csaba,
> So what about a method that at each iteration select randomly the kind
> of update based on max(Q(s,.)) or Q(s',a') ?
>
> Can you say "the method" is off-policy? perhaps on-policy?

Well, it depends. If the mixing probability is fixed (1/2, 1/2), the
effect is the same as actually mixing the two targets with these fixed
probabilities. Then, as long as the mixing probability associated with
max Q(..) is positive, the method will be off-policy.

You can also let the mixing probabilities change over time, decay etc.

In fact, the definition says that everything that does not learn the
value function of the policy generating the data is off-policy. So
on-policy is a small class, and off-policy is a big one.

So, if at each step you flip a coin and go with Q-learning if it comes
up heads, and go with SARSA otherwise, you have an off-policy method.
However, the usefulness of a method like this is less than clear. In
fact, the usefulness of the off-policy/on-policy distinction in cases
like this becomes less than clear, too.

>
> In which part resides the distinction, in the method that generates the
> samples? NO.
>
> In the part that estimates the action-value function? Yes, but this time
> the type of the method is not deterministic.

Correct.
Well, this should not matter.


>
> So, suppose the method runs 100 episodes, if 99 episodes performs
> q-learning but just 1 episode runs sarsa, is it off-policy?
>
> Or, is the distinction defined for just one episode?
> But the episode is also non deterministic, so?

We are looking at what the method converges to in the face of infinitely
many samples (and some conditions apply).

- Csaba

Csaba Szepesvari

unread,
Mar 29, 2011, 8:37:22 PM3/29/11
to rl-...@googlegroups.com, Alejandro, Csaba Szepesvari
Hi Alejandro,

On 11-03-29 11:00 AM, Alejandro wrote:
> Thanks! I think this makes things much clearer. In summary, and to check
> that I understood correctly:
>
> Q-learning _always_ estimates the optimal action-value function,

The goal of Q-learning is to estimate the optimal action-value function,
correct.

> and we
> can say that the estimation policy is the one induced by this
> value-action function. The algorithm let you choose _any_ behavior
> policy. It can be the one derived by the epsilon-greedy policy derived
> from Q(s,a), but it can be something else.

Yep.
The only thing is that, in order to be able to estimate the values for
all state-action pairs, the behavior policy had better explore all
state-action pairs indefinitely.

- Csaba

Csaba Szepesvari

unread,
Mar 29, 2011, 8:38:21 PM3/29/11
to rl-...@googlegroups.com, Hado van Hasselt

On 11-03-29 11:35 AM, Hado van Hasselt wrote:
>> The algorithm let you choose _any_ behavior policy. It can be the
>> one derived by the epsilon-greedy policy derived from Q(s,a), but it can be
>> something else.
>
> Almost. To estimate the optimal action-value function, the behavior
> policy needs to explore sufficiently. Then, indeed, Q-learning
> estimates the optimal action-value function. Clearly, if some actions
> are not selected enough (e.g., never) no sampling-based algorithm can
> learn about the optimal value of these actions.

Hado, you are absolutely right again :)
- Csaba
