--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en
I don't have the official solutions to the book, but I'm pretty confident
I know the answer to your question.
> Anyway, my doubt is in exercise 3.8. "What is the Bellman equation for
> action values, that is, for Q^\pi(s,a)?"
> My answer was
>
> Q^\pi(s,a)=\pi(s,a) \sum_s' P^a_ss' [R^a_ss' + \gamma max_a'
> Q^\pi(s',a')].
> Is that correct?
No, I think it's wrong. It seems like you're mixing up stochastic and
deterministic policies. When using stochastic policies, maximizing Q over
actions doesn't make much sense (the max belongs in the optimality
equation for Q*, not in the Bellman equation for Q^\pi). Moreover, your
equation wouldn't work in the case of a deterministic policy formulated
as a stochastic one (i.e., setting \pi(s,a) := 1 when action a is to be
selected and \pi(s,a) := 0 otherwise): all Q-values for actions not
selected by the policy would then be zero.
The correct answer to this question would be
Q^\pi(s,a) = \sum_s' P^a_ss' [R^a_ss' + \gamma \sum_a' \pi(s',a') Q^\pi(s',a')]
This can also be seen from the backup diagram as suggested in the
question. While for V^\pi one first sums over actions and then, for each
action, over successor states, for Q^\pi one first sums over successor
states and then, for each successor state, over the actions available
there.
When using a deterministic policy, the sum over a' is not needed anymore
and the equation becomes
Q^\pi(s,a) = \sum_s' P^a_ss' [R^a_ss' + \gamma Q^\pi(s', \pi(s'))]
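If it helps, here's a quick numerical sanity check of the stochastic-policy
equation on a made-up two-state, two-action MDP (all the transition
probabilities, rewards, and policy values below are invented for
illustration): iterating the right-hand side is a contraction, so the
Bellman residual should go to zero at the fixed point.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a, s'] = expected reward
# (toy numbers, chosen only so each P[s, a] sums to 1).
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.3, 0.7]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])

# An arbitrary stochastic policy pi[s, a].
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Iterate Q(s,a) <- sum_s' P^a_ss' [R^a_ss' + gamma * sum_a' pi(s',a') Q(s',a')].
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)              # sum_a' pi(s',a') Q(s',a'), indexed by s'
    Q = (P * (R + gamma * V)).sum(axis=2)  # V broadcasts over the s' axis

# At the fixed point the Bellman residual is (numerically) zero.
residual = Q - (P * (R + gamma * (pi * Q).sum(axis=1))).sum(axis=2)
print(np.abs(residual).max())  # ~0
```

For a deterministic policy you'd just replace pi with a one-hot array per
state, which reduces the inner sum to the single term Q^\pi(s', \pi(s')),
matching the second equation above.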
HTH,
Alex
--
PGP Public Key: http://www.tu-ilmenau.de/~alha-in/ahans.asc
Fingerprint: E110 4CA3 288A 93F3 5237 E904 A85B 4B18 CFDC 63E3