--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en
I don't have the official solutions to the book, but I'm pretty confident
I know the answer to your question.
> Anyway, my doubt is in exercise 3.8. "What is the Bellman equation for
> action values, that is, for Q^\pi(s,a)?"
> My answer was
>
> Q^\pi(s,a)=\pi(s,a) \sum_s' P^a_ss' [R^a_ss' + \gamma max_a'
> Q^\pi(s',a')].
> Is that correct?
No, I think it's wrong. It seems like you're mixing up stochastic and
deterministic policies. When using stochastic policies, maximizing Q over
actions doesn't make much sense (the max belongs in the optimality
equation for Q*, not in the Bellman equation for Q^\pi). Moreover, your
equation wouldn't work in the case of a deterministic policy formulated
as a stochastic one (i.e., setting \pi(s,a) := 1 when action a is to be
selected and \pi(s,a) := 0 otherwise): all Q-values for actions not
selected by the policy would then be zero.
The correct answer to this question would be
Q^\pi(s,a) = \sum_s' P^a_ss' [R^a_ss' + \gamma \sum_a' \pi(s',a') Q^\pi(s',a')]
This can also be seen from the backup diagram as suggested in the
question. While for V^\pi one first sums over actions and then, for each
action, over successor states, for Q^\pi one first sums over successor
states and then, for each successor state, over the actions available
there.
When using a deterministic policy, the sum over a' is not needed anymore
and the equation becomes
Q^\pi(s,a) = \sum_s' P^a_ss' [R^a_ss' + \gamma Q^\pi(s', \pi(s'))]
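If it helps, here's a quick numerical sanity check of the stochastic-policy
equation on a made-up two-state, two-action MDP (all the transition
probabilities, rewards, and policy values below are invented for
illustration): iterating the right-hand side is a contraction, so the
Bellman residual should go to zero at the fixed point.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a, s'] = expected reward
# (toy numbers, chosen only so each P[s, a] sums to 1).
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.3, 0.7]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])

# An arbitrary stochastic policy pi[s, a].
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Iterate Q(s,a) <- sum_s' P^a_ss' [R^a_ss' + gamma * sum_a' pi(s',a') Q(s',a')].
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)              # sum_a' pi(s',a') Q(s',a'), indexed by s'
    Q = (P * (R + gamma * V)).sum(axis=2)  # V broadcasts over the s' axis

# At the fixed point the Bellman residual is (numerically) zero.
residual = Q - (P * (R + gamma * (pi * Q).sum(axis=1))).sum(axis=2)
print(np.abs(residual).max())  # ~0
```

For a deterministic policy you'd just replace pi with a one-hot array per
state, which reduces the inner sum to the single term Q^\pi(s', \pi(s')),
matching the second equation above.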
HTH,
Alex
--
PGP Public Key: http://www.tu-ilmenau.de/~alha-in/ahans.asc
Fingerprint: E110 4CA3 288A 93F3 5237 E904 A85B 4B18 CFDC 63E3