Rob,
This only works if you can compute the expectation exactly (as in
real-time dynamic programming). If you depend on Monte Carlo estimates,
then even if the initial estimates are optimistic, the noise introduced
by sampling can push an estimate below the true value, and you lose this
property. So, even with initially optimistic estimates, you still have
to explore.
Optimistic estimates are one way of handling the exploration problem,
but this approach is very sensitive to the choice of initial estimates. If they are
too high, you may be forcing an exploration of the entire state space
(at least once), which is problematic if your state space is large.
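To make this concrete, here is a toy two-armed bandit sketch (the arm
means, horizon, and initial value of 1.0 are all made-up numbers for
illustration). It runs purely greedy selection with optimistic initial
estimates and sample-average updates, so you can see how a single
unlucky Monte Carlo sample can drop an arm's estimate below its true
mean and starve it of further pulls:

```python
import random

# Toy 2-armed bandit: true means 0.3 and 0.7 (hypothetical numbers).
random.seed(0)
true_means = [0.3, 0.7]
q = [1.0, 1.0]       # optimistic initial estimates
counts = [0, 0]

def pull(arm):
    # Bernoulli reward: a noisy Monte Carlo sample of the true mean
    return 1.0 if random.random() < true_means[arm] else 0.0

for _ in range(1000):
    arm = max(range(2), key=lambda a: q[a])   # purely greedy, no epsilon
    r = pull(arm)
    counts[arm] += 1
    q[arm] += (r - q[arm]) / counts[arm]      # sample-average update

# With exact expectations, q would remain an upper bound and greedy
# selection would be safe; with sampled rewards, one low draw can push
# an estimate below the true mean, and that arm may never be tried again.
print(q, counts)
```

On this seeded run the better arm (true mean 0.7) gets one unlucky
sample early and is then ignored for the rest of the horizon, which is
exactly the failure mode described above.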
Epsilon-greedy exploration can also suffer from very slow convergence.
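For reference, epsilon-greedy selection is just the following (a minimal
hypothetical helper, not from any particular codebase); with a small
epsilon, any given state-action pair is revisited only rarely, which is
one source of the slow convergence:

```python
import random

def epsilon_greedy(q, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-estimate) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])
```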
It is well known in ADP/RL that some algorithms are provably convergent,
but the rate of convergence is so slow that the algorithm is
impractical. We have very few good stopping rules, and it is easy for
an algorithm to reach "apparent convergence" when in fact it is quite
far from anything resembling an optimal solution. An algorithm can
produce very good results in one setting, and poor results in another,
but you do not really know when you are getting poor results unless you
have some sort of benchmark.
Warren Powell