> role number 4:
> (4) Compute total return given that the process may terminate with
> probability 1-gamma in every step.
Yes, in this view 1-gamma can be regarded as the normalizer of a
geometric distribution on trajectory lengths. Interestingly, it turns
out that there are infinitely many such length distributions (each
combined with an appropriate discounting of the rewards) that leave
the value function unchanged; see eqs. (19)-(26) in
http://www.dpem.tuc.gr/users/vlassis/publications/Vlassis09icml-webversion.pdf
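
To make the termination view concrete, here is a minimal sketch (my own
illustration, not taken from the paper above). It assumes a fixed,
finite reward sequence and a process that collects one reward per step
and then continues with probability gamma; the function names are
hypothetical. Averaging the undiscounted return over the resulting
geometric length distribution gives the same number as the usual
discounted sum.

```python
def discounted_return(rewards, gamma):
    """Standard discounted sum: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def expected_return_with_termination(rewards, gamma):
    """Expected UNdiscounted return when the process survives each step
    with probability gamma.

    The trajectory length T satisfies P(T = n) = (1 - gamma) * gamma^(n-1),
    so 1 - gamma is the normalizer of the geometric distribution.
    Assumption: rewards are zero beyond the end of the given list, so all
    lengths n >= len(rewards) are lumped together with total probability
    gamma^(len(rewards) - 1).
    """
    horizon = len(rewards)
    expected = 0.0
    partial_sum = 0.0  # undiscounted return of a trajectory of length n
    for n in range(1, horizon + 1):
        partial_sum += rewards[n - 1]
        if n < horizon:
            prob = (1.0 - gamma) * gamma ** (n - 1)  # P(T = n)
        else:
            prob = gamma ** (n - 1)                  # P(T >= horizon)
        expected += prob * partial_sum
    return expected
```

Both functions agree up to floating-point error for any gamma in
(0, 1), which is just the statement that step t is reached with
probability gamma^t.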
> The question really is if gamma is fixed, given, part of the problem, or
> is it part of the solution?
I would say part of the solution, at least for the well-behaved
domains that we encounter in practice. (If I have an algorithm that
can deal with any inflation rate, why should I have to assume a fixed
rate a priori?) There, discounting can be viewed as a mathematical
gadget for searching the space of policies while assuming as little as
possible about the domain, and it offers easy proofs of convergence
for various algorithms (the contraction argument in value iteration,
near-optimality in sampling-based online planning, etc.). It could
even make sense to consider algorithms that change gamma on the fly as
a way to speed up convergence (as long as it is clear what we are
optimizing for).
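
As a small illustration of the contraction argument, here is a sketch
on a made-up two-state, two-action MDP (the transition and reward
tables below are hypothetical, chosen only for the example). It applies
the Bellman optimality backup to two different value vectors and
records how much their sup-norm distance shrinks per sweep; the ratio
should never exceed gamma.

```python
# Hypothetical MDP: TRANSITIONS[s][a] is a list of (probability, next_state),
# REWARDS[s][a] is the immediate reward for taking action a in state s.
TRANSITIONS = [
    [[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
    [[(1.0, 1)], [(0.8, 0), (0.2, 1)]],
]
REWARDS = [[0.0, 1.0], [2.0, 0.5]]


def bellman_backup(values, gamma):
    """One sweep of value iteration (Bellman optimality operator)."""
    return [
        max(
            REWARDS[s][a]
            + gamma * sum(p * values[s2] for p, s2 in TRANSITIONS[s][a])
            for a in range(len(TRANSITIONS[s]))
        )
        for s in range(len(TRANSITIONS))
    ]


def sup_norm_gap(u, v):
    """Largest absolute componentwise difference between two value vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def contraction_ratios(gamma, sweeps=5):
    """Ratio of sup-norm gaps after/before each backup of two value vectors."""
    u, v = [0.0, 0.0], [10.0, -3.0]  # arbitrary starting points
    ratios = []
    for _ in range(sweeps):
        gap_before = sup_norm_gap(u, v)
        if gap_before == 0.0:  # vectors already coincide; nothing to measure
            break
        u, v = bellman_backup(u, gamma), bellman_backup(v, gamma)
        ratios.append(sup_norm_gap(u, v) / gap_before)
    return ratios
```

Calling contraction_ratios(0.9) returns one ratio per sweep, each at
most 0.9; a smaller gamma gives a tighter bound and hence faster
convergence, which is what makes tuning gamma tempting.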
N.