normalized expected return

ze...@gmx.net

unread,

Nov 18, 2009, 9:21:24 AM11/18/09

to rl-...@googlegroups.com

Hi,

it is obvious that the expected return depends on gamma given by

R = r_0 + gamma*r_1 + gamma^2*r_2 + ...

In some publications I read the expression "normalized expected return" defined as (1-gamma)*R. I suppose this is done with the geometric series in mind, but since r is not a constant this is not correct.

Is normalizing the expected return the right thing to do?

Are there any studies comparing "learning with expected return" and "learning with normalized expected return"?

regards
M. Schmitt
--
GRATIS f�r alle GMX-Mitglieder: Die maxdome Movie-FLAT!
Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01

Brian Tanner

unread,

Nov 18, 2009, 9:33:06 AM11/18/09

to rl-...@googlegroups.com

I have seen people use 1/(1-gamma) to think about the practical
horizon and how far in advance future decisions can affect the current
value. However, I don't personally ever recall coming across
normalizing the reward as you describe. Of course, that means
nothing.

Do you have a citation or two you could share? I'd like to learn more.

On Wed, Nov 18, 2009 at 7:21 AM, <ze...@gmx.net> wrote:
> Hi,
>
> it is obvious that the expected return depends on gamma given by
>
> R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
>
> In some publications I read the expression "normalized expected return" defined as (1-gamma)*R. I suppose this is done with the geometric series in mind, but since r is not a constant this is not correct.
>
> Is normalizing the expected return the right thing to do?
>
> Are there any studies comparing "learning with expected return" and "learning with normalized expected return"?
>
> regards
> M. Schmitt
> --

> GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT!

> Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01
>

> --
> You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en
>

--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com

ze...@gmx.net

unread,

Nov 18, 2009, 9:49:17 AM11/18/09

to rl-...@googlegroups.com

Hi Brian,

> I have seen people use 1/(1-gamma) to think about the practical
> horizon and how far in advance future decisions can affect the current
> value. However, I don't personally ever recall coming across
> normalizing the reward as you describe. Of course, that means
> nothing.

Yes, I think I saw this expression on some slides, but was never sure how to interpret it.

> Do you have a citation or two you could share? I'd like to learn more.

I am sorry, I cannot find the cites I am looking for. In a quick search I found some publications from J. Peters and S. Schaal (for example [1]), but I know there are others.

[1] http://www-clmc.usc.edu/publications/P/peters-ECML2005.pdf
--
GRATIS f�r alle GMX-Mitglieder: Die maxdome Movie-FLAT!

Michael Littman

unread,

Nov 18, 2009, 9:56:26 AM11/18/09

to rl-...@googlegroups.com

Some researchers report value * (1-gamma), which is often much easier to think about than looking at the value directly because it puts it on the same scale as rewards. Note that multiplying all rewards (or values) by (1-gamma) doesn't really change anything because it is a positive scalar.

On Wed, Nov 18, 2009 at 9:49 AM, <ze...@gmx.net> wrote:

Hi Brian,

> I have seen people use 1/(1-gamma) to think about the practical
> horizon and how far in advance future decisions can affect the current
> value. However, I don't personally ever recall coming across
> normalizing the reward as you describe. Of course, that means
> nothing.

Yes, I think I saw this expression on some slides, but was never sure how to interpret it.

> Do you have a citation or two you could share? I'd like to learn more.

I am sorry, I cannot find the cites I am looking for. In a quick search I found some publications from J. Peters and S. Schaal (for example [1]), but I know there are others.

[1] http://www-clmc.usc.edu/publications/P/peters-ECML2005.pdf
--

GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT!

Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01

Csaba Szepesvari

unread,

Nov 18, 2009, 10:05:46 AM11/18/09

to rl-...@googlegroups.com

Hi,

ze...@gmx.net wrote:
> Hi Brian,
>
>> I have seen people use 1/(1-gamma) to think about the practical
>> horizon and how far in advance future decisions can affect the current
>> value. However, I don't personally ever recall coming across
>> normalizing the reward as you describe. Of course, that means
>> nothing.
>
> Yes, I think I saw this expression on some slides, but was never sure how to interpret it.
>

You can think of discounting having a couple of different roles:
(1) to achieve boundedness of an infinite sum with bounded terms (math
trick);
(2) to compute the total present value of future returns as in finance
given a fixed interest rate;
(3) to approximate the long average average of the return.

Of these in the last case you should normalize the sum by multiplying it
with 1/(1-gamma).

However, no harm is made if you leave out the normalization, but keep
this in mind.

>> Do you have a citation or two you could share? I'd like to learn more.
>
> I am sorry, I cannot find the cites I am looking for. In a quick search I found some publications from J. Peters and S. Schaal (for example [1]), but I know there are others.
>
> [1] http://www-clmc.usc.edu/publications/P/peters-ECML2005.pdf

Many other people used it. I think Sham Kakade is using it in his thesis.

- Csaba

Csaba Szepesvari

unread,

Nov 18, 2009, 10:06:40 AM11/18/09

to rl-...@googlegroups.com

Csaba Szepesvari wrote:
> Hi,
>
> ze...@gmx.net wrote:
>> Hi Brian,
>>
>>> I have seen people use 1/(1-gamma) to think about the practical
>>> horizon and how far in advance future decisions can affect the current
>>> value. However, I don't personally ever recall coming across
>>> normalizing the reward as you describe. Of course, that means
>>> nothing.
>> Yes, I think I saw this expression on some slides, but was never sure how to interpret it.
>>
>
> You can think of discounting having a couple of different roles:
> (1) to achieve boundedness of an infinite sum with bounded terms (math
> trick);
> (2) to compute the total present value of future returns as in finance
> given a fixed interest rate;
> (3) to approximate the long average average of the return.
>
> Of these in the last case you should normalize the sum by multiplying it
> with 1/(1-gamma).

sorry, multiplying by (1-gamma).

Csaba Szepesvari

unread,

Nov 18, 2009, 10:08:57 AM11/18/09

to rl-...@googlegroups.com

Oh,
role number 4:
(4) Compute total return given that the process may terminate with
probability 1-gamma in every step.
(Does not apply very often, more like a trick.)

The question really is if gamma is fixed, given, part of the problem, or
is it part of the solution?

- Csaba

Eric Wiewiora

unread,

Nov 18, 2009, 11:32:33 AM11/18/09

to rl-...@googlegroups.com

There is a discussion of this normalization in Sham Kakade's thesis:
http://ttic.uchicago.edu/~sham/papers/thesis/sham_thesis.html

Normalizing the rewards between 0 and 1, and normalizing the
discounted return by (1-gamma) are useful in analyzing how different
discount factors affect the (additive) epsilon-optimality of a RL
algorithm. Essentially, it's not fair to compare gamma = 0.1 to gamma
= 0.9 because R can be 9 times larger in the second case. By
normalizing R to fall between 0 and 1 for every discount factor, we
can get a better assessment of how longer effective horizons change
the hardness of the learning problem.

On Nov 18, 2009, at 7:21 AM, ze...@gmx.net wrote:

> Hi,
>
> it is obvious that the expected return depends on gamma given by
>
> R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
>
> In some publications I read the expression "normalized expected
> return" defined as (1-gamma)*R. I suppose this is done with the
> geometric series in mind, but since r is not a constant this is not
> correct.
>
> Is normalizing the expected return the right thing to do?
>
> Are there any studies comparing "learning with expected return" and
> "learning with normalized expected return"?
>
> regards
> M. Schmitt
> --

> GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT!

> Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01
>

Nikos Vlassis

unread,

Nov 19, 2009, 9:33:02 AM11/19/09

to rl-...@googlegroups.com

> role number 4:
> (4) Compute total return given that the process may terminate with
> probability 1-gamma in every step.

Yes, in this view 1-gamma can be regarded as the normalizer of a
geometric distribution on trajectory lengths. Interestingly it turns
out that there are infinitely many such length distributions (combined
with appropriate discounting of the rewards) that leave the value
function unchanged; see eq.(19)-(26) in
http://www.dpem.tuc.gr/users/vlassis/publications/Vlassis09icml-webversion.pdf

> The question really is if gamma is fixed, given, part of the problem, or
> is it part of the solution?

I would say part of the solution, at least for the well-behaving
domains that we encounter in practice. (If I have an algorithm that
can deal with any inflation rate, why do I have to assume a fixed rate
a priori?) There discounting can be viewed as a mathematical gadget
for searching in the space of policies while assuming as little as
possible about the domain, offering easy proofs of convergence of
various algorithms (contraction argument in value iteration,
near-optimality in sampling-based online planning, etc.). It could
even make sense to consider algorithms that change gamma on the fly as
a way to boost convergence (as long as it is clear what we optimize
for).

N.

Csaba Szepesvari

unread,

Nov 19, 2009, 10:46:19 AM11/19/09

to rl-...@googlegroups.com

Hi,

Indeed.
And this is called the vanishing discount approach amongst MDP/control
people. If you let gamma->1 and normalize, under certain conditions, you
will end up optimizing the average reward per step.
- Csaba

> N.
>

Francisco Martinez Gil

unread,

Nov 20, 2009, 6:58:32 AM11/20/09

to rl-...@googlegroups.com

Hi,
I want to use Kanerva coding to generalise the state-action space because it
seems more adequate that tile coding or RBF for high dimensionalities.

Unfortunatelly I haven't found lot of bibliography about it. Only the
introductory reference of the Sutton & Barto's book, the Kostiadis' Thesis on
the RoboCup domain and the Kanerva's paper "Sparse Distributed Memory" that
is not focused on learning.
Does anybody know other papers where the Kanerva coding is explained in the
context of RL?

Is there any reason for the scarce bibliography? I mean, is it difficult to
implement/adjust method or it gives poor results?

Best regards

Francisco Martinez

Siegmund Duell

unread,

Nov 20, 2009, 7:15:16 AM11/20/09

to Reinforcement Learning Mailing List

Hi,
here are two papers comparing Kanerva Coding (KC) and Tile Coding (TC)
by C. Wu:

www.aamas-conference.org/Proceedings/aamas09/pdf/.../C_SP_0719.pdf

teamcore.usc.edu/taylorm/ALA09/4.pdf

http://www.google.de/url?sa=t&source=web&ct=res&cd=3&ved=0CB0QFjAC&url=http%3A%2F%2Fwww.ifaamas.org%2FProceedings%2Faamas08%2Fproceedings%2Fpdf%2Fpaper%2FAAMAS08_0573.pdf&ei=D4cGS6vjGNTdsgbelqS0Cg&usg=AFQjCNEW7ynx7HKSk-LU0LdzQ15yKKoLRw&sig2=tFLu2FGaMkrc0TyT4Bu7zw

Hope that is what you are looking for

best
siegmund duell

On Nov 20, 12:58 pm, Francisco Martinez Gil

Ross Gayler

unread,

Nov 20, 2009, 5:01:46 PM11/20/09

to rl-...@googlegroups.com

That first link is broken. It should be:
http://www.aamas-conference.org/Proceedings/aamas09/pdf/02_Extended_Abstract
/C_SP_0719.pdf

Siegmund Duell

unread,

Nov 20, 2009, 7:17:45 AM11/20/09

to Reinforcement Learning Mailing List

Hi,
here are two links hope that is what you are looking for:

http://www.google.de/url?sa=t&source=web&ct=res&cd=3&ved=0CB0QFjAC&url=http%3A%2F%2Fwww.ifaamas.org%2FProceedings%2Faamas08%2Fproceedings%2Fpdf%2Fpaper%2FAAMAS08_0573.pdf&ei=D4cGS6vjGNTdsgbelqS0Cg&usg=AFQjCNEW7ynx7HKSk-LU0LdzQ15yKKoLRw&sig2=tFLu2FGaMkrc0TyT4Bu7zw

teamcore.usc.edu/taylorm/ALA09/4.pdf

Best
siegmund

On Nov 20, 12:58 pm, Francisco Martinez Gil
<Francisco.Martinez-...@uv.es> wrote:

Siegmund Duell

unread,

Nov 20, 2009, 7:31:38 AM11/20/09

to Reinforcement Learning Mailing List

Hallo Francisco,
I implemented Kanerva Coding for function approximation in Java. I
used Leemon Baird's residual Q-learning to verify the implementation
and did not face any poor results but did not do any extensive tests
yet. I am currently finishing another project but will come back to
the topic.

On Nov 20, 1:15 pm, Siegmund Duell <si.du...@gmail.com> wrote:
> Hi,
> here are two papers comparing Kanerva Coding (KC) and Tile Coding (TC)
> by C. Wu:
>
> www.aamas-conference.org/Proceedings/aamas09/pdf/.../C_SP_0719.pdf
>
> teamcore.usc.edu/taylorm/ALA09/4.pdf
>

> http://www.google.de/url?sa=t&source=web&ct=res&cd=3&ved=0CB0QFjAC&ur...

Marek Grzes

unread,

Nov 20, 2009, 7:04:23 AM11/20/09

to rl-...@googlegroups.com

In the publications section here
http://www.ece.neu.edu/faculty/meleis.html
you can find few papers.

Marek

Arturo Servin

unread,

Nov 20, 2009, 9:11:20 AM11/20/09

to rl-...@googlegroups.com

Hope it helps.

@conference{kostiadis2001kabage,
title={{KaBaGe-RL: Kanerva-based generalisation and reinforcement
learning for possession football}},
author={Kostiadis, K. and Hu, H.},
booktitle={Proceedings of the IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2001)},
year={2001},
organization={Citeseer}
}

@book{kanerva1988sparse,
title={{Sparse distributed memory}},
author={Kanerva, P.},
year={1988},
publisher={The MIT Press}
}

-as

2009/11/20 Francisco Martinez Gil <Francisco.M...@uv.es>:

Reply all

Reply to author

Forward