NFQ for Pole Balancing


Alexander Hans

unread,
Mar 26, 2010, 6:01:09 AM3/26/10
to rl-...@googlegroups.com
Hello all,

I'm trying to reproduce Riedmiller's pole-balancing results from the ECML
2005 paper about NFQ [1].

However, I'm having trouble finding parameters that result in dynamics
that cause the pole to fall after 6 steps when applying random actions.
I'm also not sure which implementation to use, as there is one using
Euler's method for integration and another using a Runge-Kutta method. In
CLS2 (the "RL-Glue" of Riedmiller's group, [2]) a cart-pole environment
is implemented using Runge-Kutta. The closest I could get to the 6 steps
was with an implementation using Euler's method, an 8 kg cart mass, a
2 kg pole mass, dt = 0.1 s, and actions in {-50 N, 0 N, 50 N} as given in
the paper, but with a pole length l = 0.25 m (as opposed to the l = 0.5 m
given in the paper).
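
For concreteness, here is a minimal sketch of the Euler variant I mean,
using the LSPI-style pendulum dynamics with the constants above; the
function names are my own, and this is not any of the implementations
discussed here:

```python
import math

# Cart-pole parameters as stated above (LSPI-style pendulum dynamics);
# an illustrative sketch, not one of the implementations discussed.
G = 9.8        # gravity [m/s^2]
M_CART = 8.0   # cart mass [kg]
M_POLE = 2.0   # pole mass [kg]
LENGTH = 0.5   # pole length [m]
DT = 0.1       # integration time step [s]

def theta_ddot(theta, theta_dot, force):
    """Angular acceleration of the pole under an applied force."""
    alpha = 1.0 / (M_CART + M_POLE)
    num = (G * math.sin(theta)
           - alpha * M_POLE * LENGTH * theta_dot ** 2 * math.sin(2 * theta) / 2.0
           - alpha * math.cos(theta) * force)
    den = 4.0 * LENGTH / 3.0 - alpha * M_POLE * LENGTH * math.cos(theta) ** 2
    return num / den

def euler_step(theta, theta_dot, force):
    """One forward-Euler step of length DT."""
    return (theta + DT * theta_dot,
            theta_dot + DT * theta_ddot(theta, theta_dot, force))
```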

Moreover, how many iterations would one use (i.e., the constant N in
NFQ_main(), Fig. 1 in [1])? In table-based DP it is possible to do as
many iterations as necessary for convergence, i.e., until Q_{k+1} = Q_k
+/- epsilon. In my experiments with NFQ, 20-30 iterations are sufficient
to generate perfect policies (i.e., policies that balance for at least
3000 steps). At this point the Q-function has not yet converged, but if I
continue to iterate, the network "unlearns" the policy.
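
To make clear which loop I mean, here is a skeleton of NFQ_main; a plain
Q-table stands in for the multilayer perceptron so the sketch is
self-contained (real NFQ fits an MLP with Rprop on the pattern set in
each iteration), and the discount factor is an illustrative value:

```python
GAMMA = 0.95  # discount factor (illustrative value, not from the paper)

def nfq_main(transitions, states, actions, n_iterations):
    """Skeleton of NFQ_main; transitions are (s, a, cost, s') tuples."""
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iterations):
        # Build the pattern set: input (s, a),
        # target = cost + gamma * min_a' Q(s', a').
        patterns = [((s, a), c + GAMMA * min(q[(s2, a2)] for a2 in actions))
                    for (s, a, c, s2) in transitions]
        # Batch "training" step; with a table this is exact assignment,
        # with an MLP it would be Rprop training on all patterns.
        for inp, target in patterns:
            q[inp] = target
    return q
```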

So to summarize, I'm looking for the implementation and parameters used to
generate the observations (and evaluate the policy) and the NFQ parameter
N giving the number of iterations.


Thanks,

Alex


[1]
http://www.ni.uni-osnabrueck.de/fileadmin/user_upload/publications/riedmiller.ecml2005.official.pdf
[2] http://www.ni.uos.de/index.php?id=70

--
PGP Public Key: http://www.tu-ilmenau.de/~alha-in/ahans.asc
Fingerprint: E110 4CA3 288A 93F3 5237 E904 A85B 4B18 CFDC 63E3

Brian Tanner

unread,
Mar 26, 2010, 8:44:57 AM3/26/10
to rl-...@googlegroups.com
Hi Alexander. There is a different implementation in the RL-Library
that is compatible with RL-Glue:
http://library.rl-community.org/wiki/CartPole_(Java)

This one is ported from Sutton's cart-pole code that goes back to the early 80s.

Not sure if this helps you at all though :)


--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com

Alexander Hans

unread,
Mar 26, 2010, 8:55:59 AM3/26/10
to rl-...@googlegroups.com
Hello Brian,

> Hi Alexander. There is a different implementation in the RL-Library
> that is compatible with RL-Glue:
> http://library.rl-community.org/wiki/CartPole_(Java)

Yes, I've already looked at that one; it's almost identical to the
implementation I have, which I believe is also based on Sutton's code.
They all use Euler's method for integration. I'm not sure whether using
Runge-Kutta makes a difference here or which implementation Riedmiller
used. Maybe I should ask him directly ...


Thanks for the quick reply,

Alex

Martin Riedmiller

unread,
Mar 26, 2010, 10:52:46 AM3/26/10
to rl-...@googlegroups.com
Hi Alexander,

As indicated in my ECML paper, for the experiments I used exactly the
pole dynamics and parameters (mass_cart = 8 kg, mass_pole = 2 kg, pole
length = 0.5 m, dt = 0.1 s) as described in the LSPI paper by Lagoudakis
and Parr 2003. The dynamics were evaluated using a fourth-order
Runge-Kutta method.
N (the number of NFQ iterations) in this case was chosen to be 100. The
number of training epochs for batch learning in each iteration was also
100. Varying these two parameters should not influence the result too
much.
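
In code, a classical fourth-order Runge-Kutta step over these dynamics
would look roughly like this (a reconstruction for illustration, not the
CLS2 implementation; parameter values are those quoted above):

```python
import math

# LSPI-style pendulum dynamics; parameters as quoted in this thread.
G, M_CART, M_POLE, LENGTH, DT = 9.8, 8.0, 2.0, 0.5, 0.1

def deriv(state, force):
    """State derivative of (theta, theta_dot) under an applied force."""
    theta, theta_dot = state
    alpha = 1.0 / (M_CART + M_POLE)
    num = (G * math.sin(theta)
           - alpha * M_POLE * LENGTH * theta_dot ** 2 * math.sin(2 * theta) / 2.0
           - alpha * math.cos(theta) * force)
    den = 4.0 * LENGTH / 3.0 - alpha * M_POLE * LENGTH * math.cos(theta) ** 2
    return (theta_dot, num / den)

def rk4_step(state, force, dt=DT):
    """One classical fourth-order Runge-Kutta step of length dt."""
    def add(s, k, h):
        return (s[0] + h * k[0], s[1] + h * k[1])
    k1 = deriv(state, force)
    k2 = deriv(add(state, k1, dt / 2.0), force)
    k3 = deriv(add(state, k2, dt / 2.0), force)
    k4 = deriv(add(state, k3, dt), force)
    return (state[0] + dt / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            state[1] + dt / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))
```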


Hope this helps, regards,
Martin


Alexander Hans

unread,
Mar 26, 2010, 11:19:59 AM3/26/10
to rl-...@googlegroups.com
Hello Martin,

thank you very much for the reply!


> as indicated in my ECML paper, for the experiments I used exactly the
> pole dynamics and parameters (mass_cart = 8kg, mass_pole = 2kg, length
> pole = 0,5 m, dt =0.1) as described in the LSPI paper by Lagoudakis and
> Parr 2003. The dynamics were evaluated using a fourth order Runge-Kutta
> method.

If I am not mistaken, this is the method implemented in CLS2, right?
Using those parameters with the CLS2 implementation should then give the
6 steps. I think I tried that and got larger numbers, but I will
double-check.


> N (number of NFQ iterations) in this case was chosen to be 100.
> The number of training epochs for batch learning in each iteration was
> also 100.

Does this mean you did not train the network until convergence? Or did it
always converge within 100 training epochs?


Thanks again and best regards,

Alex

Martin Riedmiller

unread,
Mar 26, 2010, 11:45:21 AM3/26/10
to rl-...@googlegroups.com
Hi,

>> N (number of NFQ iterations) in this case was chosen to be 100.
>> The number of training epochs for batch learning in each iteration was
>> also 100.
>>
>
> Does this mean you did not train the network until convergence? Or did it
> always converge within 100 training epochs.
>
Determining 'convergence' is one of the tricky points of NFQ. Therefore,
in this case, these parameters were fixed empirically. The reasoning
behind this is that after 100 training epochs of supervised learning,
the MSE did not change significantly any more. Similarly, 100 NFQ
iterations gave good and stable results.
This procedure was chosen to be comparable to the procedure in
Lagoudakis and Parr.
In other settings, the number N of NFQ iterations is determined, for
example, by continuously testing the current performance of the neural
Q function.
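
That last alternative could be sketched like this; `nfq_iteration` and
`evaluate_policy` are placeholders of my own, not names from any actual
code:

```python
def nfq_with_monitoring(q, nfq_iteration, evaluate_policy, max_iters=100):
    """Run NFQ iterations, keeping the best-evaluating Q function."""
    best_q, best_score = q, evaluate_policy(q)
    for _ in range(max_iters):
        q = nfq_iteration(q)        # one NFQ iteration (regression step)
        score = evaluate_policy(q)  # e.g. balancing steps of the greedy policy
        if score > best_score:
            best_q, best_score = q, score
    return best_q, best_score
```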

Best,
Martin

Erik Gerrits

unread,
Apr 20, 2010, 6:08:03 PM4/20/10
to Reinforcement Learning Mailing List
Hello Alexander,

The pole will fall within 6 steps if you multiply the pole length by the
cart mass in your code instead of multiplying the pole length by the
pole mass. I don't know if this is correct, but I think Martin
Riedmiller does this in his code, pole_dynamics.c from CLS. Is this an
error, or is it as it should be? In the latter case there is an error in
the problem description in Lagoudakis' Least-Squares Policy Iteration
paper.
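
A quick numerical way to see the effect of the two variants (my own
reconstruction of the LSPI-style dynamics, for illustration only; I have
not seen pole_dynamics.c):

```python
import math

# LSPI-style parameters as quoted in this thread.
G, M_CART, M_POLE, LENGTH, DT = 9.8, 8.0, 2.0, 0.5, 0.1

def theta_ddot(theta, theta_dot, force, ml_mass):
    """Pole acceleration with a selectable mass in the mass*length term."""
    alpha = 1.0 / (M_CART + M_POLE)
    ml = ml_mass * LENGTH
    num = (G * math.sin(theta)
           - alpha * ml * theta_dot ** 2 * math.sin(2 * theta) / 2.0
           - alpha * math.cos(theta) * force)
    den = 4.0 * LENGTH / 3.0 - alpha * ml * math.cos(theta) ** 2
    return num / den

def steps_until_fall(ml_mass, theta0=0.01, max_steps=1000):
    """Forward-Euler rollout with zero force until |theta| > pi/2."""
    theta, theta_dot = theta0, 0.0
    for step in range(1, max_steps + 1):
        theta += DT * theta_dot
        theta_dot += DT * theta_ddot(theta, theta_dot, 0.0, ml_mass)
        if abs(theta) > math.pi / 2:
            return step
    return None
```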

Martin, you say: "N (number of NFQ iterations) in this case was chosen
to be 100. The number of training epochs for batch learning in each
iteration was also 100." I presume the second parameter doesn't apply to
NFQ but to another variant of batch learning, right? Otherwise I don't
know which parameter it is, because for NFQ only the number of epochs
over the training data has to be set, or am I wrong?

Thanks,
Erik


Erik Gerrits

unread,
Apr 20, 2010, 5:27:08 PM4/20/10
to Reinforcement Learning Mailing List
Hello Alexander,

Did you get 8.5 steps instead of 6? Then this is probably because the
term mass times length in the paper of Lagoudakis refers to the mass of
the POLE times the length of the pole, while in Martin Riedmiller's
implementation the mass of the CART times the length of the pole is
used. Which one is correct, I don't know.

Regards,
Erik

