I'm trying to reproduce Riedmiller's pole balancing results from the ECML
2005 paper about NFQ [1].
However, I'm having trouble finding parameters that produce dynamics in
which the pole falls after about 6 steps of random actions. I'm also
unsure which implementation to use, as there is one using Euler's method
for integration and another using a Runge-Kutta method. In CLS2 (the
"rl-glue" of Riedmiller's group, [2]) a cart-pole environment is
implemented using Runge-Kutta. The closest I could get to the 6 steps was
with an Euler implementation, a cart mass of 8 kg, a pole mass of 2 kg,
dt = 0.1 s, and actions in {-50 N, 0 N, 50 N} as given in the paper, but
with a pole length of l = 0.25 m (as opposed to the l = 0.5 m given in
the paper).
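For what it's worth, this is the kind of implementation I am working
from: the classic cart-pole equations from Sutton's code (where LENGTH
is the pole's half-length), integrated with Euler's method and the
masses, time step, and action set from the paper. The start state,
failure angle, and which pole length to use are my assumptions, not
confirmed values:

```python
import math
import random

GRAVITY = 9.81                 # m/s^2
M_CART = 8.0                   # cart mass [kg], as in the paper
M_POLE = 2.0                   # pole mass [kg], as in the paper
LENGTH = 0.5                   # pole (half-)length [m]; I needed 0.25 to get ~6 steps
DT = 0.1                       # integration step [s]
ACTIONS = (-50.0, 0.0, 50.0)   # applied force [N]

def derivs(state, force):
    """Classic cart-pole equations (Barto, Sutton & Anderson form)."""
    x, x_dot, theta, theta_dot = state
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + M_POLE * LENGTH * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - M_POLE * LENGTH * theta_acc * cos_t / total_mass
    return x_dot, x_acc, theta_dot, theta_acc

def euler_step(state, force):
    """One Euler step of size DT."""
    d = derivs(state, force)
    return tuple(s + DT * ds for s, ds in zip(state, d))

def random_episode(max_steps=1000, fail_angle=math.pi / 2):
    """Apply random actions until the pole passes fail_angle.

    Starts upright; the paper's evaluation presumably randomizes the
    start state, so this is only a rough probe of the dynamics.
    """
    state = (0.0, 0.0, 0.0, 0.0)
    for step in range(1, max_steps + 1):
        state = euler_step(state, random.choice(ACTIONS))
        if abs(state[2]) > fail_angle:
            return step
    return max_steps
```

With this, random_episode() gives the number of random-action steps
before the pole falls, which is the quantity I'm trying to match to 6.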
Moreover, how many iterations would one use (i.e., the constant N in
NFQ_main(), Fig. 1 in [1])? In table-based DP it is possible to iterate
until convergence, i.e., until max |Q_{k+1} - Q_k| < epsilon. In my
experiments with NFQ, 20-30 iterations are sufficient to generate perfect
policies (i.e., policies that balance for at least 3000 steps). At that
point the Q-function has not yet converged, but if I continue to iterate,
the network "unlearns" the policy.
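For concreteness, the outer loop I mean is roughly the following sketch
(my reading of Fig. 1 in [1], not Riedmiller's code; `fit` stands in for
whatever supervised batch learner is used, e.g. an MLP trained with
Rprop, and the cost/discount convention is my assumption):

```python
def nfq_main(transitions, q_init, fit, actions, gamma=0.95, n_iterations=20):
    """Sketch of NFQ's outer loop as I understand it.

    transitions: list of (s, a, cost, s_next) tuples collected offline
    q_init:      initial Q-function, q(s, a) -> float
    fit:         batch trainer mapping a pattern set to a new Q-function
    """
    q = q_init
    for _ in range(n_iterations):
        # Build the pattern set: target = immediate cost plus the
        # discounted minimal predicted cost of the successor state.
        patterns = [((s, a), cost + gamma * min(q(s2, b) for b in actions))
                    for (s, a, cost, s2) in transitions]
        q = fit(patterns)   # supervised batch training on all patterns
    return q
```

My question is essentially what n_iterations should be, given that the
inner fit does not run to convergence of the Q-iteration itself.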
So to summarize: I'm looking for the implementation and parameters used
to generate the observations (and evaluate the policy), and the NFQ
parameter N giving the number of iterations.
Thanks,
Alex
[1]
http://www.ni.uni-osnabrueck.de/fileadmin/user_upload/publications/riedmiller.ecml2005.official.pdf
[2] http://www.ni.uos.de/index.php?id=70
--
PGP Public Key: http://www.tu-ilmenau.de/~alha-in/ahans.asc
Fingerprint: E110 4CA3 288A 93F3 5237 E904 A85B 4B18 CFDC 63E3
This one is ported from Sutton's cart-pole code that goes back to the early 80s.
Not sure if this helps you at all though :)
--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com
> Hi Alexander. There is a different implementation in the RL-Library
> that is compatible with RL-Glue:
> http://library.rl-community.org/wiki/CartPole_(Java)
Yes, I've looked at this one already, it's almost identical to the
implementation I have, I believe mine is also based on Sutton's code. They
all use Euler's method for integration. I'm not sure if using Runge-Kutta
makes a difference here and what implementation Riedmiller used. Maybe I
should ask him directly ...
Thanks for the quick reply,
Alex
As indicated in my ECML paper, for the experiments I used exactly the
pole dynamics and parameters (mass_cart = 8 kg, mass_pole = 2 kg, pole
length = 0.5 m, dt = 0.1 s) as described in the LSPI paper by Lagoudakis
and Parr 2003. The dynamics were evaluated using a fourth-order
Runge-Kutta method.
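[For anyone following along: a classical fourth-order Runge-Kutta step
over the cart-pole state might look like the sketch below. A function
`derivs(state, force)` returning the state's time derivative is assumed,
as is holding the force constant over the whole step; neither detail is
confirmed from the paper.]

```python
DT = 0.1  # integration step [s], as in the paper

def rk4_step(derivs, state, force, dt=DT):
    """One classical fourth-order Runge-Kutta step.

    derivs(state, force) must return the time derivative of the state
    tuple; the applied force is held constant over the step.
    """
    def add(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = derivs(state, force)
    k2 = derivs(add(state, k1, dt / 2.0), force)
    k3 = derivs(add(state, k2, dt / 2.0), force)
    k4 = derivs(add(state, k3, dt), force)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```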
N (the number of NFQ iterations) was chosen to be 100 in this case. The
number of training epochs for batch learning in each iteration was also
100. Varying these two parameters should not influence the result too
much.
Hope this helps, regards,
Martin
Alexander Hans wrote:
Thank you very much for the reply!
> as indicated in my ECML paper, for the experiments I used exactly the
> pole dynamics and parameters (mass_cart = 8 kg, mass_pole = 2 kg, pole
> length = 0.5 m, dt = 0.1 s) as described in the LSPI paper by Lagoudakis
> and Parr 2003. The dynamics were evaluated using a fourth-order
> Runge-Kutta method.
If I am not mistaken, this is the method implemented in CLS2, right?
Using those parameters with the CLS2 implementation should then give the
6 steps. I think I tried that and got larger numbers, but I will
double-check.
> N (number of NFQ iterations) in this case was chosen to be 100.
> The number of training epochs for batch learning in each iteration was
> also 100.
Does this mean you did not train the network until convergence? Or did it
always converge within 100 training epochs?
Thanks again and best regards,
Alex