RL in continuous state-action spaces benchmark


sarah mehraban

Dec 20, 2011, 2:18:47 PM
to rl-...@googlegroups.com
Hi,
I'm working on a reinforcement learning agent that can learn in continuous state spaces and produce continuous-valued actions, and I need a good benchmark to test it.
A very famous one is the cart-pole balancing problem, but in that problem discrete actions lead to better results than continuous ones. This can be explained by two facts: first, the task is well suited to bang-bang actions, and second, using only two actions greatly simplifies the learning problem and reduces the learning time.
Using continuous-valued actions can lead to a better policy, and I need a benchmark that shows this.
I would really appreciate it if you could introduce me to some.

Thanks.

Alejandro Weinstein

Dec 20, 2011, 2:56:49 PM
to rl-...@googlegroups.com
> I need a good benchmark to test it.

The following papers deal with continuous action spaces, and include
some environments you can try.

Strösslin, T., & Gerstner, W. (2003). Reinforcement learning in
continuous state and action space. Artificial Neural Networks-ICANN.

van Hasselt, H., & Wiering, M. A. (2007). Reinforcement Learning in
Continuous Action Spaces. 2007 IEEE International Symposium on
Approximate Dynamic Programming and Reinforcement Learning (ADPRL),
272-279.

Marek Grzes

Dec 20, 2011, 2:31:31 PM
to rl-...@googlegroups.com
Perhaps the boat problem could be useful for you. It was used, e.g.,
here: http://books.nips.cc/papers/files/nips20/NIPS2007_0959.pdf

Cheers,
Marek


Marc Deisenroth

Dec 20, 2011, 2:21:29 PM
to rl-...@googlegroups.com

Hi,

you could try the cart-pole swing-up (plus balancing). Also, if you
set the sampling frequency to something low (i.e., a time step of 0.1
seconds), then bang-bang control can run into problems.
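
For illustration, here is a minimal sketch of such a swing-up task with a continuous force input, using the classic cart-pole equations of motion, a coarse 0.1-second Euler step, and a cos(theta) reward; the class name, the reward, and the physical constants are illustrative choices, not something Marc specified:

import numpy as np

class CartPoleSwingUp:
    """Minimal cart-pole swing-up sketch with a continuous force input.

    Dynamics follow the classic cart-pole equations of motion; the pole
    starts hanging down, and the per-step reward is cos(theta), so the
    agent has to swing the pole up and keep it balanced.  The time step
    `dt` and the force limit are free parameters.
    """

    def __init__(self, dt=0.1, max_force=10.0):
        self.dt, self.max_force = dt, max_force
        self.g, self.mc, self.mp, self.length = 9.81, 1.0, 0.1, 0.5
        self.reset()

    def reset(self):
        # State: cart position, cart velocity, pole angle (pi = hanging down), angular velocity.
        self.state = np.array([0.0, 0.0, np.pi, 0.0])
        return self.state.copy()

    def step(self, force):
        force = float(np.clip(force, -self.max_force, self.max_force))
        x, x_dot, th, th_dot = self.state
        total_mass = self.mc + self.mp
        temp = (force + self.mp * self.length * th_dot**2 * np.sin(th)) / total_mass
        th_acc = (self.g * np.sin(th) - np.cos(th) * temp) / (
            self.length * (4.0 / 3.0 - self.mp * np.cos(th)**2 / total_mass))
        x_acc = temp - self.mp * self.length * th_acc * np.cos(th) / total_mass
        # Forward Euler with a comparatively coarse time step.
        self.state = self.state + self.dt * np.array([x_dot, x_acc, th_dot, th_acc])
        reward = float(np.cos(self.state[2]))  # close to +1 only near the upright position
        return self.state.copy(), reward

With such a coarse time step, a bang-bang controller can easily overshoot near the upright position, which is the issue Marc mentions.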

Marc


Gerhard Neumann

Dec 20, 2011, 3:11:34 PM
to rl-...@googlegroups.com
> I need a good benchmark to test it.

You can also find some simple benchmarks in [1]. In general, you can use any control task with continuous actions (including the cart-pole balancing task, even though that one is quite boring...) if you just add a squared penalty term for the control action to your reward function. The agent then has to learn an energy-efficient controller, and the bang-bang policy is suboptimal in this case.
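
For concreteness, a minimal sketch of such a reward (the function name and the weight w are illustrative assumptions, not values from this thread):

import numpy as np

def energy_penalized_reward(task_reward, action, w=0.1):
    """Task reward minus a squared penalty on the control action.

    The weight w trades off task performance against control effort;
    with this penalty, always applying the maximum force incurs a
    constant cost per step.
    """
    a = np.atleast_1d(np.asarray(action, dtype=float))
    return task_reward - w * float(a @ a)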

best...

Gerhard

[1] Neumann, G.; Peters, J. (2009). Fitted Q-iteration by Advantage Weighted Regression, Advances in Neural Information Processing Systems 22 (NIPS 2008)

der...@ulg.ac.be

Dec 20, 2011, 3:56:24 PM
to rl-...@googlegroups.com
Hello Sarah,

You could perhaps also try the HIV infection control problem described in:
http://www.montefiore.ulg.ac.be/~ernst/CDC_2006.pdf

It is a rather difficult benchmark for algorithms that learn a state-action value function.

All the best,

Damien

Prof. Damien ERNST
University of Liège - Dpt. of Elec. Eng.,
Building B28, Parking P32, B-4020 Liège, BELGIUM.
Email : der...@ulg.ac.be
Homepage : http://www.montefiore.ulg.ac.be/~ernst/


Martin Riedmiller

Dec 20, 2011, 7:25:59 PM
to rl-...@googlegroups.com
Hi,

you can find a couple of challenging continuous control benchmarks, with
detailed descriptions (underwater vehicle, airplane, magnetic levitation,
HVAC), in the following paper:

Hafner, Roland and M. Riedmiller. Reinforcement learning in feedback
control. Machine Learning, 27(1):55–74. Available online at
http://dx.doi.org/10.1007/s10994-011-5235-x or upon request at
riedm...@informatik.uni-freiburg.de, Springer Netherlands, 2011.

You will find results on our neural RL approach for continuous actions
(NFQCA), as well as comparisons to classical controller designs.

We will shortly also make the plants available in our software framework
CLSquare (or earlier, upon request).

Best,

Martin Riedmiller
MLL University of Freiburg, Germany

Ari

Dec 20, 2011, 9:20:17 PM
to Reinforcement Learning Mailing List
Also worth mentioning are the domains used in the paper "Binary Action
Search for Learning Continuous-Action Control Policies" by Jason Pazis
and Michail G. Lagoudakis.

As has been mentioned by others here, one approach in that paper is to
utilize a common domain, but to use a different reward structure
(which is what they do with the inverted pendulum).

- Ari

Lei Wu

Dec 20, 2011, 10:11:14 PM
to rl-...@googlegroups.com
I'm doing experiments on the rotary single inverted pendulum with reinforcement learning methods:
CACLA
wire fitting
Ex<a>

These methods are meant for learning in continuous spaces, but I have not finished the experiments yet.
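
For reference, a minimal sketch of the CACLA update from van Hasselt & Wiering (2007) for a scalar action with linear function approximation; the feature vector phi and the step sizes are assumptions, not details from this thread:

import numpy as np

def cacla_update(theta_v, theta_a, phi_s, phi_s_next, reward, action,
                 gamma=0.99, alpha_v=0.01, alpha_a=0.01):
    """One CACLA step with a linear critic V(s) = theta_v . phi(s) and a
    linear actor A(s) = theta_a . phi(s); `action` is the exploratory
    continuous action that was actually executed."""
    # TD error of the state-value critic.
    delta = reward + gamma * (theta_v @ phi_s_next) - theta_v @ phi_s
    # Critic: ordinary TD(0) update.
    theta_v = theta_v + alpha_v * delta * phi_s
    # Actor: move toward the executed action only if it turned out
    # better than expected (positive TD error).
    if delta > 0:
        theta_a = theta_a + alpha_a * (action - theta_a @ phi_s) * phi_s
    return theta_v, theta_a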

Arun Chaganty

Dec 21, 2011, 7:41:07 AM
to rl-...@googlegroups.com
Hello,
I'm not sure why this hasn't been mentioned, but there are several
continuous domains at the RL competition as well:

* Acrobot - continuous state space; discrete actions
* Helicopter - 12 dimensional continuous state space, 4 dimensional
continuous action space
* Octopus - 82 dimensional continuous state space, 32 dimensional
continuous action space

An implementation using RL-Glue, with suitable visualisations, etc.,
can be found here: http://2009.rl-competition.org/software.php

Cheers,
--
Arun Tejasvi Chaganty
http://arun.chagantys.org/

Hado van Hasselt

Dec 21, 2011, 8:10:28 AM
to rl-...@googlegroups.com
Hi,

The double-pole cart pole [1] is a good benchmark, similar to the
normal cart pole. An advantage is that it has been used quite a lot,
so it is easy to compare any result to previous algorithms.

The Helicopter domain from the RL competition [2,3], as mentioned by
Arun Chaganty just now, is also a good benchmark. However, please note
that the reward should be changed! In the competition, a reward
(a penalty, actually) was issued when the helicopter crashed, but this
penalty depended on the number of time steps that had passed, which
makes the reward non-Markovian. This may be why we only saw
evolutionary approaches (which operate on whole episodes) and no
temporal-difference algorithms (or other RL techniques) on this domain
in the competition. It is easily fixed by changing the penalty on a
crash to a fixed amount.
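
For illustration, a minimal sketch of that fix, assuming a hypothetical simulator whose step() returns (state, reward, crashed); the wrapper name and the penalty value are free choices, not the competition's:

class FixedCrashPenalty:
    """Wrap an episodic helicopter simulator so that a crash yields a
    fixed penalty instead of one that depends on the number of elapsed
    time steps; the reward then depends only on the current transition."""

    def __init__(self, env, crash_penalty=-1000.0):
        self.env = env
        self.crash_penalty = crash_penalty

    def step(self, action):
        state, reward, crashed = self.env.step(action)
        if crashed:
            reward = self.crash_penalty
        return state, reward, crashed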

I have some (C++) code available for both these domains, if anyone is
interested. This code can be used in combination with the RL C++ code
(rlcpp) on my homepage [4].

Best,
Hado

[1] http://homepages.cwi.nl/~hasselt/papers/RL_in_Continuous_Spaces/Experiment_on_Double_Pole_C.html
[2] http://2009.rl-competition.org/
[3] http://code.google.com/p/rl-competition/
[4] http://homepages.cwi.nl/~hasselt/code.html

"José Antonio Martín H."

Dec 26, 2011, 6:49:43 AM
to rl-...@googlegroups.com
Hi Hado.

I've been able to learn successfully on the Helicopter problem using the
TD method Ex<a> without changing the reward function.

The learning is not of the same quality as the evolutionary approach
that I used in the RL competition, but it works well enough to keep the
helicopter safe.

The idea is to use a dedicated learning agent for every actuator and to
select the state variables for each agent carefully.

I have not tested it with your suggestion for modifying the reward
function, but I will (thanks for the tip!).
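
For illustration, a minimal sketch of that per-actuator decomposition; the learner interface (act/update) and the class name are assumptions, not something from this thread:

import numpy as np

class PerActuatorAgent:
    """One independent scalar-action learner per actuator, each seeing
    only a hand-picked subset of the state variables; every learner
    receives the same scalar reward."""

    def __init__(self, make_learner, state_indices_per_actuator):
        self.state_idx = [np.asarray(idx) for idx in state_indices_per_actuator]
        self.learners = [make_learner() for _ in self.state_idx]

    def act(self, state):
        state = np.asarray(state)
        return np.array([lrn.act(state[idx])
                         for lrn, idx in zip(self.learners, self.state_idx)])

    def update(self, state, action, reward, next_state):
        state, next_state = np.asarray(state), np.asarray(next_state)
        for k, (lrn, idx) in enumerate(zip(self.learners, self.state_idx)):
            lrn.update(state[idx], action[k], reward, next_state[idx])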

Best,
Jose.

--
José Antonio Martín H., Ph.D.
Computer Science Faculty, Complutense University of Madrid
C/ Prof. José García Santesmases, s/n, 28040 Madrid, Spain
E-mail: jama...@fdi.ucm.es | Phone: (+34) 91 3947650 | Fax: (+34) 91 3947527
Web: http://www.fdi.ucm.es/profesor/jamartinh/
"Order is the truly nonrenewable resource."

Shimon Whiteson

Dec 27, 2011, 1:01:14 PM
to rl-...@googlegroups.com
We built a Python implementation of the two Helicopter domains used in the RL competitions, as well as an even harder one we designed for our own experiments. The code is available here:

http://staff.science.uva.nl/~whiteson/helicopter.zip

The following article describes the neuroevolutionary methods we applied to these problems:

http://staff.science.uva.nl/~whiteson/pubs/b2hd-koppejanei11.html

As policy search methods, they do not rely on the Markov property and thus do not require the alteration to the reward function that Hado describes.

Cheers,
Shimon

-------------------------------------------------------------
Shimon Whiteson | Assistant Professor
Intelligent Autonomous Systems Group
Informatics Institute | University of Amsterdam
-------------------------------------------------------------
Science Park 904 | 1098 XH Amsterdam
+31 (0)20.525.8701 | +31 (0)6.3851.0110
http://staff.science.uva.nl/~whiteson
-------------------------------------------------------------
