In principle, the Q-learning algorithm itself can handle large state
spaces. However, in your case the discretized state-action space is so
large (> 20 million possible state-action combinations) that it is
likely that you will need far too many samples to learn anything
useful.
I would suggest that you do not discretize the state space. You will
then need to choose a suitable function approximation technique. In
reinforcement learning, a commonly used technique is to evenly place a
number of radial basis functions or triangular functions throughout
the state space. These functions can then be interpreted as (local)
features. The Q-value of each action can then be approximated by a
linear combination of these features. For more in-depth information,
you could check out the book by Busoniu et al. [1]. If you have
trouble specifying RBFs for dimensions with different resolutions,
does it help to first scale all the dimensions (e.g., to [0,1])?
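As a rough illustration (the grid size, RBF width, learning rate, and
function names below are all placeholders, not a recommendation), such a
linear RBF scheme could look like this in Python:

```python
import numpy as np

# Hypothetical setup: state scaled to [0, 1]^2, a 5x5 grid of Gaussian
# RBF centers, and one weight vector per discrete action.
n_per_dim = 5
centers = np.array([[i, j] for i in np.linspace(0, 1, n_per_dim)
                           for j in np.linspace(0, 1, n_per_dim)])
width = 1.0 / (n_per_dim - 1)          # RBF width ~ grid spacing
n_actions = 3
weights = np.zeros((n_actions, len(centers)))

def features(state):
    """Gaussian RBF activations for a state already scaled to [0, 1]^d."""
    sq_dist = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * width ** 2))

def q_value(state, action):
    """Q-value as a linear combination of the local RBF features."""
    return weights[action] @ features(state)

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One gradient Q-learning step on the linear approximator."""
    phi = features(state)
    target = reward + gamma * max(q_value(next_state, a)
                                  for a in range(n_actions))
    td_error = target - weights[action] @ phi
    weights[action] += alpha * td_error * phi
```

With all dimensions scaled to [0,1] beforehand, a single width parameter
shared by all RBFs is often a reasonable starting point.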
Alternatively, you could use a non-linear function approximation
technique directly on the state variables. This comes with fewer
convergence guarantees than the former method, but it may work well in
practice. You can get some C++ code that implements Q-learning
(amongst other algorithms) with tables (for discrete problems) and
with neural networks (for continuous problems) from my homepage [2].
The neural-network implementation itself is also included.
I am a bit curious about your action space, by the way. If there are
three 'action variables' that can be increased or decreased, shouldn't
their current values be included in the state space? And do I
understand correctly that you cannot set the action variables
directly, but can only increase or decrease them?
[1] http://www.dcsc.tudelft.nl/rlbook/
[2] http://homepages.cwi.nl/~hasselt/code.html
Hope this helps!
Best,
Hado
CMAC also discretizes the state space, but it does so more efficiently
than a naive discretization. If you can generate enough samples, this
may work fine. Discretizing the state and action space as you propose
(for instance by using CMAC as the function approximator) is in any
case a good start and may yield good enough results, depending on your
problem.
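To make the idea concrete, here is a minimal tile-coding sketch in
Python (the number of tilings and tiles per dimension are arbitrary
illustrative choices, and the function name is made up):

```python
import numpy as np

def tile_indices(state, n_tilings=4, tiles_per_dim=8):
    """Active tile index in each tiling, for a state scaled to [0, 1]^d.

    Each tiling is offset by a fraction of a tile width, so that
    together the tilings discretize the space more finely than any
    single tiling does on its own.
    """
    state = np.asarray(state, dtype=float)
    indices = []
    for t in range(n_tilings):
        offset = t / n_tilings / tiles_per_dim   # shift by t/n of a tile
        coords = np.floor((state + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim)  # boundary bin allowed
        # flatten (tiling, coords) into a single feature index
        idx = t
        for c in coords:
            idx = idx * (tiles_per_dim + 1) + c
        indices.append(idx)
    return indices
```

A tabular Q-function over these indices then amounts to summing one
weight per active tile for each action.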
If you would like to look into this, you might get better results if
you choose an action directly. This would simplify your state space,
because dimensions S4 and S5 are no longer needed, but as you already
noted, it would complicate your action space. However, there are some
good algorithms for problems with continuous action spaces. I've
gotten good results myself with an actor-critic algorithm called Cacla
[1]. This algorithm trains a function approximator to output a
continuous action for each state. I've used it with neural networks
and with linear controllers, with promising results. Of course, your
mileage may vary. Other algorithms that can be used in such
continuous-action settings include other actor-critic methods, such as
the class of natural actor-critic algorithms [2,3].
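For reference, a rough sketch of the Cacla updates with a linear critic
and a linear actor (the function names, step sizes, and exploration
noise level here are illustrative, not the exact code from the paper):

```python
import numpy as np

# Sketch of Cacla with a linear critic V(s) = v_w . phi(s) and a linear
# actor Ac(s) = actor_w . phi(s); all names and constants illustrative.
def cacla_action(actor_w, phi_s, rng, sigma=0.3):
    """Actor output plus Gaussian exploration noise."""
    return actor_w @ phi_s + rng.normal(0.0, sigma)

def cacla_update(v_w, actor_w, phi_s, action, reward, phi_next,
                 alpha=0.1, beta=0.1, gamma=0.95):
    """Update after executing `action` and observing reward/next state."""
    td_error = reward + gamma * (v_w @ phi_next) - v_w @ phi_s
    v_w += alpha * td_error * phi_s      # critic: ordinary TD(0)
    if td_error > 0:
        # Cacla: only when the outcome was better than expected,
        # move the actor's output toward the action actually taken
        actor_w += beta * (action - actor_w @ phi_s) * phi_s
    return td_error
```

The key design choice is that the actor only learns from actions whose
TD error is positive, i.e., actions that turned out better than the
critic expected.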
I am currently writing a survey chapter on reinforcement-learning
methods in continuous spaces, which includes a discussion of methods
for problems with continuous actions. It also includes a short
discussion of CMAC (or tile coding, as it is also called). If you
want, I could send you a (draft) copy.
If you have any other questions, feel free to ask. Hopefully, we'll be
able to help :)
[1] Hado van Hasselt and Marco Wiering (2009). "Using Continuous
Action Spaces to Solve Discrete Problems". Proceedings of the
International Joint Conference on Neural Networks, IJCNN 2009,
Atlanta, GA, USA, 2009.
[2] Jan Peters and Stefan Schaal (2008). "Natural Actor-Critic".
Neurocomputing 71(7-9):1180-1190.
[3] Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh and
Mark Lee (2009). "Natural Actor-Critic Algorithms". Automatica
45(11):2471-2482.
Best,
Hado
Hi Shashank,
The Fourier Basis paper has an empirical comparison between the FB and
a few linear function approximators (RBFs, the Polynomial Basis, and
PVFs, a learned basis) on some standard benchmarks.
I view RBFs as an improvement over CMACs because they are not piecewise
constant. A linear combination of piecewise constant basis functions
(BFs) is itself piecewise constant. Consider an RL agent representing a
VF and using a forward model to look one step ahead to select actions.
Such an agent essentially follows an increase in the value function, so
if it is on a value plateau (where the VF is piecewise constant), it
will select actions randomly until it leaves that plateau. Having a
function approximator that is guaranteed to produce this kind of
plateau is undesirable.
This difficulty is usually dealt with by using a very high resolution
tiling (so that the plateaus are small), but that sacrifices
generalization (and therefore damages the agent's "startup"
performance), so in many cases you end up hacking together a
combination of high- and low-resolution tilings that does the job only
after much engineering.
Function approximation is a very well-studied topic in the applied
sciences, where it is classically accomplished using either a Fourier
Series approximation or a Wavelet approximation. The Fourier Series
approach has global support and is smooth, while the Wavelet approach
has local support and can be multiresolution.
RBFs are a fixed basis set similar to the Wavelet approach (Mahadevan
and Maggioni have Diffusion Wavelets as a learned basis set;
similarly, Mahadevan's PVFs are the learned generalization of the FB
over graphs).
With RBFs, the local support of each basis function is a win because it
lets you represent discontinuities and fine local features in your value
function, but you must choose their positions in state space and their
radii. You can also choose to mix and match different sized RBFs, if you
like. Smaller ones give you less generalization and more resolution,
larger ones more generalization and less resolution. In my experience,
RBFs pretty consistently outperform CMACs.
The Fourier Basis is based on the Fourier Series approach. Every BF is
defined everywhere, but differs in its frequency. Low-frequency BFs
contribute to smoother parts of the value function, and high-frequency
BFs contribute to "spikier" parts. The only parameter you have to set is
your maximum frequency - which essentially describes how spiky a value
function you think you might need.
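A sketch of what such a basis could look like in Python, for a state
already scaled to [0,1]^d (the function name is made up; `order` is the
maximum-frequency parameter mentioned above):

```python
import numpy as np
from itertools import product

def fourier_features(state, order=3):
    """Fourier basis features cos(pi * c . x) for x in [0, 1]^d.

    One feature per coefficient vector c in {0, ..., order}^d, so the
    basis has (order + 1)^d functions; c = 0 gives the constant feature.
    """
    state = np.asarray(state, dtype=float)
    coeffs = np.array(list(product(range(order + 1), repeat=state.size)))
    return np.cos(np.pi * coeffs @ state)
```

Coefficient vectors with small entries produce slowly varying features,
and vectors with large entries produce the high-frequency, "spiky" ones.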
We've found the FB to generally work very reliably, across a fairly
large set of diverse projects. I used it exclusively in all of the work
in my thesis, for example.
As an aside, to represent a general continuous value function you need
to have infinitely many BFs. Therefore, all fixed linear BF schemes will
have at least one parameter that determines how to limit the BF set.
Additionally, all fixed linear function approximation schemes face a
combinatorial explosion in the number of state variables. If you have a
large number of state variables, and the number of BFs becomes far too
high, then you need to do one of three things: 1) assume that each state
variable contributes independently to the VF, and hope for the best; 2)
use a learned basis, like PVFs or Diffusion Wavelets, that is extracted
from sample data and scales with the sample data, not the number of
state variables; or 3) use a feature selection method. It doesn't sound
like you need to do any of these, though.
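As a small illustration of option 1 (the function name and parameter
values are made up for this example): concatenating one small 1-D basis
per state variable, instead of taking their full tensor product, keeps
the feature count linear in the number of variables.

```python
import numpy as np

def independent_features(state, k=5):
    """One 1-D Gaussian RBF basis per state variable, concatenated.

    With d variables and k basis functions each, this yields d*k
    features instead of the k^d of a full tensor-product basis --
    at the cost of assuming each variable contributes independently.
    """
    state = np.asarray(state, dtype=float)   # components scaled to [0, 1]
    centers = np.linspace(0.0, 1.0, k)
    width = 1.0 / (k - 1)
    feats = [np.exp(-(x - centers) ** 2 / (2 * width ** 2)) for x in state]
    return np.concatenate(feats)

# e.g. 10 state variables, 5 RBFs each: 50 features rather than 5**10
```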
- Gdk.
--
George Konidaris
Postdoctoral Associate, MIT CSAIL
http://people.csail.mit.edu/gdk
You're bringing up some interesting discussion points that I would
like to comment on, based on my experience with tile coding (which is
closely related to CMACs) for approximating high-dimensional
Q-functions of robotic tasks.
On 06/19/2011 12:29 AM, George Konidaris wrote:
> Hi Shashank,
>
> The Fourier Basis paper has an empirical comparison between the FB and
> a few linear function approximators: RBFs, the Polynomial Basis, PVFs
> (a learned basis) on some standard benchmarks.
>
> I view RBFs as an improvement over CMACs because they are not
> piecewise constant.
In high dimensional spaces, radial basis functions tend to have a
disadvantage over piecewise constant BFs such as CMACs because of the
relatively small volume they influence. Assume an RBF whose value
decreases with increasing Euclidean distance to its center, such as a
Gaussian bell shape. Compare this to a constant BF with the shape of a
hypercube with edge length R in dimension D. The hypercube has volume
R^D -- the volume that the constant BF influences. Now consider the
hypersphere that precisely fits in this hypercube, with radius R/2.
Its volume is proportional to (R/2)^D, with a proportionality constant
that shrinks rapidly as D grows (nicely explained here, forgive me for
not having a scientific reference:
http://spacemath.gsfc.nasa.gov/weekly/6Page89.pdf). This means that in
high dimensions by far the largest volume of the hypercube is in its
'corners' and not in its inner sphere, where the RBF has its main
influence. RBFs therefore tend to have a local peak influence (even
after normalization). At any
point in space, the approximated function tends to be dominated by a
single BF. Therefore, when traveling through the state space, your value
function approximated with RBFs might be smooth, but you're still
'jumping' from feature to feature in a sense. With CMACs, however, the
number of BFs influencing a function value is fixed and all BFs have
equal influence. Moving around in state space will change your function
value in discrete but gradual steps.
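The vanishing sphere-to-cube volume ratio is easy to check numerically
(a quick sketch using the standard closed-form volume of a D-ball of
radius R/2; the R factors cancel in the ratio):

```python
import math

def sphere_to_cube_ratio(d):
    """Volume of the inscribed hypersphere (radius R/2) divided by the
    volume of the hypercube (edge R): pi^(d/2) / (2^d * Gamma(d/2 + 1))."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20):
    print(d, sphere_to_cube_ratio(d))
# the ratio is pi/4 for d=2, pi/6 for d=3, and falls below 1% by d=10
```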
> A linear combination of piecewise constant basis functions (BFs) is
> itself piecewise constant. Consider an RL agent representing a VF and
> using a forward model to look one step ahead to select actions. Such
> an agent essentially follows an increase in the value function, so if
> it is on a value plateau (where the VF is piecewise constant), it will
> select actions randomly until it leaves that plateau. Having a
> function approximator that is guaranteed to produce this kind of
> feature is undesirable.
> This difficulty is usually dealt with by using a very high resolution
> tiling (so that the plateaus are small), but that sacrifices
> generalization (and therefore damages the agent's "startup"
> performance), so in many cases you end up hacking together a
> combination of high- and low-resolution tiling that does the job, in
> the end, after much engineering.
The role of these value plateaus and their size is very interesting. If
the goal is to learn a good policy, and not the (action-)value function
per se, I agree that it is essential for the agent to leave the plateau
in one time step in order to make a good decision. This suggests that an
appropriate 'plateau size' depends on (at least) two things: the actions
that you are considering, and the length of the time step. With
discrete actions, it seems sufficient that different actions lead to
different 'plateaus', so that their values can be differentiated.
Similarly, larger time steps (assuming continuous system dynamics)
usually lead to larger jumps in state space, making the difference
between actions more noticeable; this suggests that increasing the time
step allows you to use larger features. Although I have used these ideas
while manually tuning my tile coding function approximators with
success, I unfortunately haven't been able to formalize these ideas
during my PhD, nor have I found papers that use these notions in, e.g.,
automatically adjusting the tiling resolution.
An additional comment: with tile coding (CMACs), generalization does
not need to be sacrificed for resolution. You can use large tiles and
many overlapping tilings to achieve both high resolution and wide
generalization.
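A tiny numerical illustration of this point (the tile width and number
of tilings are arbitrary): with tiles of width 0.5 but ten offset
tilings, the set of active tiles changes every 0.5/10 = 0.05 along a
dimension, so the effective resolution is fine even though each
individual tile generalizes widely.

```python
import numpy as np

def active_tiles(x, n_tilings=10, tile_width=0.5):
    """Index of the active tile in each offset tiling, for scalar x."""
    return tuple(int(np.floor((x + t * tile_width / n_tilings) / tile_width))
                 for t in range(n_tilings))
```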
Best regards,
Erik
Hi George,
I have done experiments in simulation myself, where I replaced the
binary features in tile coding with RBFs (zero-valued outside the
hyperrectangle, just like in tile coding) and came to the same
conclusion. In two dimensions, non-binary features would beat the
binary ones. But in higher-dimensional learning problems, performance
was (much) worse. Normalizing the BFs and using the infinity-norm
distance (instead of Euclidean distance) improved my results. I did
not do tests with RBFs that had non-zero values outside their
hyperrectangle, since this would be computationally prohibitive for my
target application.
Apart from the above empirical 'evidence', I would argue that more
'corner mass' means more generalization and thus faster learning. I
would not have a theoretical argument on which method can best
represent an arbitrary function (with the smallest integral of squared
errors, for example) after an infinite number of learning samples (not
that you would typically have infinitely many learning samples in
high-dimensional spaces...).
Best regards,
Erik
________________________________________
From: rl-...@googlegroups.com [rl-...@googlegroups.com] on behalf of Dietterich, Thomas G. [t...@cs.orst.edu]
Sent: Monday, June 20, 2011 9:20 PM
To: rl-...@googlegroups.com
Subject: RE: [rl-list] Question about Q-Learning in large state space