Have you tried setting the temperature lower? I thought that a
properly tuned softmax should almost always be able to outperform an
epsilon-greedy approach.
Do you know in which range the action values approximately fall?
On a practical note, if you use fairly low temperatures and the action
values are not very small, numerical problems may occur. For instance,
suppose tau = 0.1 and Q(s,a) = 10. Then
e^(Q(s,a)/tau) = e^100 > 10^43 .
This can be prevented by scaling all the action values down, for
instance by using
Q'(s,a) = Q(s,a) - max_a Q(s,a) .
Then, e^(Q'(s,a)/tau) is at most e^0 = 1.
Note that this scaling does not affect the analytic probabilities of
the resulting policy.
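The max-subtraction trick can be sketched as follows (a minimal illustration; the function and parameter names here are my own, not from any particular library):

```python
import numpy as np

def softmax_action(q_values, tau=0.1, rng=None):
    """Sample an action from a Boltzmann (softmax) policy over Q-values.

    Subtracting max_a Q(s,a) before exponentiating prevents overflow
    for large Q/tau without changing the resulting probabilities.
    """
    rng = rng or np.random.default_rng()
    prefs = (np.asarray(q_values) - np.max(q_values)) / tau  # largest entry becomes 0
    probs = np.exp(prefs)                                    # every value in (0, 1]
    probs /= probs.sum()
    return rng.choice(len(q_values), p=probs)
```

With the shifted preferences, e^(Q'(s,a)/tau) stays in (0, 1] no matter how large the raw action values are.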
There's a slightly longer discussion about this in Section 2.4.6 of my
dissertation:
http://homepages.cwi.nl/~hasselt/insights.html
Best,
Hado
> --
> You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en
From my experience, simple Boltzmann exploration is not a preferable
exploration strategy. One of the main problems is choosing the
temperature parameter properly: if the temperature is too low, the
exploration is too greedy; if the temperature is too high, the exploration
is too random.
Things get better if a temperature is calculated for each state, depending
on how often its actions have been tried (few tries -> high temperature,
many tries -> low temperature).
(In Q-learning this counter is usually not available,
but as it has the same index range as the Q-function it should be easy
to add to the Q-learning implementation.)
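A minimal sketch of such a count-based local temperature (the particular schedule and all names here are illustrative choices of mine, not from the post):

```python
import numpy as np

def local_tau(visit_count, tau_max=5.0, tau_min=0.05, decay=0.1):
    """Illustrative schedule: high temperature for rarely visited states,
    decaying towards tau_min as the visit count grows."""
    return max(tau_min, tau_max / (1.0 + decay * visit_count))

def boltzmann_local(q_row, visits, rng=None):
    """Boltzmann exploration with a per-state temperature derived from
    a visit counter with the same index range as the Q-table."""
    rng = rng or np.random.default_rng()
    tau = local_tau(visits)
    prefs = (np.asarray(q_row) - np.max(q_row)) / tau  # max-subtraction for stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return rng.choice(len(q_row), p=probs)
```

Here `visits` would be read from an extra counter array updated alongside the Q-table.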
My experience is that this "Boltzmann exploration with local temperature"
is much better than simple Boltzmann exploration, but not necessarily
superior to epsilon-greedy.
Many exploration schemes exist that can be better than epsilon-greedy,
depending on the problem at hand. For your problem, I would suggest
starting with epsilon-greedy or trying R-max.
Hope this helps,
Steffen
--
Dr. Steffen Udluft
Siemens AG
Corporate Research and Technologies
Intelligent Systems & Control
Are your 81 actions related in some way?
If they are, for instance, ordered or partially ordered (such as motor
power or money), then you can use very efficient exploration techniques.
Two interesting ideas to try:
1.) If you can in some sense order your actions, then you can use a kind
of *randomized binary search*.
2.) If not, then you can use the logic of an evolutionary algorithm to
search for your best actions, selecting a *reduced subset of optimal
actions* and improving the *quality of the set* (e.g. average fitness,
i.e. Q-value) over iterations.
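One way to read idea 1, assuming the Q-values over the ordered action set are unimodal in the action index (that assumption, the probe rule, and all names below are mine, not the poster's):

```python
import numpy as np

def randomized_binary_search(q_row, n_probes=10, rng=None):
    """Sketch of a randomized binary search over an ordered action set.

    If Q(s, .) is unimodal in the action index, comparing two probe
    points lets us discard the part of the interval that cannot
    contain the maximum, so far fewer than all 81 actions are compared.
    """
    rng = rng or np.random.default_rng()
    lo, hi = 0, len(q_row) - 1
    for _ in range(n_probes):
        if hi - lo < 2:
            break
        m1 = int(rng.integers(lo, hi))          # probe in [lo, hi-1]
        m2 = int(rng.integers(m1 + 1, hi + 1))  # probe in [m1+1, hi]
        if q_row[m1] >= q_row[m2]:
            hi = m2 - 1                         # maximum lies left of m2
        else:
            lo = m1 + 1                         # maximum lies right of m1
    return lo + int(np.argmax(q_row[lo:hi + 1]))
```

For non-unimodal Q-values this degrades to a heuristic, which is why it is only a starting point.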
Bests,
Jose.
--
======================================================================
José Antonio Martín H. PhD.          E-Mail: jama...@fdi.ucm.es
Computer Science Faculty             Phone: (+34) 91 3947650
Complutense University of Madrid     Fax: (+34) 91 3947527
C/ Prof. José García Santesmases, s/n  http://www.dacya.ucm.es/jam/
28040 Madrid, Spain
"El orden es el recurso no renovable más importante"
(Order is the truly nonrenewable resource.)
======================================================================
On 28/06/2011 13:14, francisco.m...@uv.es wrote:
I'd suggest you try R-Max or MBIE-EB. Both are fairly straightforward to
implement, and both have just one parameter whose influence is more
obvious than the temperature of Boltzmann exploration.
R-Max [1] takes one parameter C, the number of times a
state-action pair (s,a) must have been observed before its actual Q-value
estimate is used in the Bellman iteration. If it has been observed fewer
times, its value is taken to be Q(s,a) = R_max / (1 - \gamma), which is the
maximum possible Q-value (R_max is the maximum possible reward).
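The value substitution can be sketched as a single tabular Bellman backup, assuming a learned reward table R, transition model P, and visit counters n (all names and array shapes are illustrative):

```python
import numpy as np

def rmax_backup(Q, R, P, n, C, r_max, gamma):
    """One synchronous Bellman backup with the R-Max substitution.

    Q, R, n: arrays of shape (S, A); P: shape (S, A, S').
    State-action pairs observed fewer than C times keep the optimistic
    value r_max / (1 - gamma); the rest use the estimated model.
    """
    S, A = Q.shape
    optimistic = r_max / (1.0 - gamma)
    V = Q.max(axis=1)                      # greedy state values
    Q_new = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            if n[s, a] < C:
                Q_new[s, a] = optimistic   # not yet "known": stay optimistic
            else:
                Q_new[s, a] = R[s, a] + gamma * P[s, a] @ V
    return Q_new
```

Iterating this backup to convergence and acting greedily gives the R-Max behaviour: under-explored pairs look maximally attractive until they have been tried C times.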
Model-based interval estimation with exploration bonus (MBIE-EB) [2] is a
variant of MBIE that -- as the name suggests -- adds an exploration bonus
to the Bellman update equation:
Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a') + \beta / \sqrt{n_{s,a}}
\beta is the parameter weighting the bonus and n_{s,a} the number of times
state-action pair (s,a) has been observed. Obviously, the more often a
state-action pair has been observed, the lower its exploration bonus gets.
Both methods are designed to work within a standard dynamic programming
approach. I'm not sure how they'd work in combination with Q-learning. You
would have to add an array with n_{s,a} counters. In the case of R-Max the
"target" Q for the TD error calculation would be R_max / (1 - \gamma);
in the case of MBIE-EB one would use R(s,a,s') + \gamma \max_{a'} Q(s',a') +
\beta / \sqrt{n_{s,a}}. In both cases the "true" Q-values would change, as
at some point the switch to the real Q-values is made (R-Max) or the
exploration bonus becomes smaller (MBIE-EB). I don't see whether that would
hurt any convergence properties, but I'd expect it not to be a problem, as
in both cases the influence of the exploration bonus vanishes over time.
So I'd definitely give it a shot!
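The MBIE-EB-flavoured Q-learning step suggested above could look like this (a sketch of the thread's adaptation, not the algorithm from [2]; names are mine):

```python
import numpy as np

def mbie_eb_q_update(Q, n, s, a, r, s_next, alpha, gamma, beta):
    """One Q-learning step with an MBIE-EB style exploration bonus
    beta / sqrt(n_{s,a}) added to the TD target.

    Q: (S, A) value table; n: (S, A) visit counters.
    """
    n[s, a] += 1
    bonus = beta / np.sqrt(n[s, a])                 # shrinks as (s,a) is revisited
    target = r + gamma * Q[s_next].max() + bonus    # bonus-augmented TD target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q, n
```

As the post notes, the bonus vanishes as n_{s,a} grows, so the update approaches plain Q-learning over time.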
Cheers,
Alex
[1] Brafman, R.I. and Tennenholtz, M. (2003). "R-max - a general polynomial
time algorithm for near-optimal reinforcement learning". Journal of
Machine Learning Research 3, pp. 213-231.
[2] Strehl, A.L. and Littman, M.L. (2008). "An analysis of
model-based Interval Estimation for Markov Decision Processes".
Journal of Computer and System Sciences 74(8), pp. 1309-1331.
http://dblp.uni-trier.de/db/journals/jcss/jcss74.html#StrehlL08
--
Best,
Hado
[1] Hado van Hasselt (2010). "Double Q-learning". Advances in Neural
Information Processing Systems 23 (NIPS 2010), Vancouver, British
Columbia, Canada, pp. 2613-2622.
books.nips.cc/papers/files/nips23/NIPS2010_0208.pdf