Is epsilon-greedy exploration GLIE?


Ermo Wei

Jul 30, 2014, 2:54:33 PM7/30/14
to rl-...@googlegroups.com
Hi, all, I have a simple question to ask.

I was wondering whether epsilon-greedy is a GLIE exploration method. It seems it is GLIE only if epsilon decreases with the number of times each state has been visited. But I often see people using a fixed small epsilon value. I guess this fixed-epsilon greedy is not GLIE? And will it still converge to optimality?
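(For concreteness, the textbook GLIE construction is to decay epsilon with the per-state visit count, e.g. epsilon_t(s) = 1/N(s). A minimal sketch, with illustrative names not taken from any library:)

```python
# Sketch of a GLIE epsilon schedule: epsilon decays with the visit
# count N(s) of each state, so exploration vanishes in the limit
# while every action is still tried infinitely often in each state.
from collections import defaultdict
import random

visits = defaultdict(int)  # N(s): times state s has been visited


def epsilon(state):
    """epsilon_t(s) = 1 / N(s): goes to 0 as the state is revisited."""
    return 1.0 / max(1, visits[state])


def act(state, q_values, actions):
    """Epsilon-greedy action selection with the decaying schedule."""
    visits[state] += 1
    if random.random() < epsilon(state):
        return random.choice(actions)  # explore
    # exploit: pick the action with the highest estimated value
    return max(actions, key=lambda a: q_values[(state, a)])
```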


Thanks for the response.

Littman, Michael

Jul 30, 2014, 3:32:16 PM7/30/14
to rl-...@googlegroups.com
You are right, fixed epsilon is not GLIE and there are examples where it won't converge to the solution of the Bellman equation. I believe it converges to a ball around the solution, however.


--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en

Ermo Wei

Aug 4, 2014, 1:15:16 PM8/4/14
to rl-...@googlegroups.com, mlit...@cs.brown.edu
Thanks, Dr. Littman. Would you be able to point me to the examples where fixed epsilon-greedy won't converge to the solution of the Bellman equation?

Littman, Michael

Aug 4, 2014, 1:37:47 PM8/4/14
to Ermo Wei, rl-...@googlegroups.com
Ok, sorry, to be more precise: 
 - Q-learning with epsilon-greedy exploration will converge to the solution to the Bellman equation.
 - However, its *behavior* will not converge because it will continue to choose random actions with high frequency.

The purpose of the concept of GLIE is to talk about convergence in the behavior. (What's the point of the Q function converging to the solution if the agent doesn't actually behave that way?)
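(For illustration — not from the thread — here is a minimal sketch of this point on a one-state MDP, i.e. a two-armed bandit; the constants and reward values are made up. With a fixed epsilon, Q-learning's value estimates still reach the Bellman solution, but the behaviour keeps choosing the worse arm roughly epsilon/2 of the time forever:)

```python
# Fixed-epsilon Q-learning on a two-armed bandit: Q converges,
# behaviour does not (it keeps exploring at a constant rate).
import random

random.seed(0)
EPSILON, ALPHA = 0.1, 0.1
rewards = [1.0, 0.0]  # action 0 is optimal
Q = [0.0, 0.0]

bad_picks = 0
STEPS = 20_000
for t in range(STEPS):
    if random.random() < EPSILON:
        a = random.randrange(2)        # explore
    else:
        a = 0 if Q[0] >= Q[1] else 1   # exploit (greedy)
    bad_picks += (a == 1)
    # Q-learning update; the episode ends immediately, so there is
    # no bootstrapped next-state term in the target.
    Q[a] += ALPHA * (rewards[a] - Q[a])

print(Q)                   # ~[1.0, 0.0]: the Bellman solution
print(bad_picks / STEPS)   # ~0.05: the bad arm is still pulled forever
```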

Hado van Hasselt

Aug 4, 2014, 2:11:57 PM8/4/14
to rl-...@googlegroups.com, Ermo Wei
Small addendum: if instead of Q-learning you are using an on-policy algorithm such as (Expected) Sarsa, you additionally need a GLIE policy if you want the action-value estimates to converge to the optimal values.  If epsilon does not go to zero, the action-value estimates will still converge (under some assumptions, e.g., on the step size), but then to the values of the actions under the (exploring) behaviour policy.
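(A sketch of the backup Hado is describing, with illustrative names: the Expected Sarsa target averages next-state values under the behaviour policy itself, which is why, with a fixed epsilon, the estimates converge to the values of that exploring epsilon-greedy policy rather than to the optimal values.)

```python
# Expected Sarsa target under a fixed epsilon-greedy policy.
def epsilon_greedy_probs(q_next, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_next."""
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    probs = [epsilon / n] * n      # epsilon mass spread uniformly
    probs[greedy] += 1.0 - epsilon  # remaining mass on the greedy action
    return probs


def expected_sarsa_target(reward, q_next, gamma, epsilon):
    """r + gamma * E_{a ~ pi}[Q(s', a)], with pi = epsilon-greedy."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, q_next))
```

(Replacing the expectation with `max(q_next)` would recover the off-policy Q-learning target, which is why Q-learning does not need GLIE for value convergence.)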

And, to elaborate on Michael’s answer: it is perfectly fine to use a constant epsilon for a long time with Q-learning, until the values converge, and then switch to no exploration at once (epsilon=0).  This is what is sometimes done in practice when people use a constant epsilon: after “learning” the exploration is turned off for “testing”.

- Hado