Default QLearning vs. QLearning with provided learning policy

elroy....@gmail.com

Nov 22, 2017, 3:10:28 AM
to BURLAP Discussion
With the constructor QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate), the QLearning object defaults to a 0.1 epsilon-greedy learning policy.

The overloaded constructor QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate, Policy learningPolicy, int maxEpisodeSize) allows a custom learning policy to be provided. The learning agents produced by these two options yield vastly different maximum time steps per episode.

My goal is to establish a baseline, since I am trying to use different policies with QLearning. Because EpsilonGreedy is the default under the first constructor and can also be passed explicitly to the second, I opted to try it first.
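
For reference, this is the wiring I assumed would be equivalent, based on the Javadoc: EpsilonGreedy takes a QProvider as its first argument (which QLearning implements), so the agent would be constructed first and the policy swapped in afterwards. This is only a sketch of my understanding, not code from the tutorial:

// sketch: build the agent, then point an EpsilonGreedy policy back at that same agent
QLearning agent = new QLearning(domain, 0.99, hashingFactory, 0., 1.);
agent.setLearningPolicy(new EpsilonGreedy(agent, 0.1));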


Here is what I actually ran. The maxTimeStep values produced by A are very different from those produced by B.
A.
// this constructor will by default set Q-learning to use a 0.1 epsilon greedy policy
LearningAgent agent = new QLearning(domain, 0.99, hashingFactory, 0., 1.);
for(int i = 0; i < 50; i++){
    Episode e = agent.runLearningEpisode(env);

    e.write(outputPath + "ql_" + i);
    System.out.println(i + ": " + e.maxTimeStep());

    env.resetEnvironment();
}

B.
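// a second QLearning instance, used below only as the QProvider for the EpsilonGreedy policy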
QLearning qL = new QLearning(domain, 0, hashingFactory, 0., 0, 0);
qL.setLearningPolicy(null);
qL.setLearningRateFunction(null);
qL.initializeForPlanning(0);
qL.setMaxQChangeForPlanningTerminaiton(0);
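
// the learning agent itself; its EpsilonGreedy policy reads Q-values from the qL instance above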
LearningAgent agent = new QLearning(domain, 0.99, hashingFactory, 0., 1., new EpsilonGreedy(qL, 0.1), Integer.MAX_VALUE);

for(int i = 0; i < 50; i++) {
    Episode e = agent.runLearningEpisode(env);

    e.write(outputPath + "ql_" + i);
    System.out.println(i + ": " + e.maxTimeStep());

    env.resetEnvironment();
}

Example A was taken from http://burlap.cs.brown.edu/tutorials/bpl/p4.html#qlearn

What is the correct way to manually provide a policy to QLearning so that B outputs maxTimeStep values similar to A's?
