I would like to ask a question about off-policy RL agents (e.g., OffPolicyAgentDirect).
In the "Runner" class; I've observed that when
runnerEvent.step.time == maxEpisodeTimeSteps
the "step" is forced with "forceEndEpisode=true". Therefore, when this event occurs (which is most often in off-policy learning agents with random behavior policy) "OffPolicyAgentDirect" sets "x_tp1=null". This means that the learning agent experiences (x_t, a_t, 0, r_tp1, z_tp1). Therefore, this event also treated as a terminal event.
Treating the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event causes learning problems for the GQ/GreedyGQ algorithms. I ran "OffPolicyControlTests::greedyGQOnMountainCarTest()" with a new random number generator, "Random random = new Random(System.currentTimeMillis());", at line 34 of "MountainCarOffPolicyLearning::evaluate()". This change initializes the behavior policy with a different random seed on every run.
When I run the test case, the assertion fails most of the time:
===
Exception in thread "main" java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:91)
5000
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at rlpark.plugin.rltoys.junit.experiments.offpolicy.OffPolicyControlTests.greedyGQOnMountainCarTest(OffPolicyControlTests.java:65)
at rlpark.plugin.rltoys.junit.experiments.offpolicy.OffPolicyControlTests.main(OffPolicyControlTests.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
===
I changed the step() method in Runner so that the episode end is forced only after the agent has observed the last step:
public void step() {
  assert nbEpisode < 0 || runnerEvent.nbEpisodeDone < nbEpisode;
  if (runnerEvent.step == null || runnerEvent.step.isEpisodeEnding()) {
    runnerEvent.step = problem.initialize();
    runnerEvent.episodeReward = 0;
    agentAction = null;
    assert runnerEvent.step.isEpisodeStarting();
  } else {
    runnerEvent.step = problem.step(agentAction);
    // =========== Removed: this ended the episode before the agent saw the step
    // if (runnerEvent.step.time == maxEpisodeTimeSteps)
    //   runnerEvent.step = problem.forceEndEpisode();
  }
  agentAction = agent.getAtp1(runnerEvent.step);
  // =========== Added: force the episode end only after the agent has
  // observed the last step, so x_tp1 is the real last observation
  if (runnerEvent.step.time == maxEpisodeTimeSteps)
    runnerEvent.step = problem.forceEndEpisode();
  // ===========
  runnerEvent.episodeReward += runnerEvent.step.r_tp1;
  runnerEvent.nbTotalTimeSteps++;
  onTimeStep.fire(runnerEvent);
  if (runnerEvent.step.isEpisodeEnding()) {
    runnerEvent.nbEpisodeDone += 1;
    onEpisodeEnd.fire(runnerEvent);
  }
}
With this change, the last experience is (x_t, a_t, x_tp1, r_tp1, z_tp1), where x_tp1 is the actual last observation instead of null. This modification makes the GQ/GreedyGQ test succeed 100% of the time.
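As a numeric illustration of why this matters (my own toy example with made-up values, not taken from the library): with Mountain Car style rewards of -1 per step, the backup target at the cutoff state differs sharply depending on whether the learner bootstraps from x_tp1:

// Toy numeric illustration (my own example, assumed q value): the backup
// target at the time-limit step, with and without bootstrapping from x_tp1.
final class TruncationTargetDemo {
  public static void main(String[] args) {
    double gamma = 0.99;
    double r_tp1 = -1.0;    // Mountain Car style cost per step
    double qAtXtp1 = -60.0; // assumed cost-to-go at the cutoff state
    double terminalTarget = r_tp1;                    // x_tp1 = null
    double bootstrapTarget = r_tp1 + gamma * qAtXtp1; // real x_tp1
    System.out.println("cutoff treated as terminal: " + terminalTarget);  // -1.0
    System.out.println("learner bootstraps:         " + bootstrapTarget); // -60.4
  }
}

Which states happen to be cut off depends on the random behavior policy, so with time-based seeds this error lands in different places on every run.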
Treating the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event and setting "x_tp1=null" also causes high variance across behavior-policy seeds in other off-policy algorithms (e.g., Off-PAC).
I have observed similar behavior in RLLib.
Please let me know: is it necessary to treat the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event and to set the state to "x_tp1=null"?
This behavior is not a problem in on-policy settings.
Thank you! Sam