Question on OffPolicy RLAgents (RLPark)


Sam

Dec 9, 2013, 2:32:20 PM
to github...@googlegroups.com
Hi Devs,

I would like to ask a question about off-policy RL agents (e.g., OffPolicyAgentDirect).

In the "Runner" class, I've observed that when

runnerEvent.step.time == maxEpisodeTimeSteps

the step is forced to end with "forceEndEpisode=true". When this event occurs (which happens often for off-policy learning agents with a random behavior policy), "OffPolicyAgentDirect" sets "x_tp1=null". This means that the learning agent experiences (x_t, a_t, x_tp1=null, r_tp1, z_tp1), so this event is also treated as a terminal event.
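To make the consequence concrete, here is a minimal sketch (not RLPark code; the class, method, and variable names are only illustrative) of how a TD-style learner typically forms its target. With x_tp1=null the bootstrap term disappears, which is only correct at a true terminal state of the base problem:

// Hypothetical illustration, not part of RLPark.
final class TdTargetSketch {
  // Returns the TD target for a linear value estimate theta^T x.
  static double tdTarget(double r_tp1, double gamma, double[] theta, double[] x_tp1) {
    if (x_tp1 == null) // treated as terminal: no bootstrapping
      return r_tp1;
    double v_tp1 = 0;
    for (int i = 0; i < x_tp1.length; i++)
      v_tp1 += theta[i] * x_tp1[i];
    return r_tp1 + gamma * v_tp1; // non-terminal: bootstrap on the next state
  }

  public static void main(String[] args) {
    double[] theta = { 0.5, -0.2 };
    double[] x_tp1 = { 1.0, 2.0 };
    System.out.println(tdTarget(-1.0, 0.99, theta, x_tp1)); // episode merely cut off: about -0.901
    System.out.println(tdTarget(-1.0, 0.99, theta, null));  // true terminal state: -1.0
  }
}

When the episode is only cut off at maxEpisodeTimeSteps, the non-terminal branch is the one that matches the true return.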

Treating the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event causes learning problems for the GQ/GreedyGQ algorithms. I ran "OffPolicyControlTests::greedyGQOnMountainCarTest()" with a new random number generator, "Random random = new Random(System.currentTimeMillis());", at line 34 of "MountainCarOffPolicyLearning::evaluate()". This change causes the behavior policy to be initialized with a different seed on every run.
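For reference, the modification at line 34 is just the following line (a fragment, assuming the surrounding code in evaluate() constructs the behavior policy from this Random instance):

// Presumably replaces a fixed-seed generator in evaluate(), so every run
// of the test uses a different seed for the behavior policy.
Random random = new Random(System.currentTimeMillis());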

When I run the test case, the assertion fails most of the time.

===
Exception in thread "main" java.lang.AssertionError: 
at org.junit.Assert.fail(Assert.java:91)
5000
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at rlpark.plugin.rltoys.junit.experiments.offpolicy.OffPolicyControlTests.greedyGQOnMountainCarTest(OffPolicyControlTests.java:65)
at rlpark.plugin.rltoys.junit.experiments.offpolicy.OffPolicyControlTests.main(OffPolicyControlTests.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
===

I changed the step method in Runner as follows:

public void step() {
  assert nbEpisode < 0 || runnerEvent.nbEpisodeDone < nbEpisode;
  if (runnerEvent.step == null || runnerEvent.step.isEpisodeEnding()) {
    runnerEvent.step = problem.initialize();
    runnerEvent.episodeReward = 0;
    agentAction = null;
    assert runnerEvent.step.isEpisodeStarting();
  } else {
    runnerEvent.step = problem.step(agentAction);
    // =========== Removed: the episode was forced to end before the agent saw the step
    // if (runnerEvent.step.time == maxEpisodeTimeSteps)
    //   runnerEvent.step = problem.forceEndEpisode();
  }
  agentAction = agent.getAtp1(runnerEvent.step);
  // =========== Added: the episode is forced to end only after the agent has seen the step
  if (runnerEvent.step.time == maxEpisodeTimeSteps)
    runnerEvent.step = problem.forceEndEpisode();
  // ===========
  runnerEvent.episodeReward += runnerEvent.step.r_tp1;
  runnerEvent.nbTotalTimeSteps++;
  onTimeStep.fire(runnerEvent);
  if (runnerEvent.step.isEpisodeEnding()) {
    runnerEvent.nbEpisodeDone += 1;
    onEpisodeEnd.fire(runnerEvent);
  }
}


With this change, the agent receives the step at time maxEpisodeTimeSteps as an ordinary, non-terminal observation, and the last experience of the episode is (x_t, a_t, x_tp1, r_tp1, z_tp1) with a non-null x_tp1.

This modification makes the GQ/GreedyGQ algorithm succeed 100% of the time.

Treating the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event and setting "x_tp1=null" also causes high variance in other off-policy algorithms (e.g., Off-PAC) when the random behavior policy is run with different seeds.

I have also observed similar behaviors in RLLib. 

Please let me know: is it necessary to treat the event "runnerEvent.step.time == maxEpisodeTimeSteps" as a terminal event and set the state to "x_tp1=null"?

This behavior is not a problem for on-policy problems. 

Thank you! Sam

Thomas Degris

Dec 9, 2013, 4:32:59 PM
to github...@googlegroups.com
Hello Sam,

I think you raised an interesting point! However, I don't think the solution you propose is correct: for a problem in which the reward occurs only at the terminal state, the agent would never see that reward with the code you propose.

It is possible that the assertion fails only because GreedyGQ learns something acceptable but has a performance slightly worse than the threshold in the unit test.

Thomas


Saminda Abeyruwan

Dec 9, 2013, 5:04:05 PM
to github...@googlegroups.com
Hi Thomas,

Thank you for the swift reply. 

The test case fails if the evaluation is more than 270 for the GreedyGQ/GQ learning agents (off-policy). When I use a random number generator with different seeds, e.g., at line 34 of MountainCarOffPolicyLearning::evaluate(), the evaluation is 5000 most of the time. This is the maximum number of steps allowed for the MountainCar problem, which means that the off-policy learning agent (GreedyGQ) did not learn within the allocated episodes.

I made the following change to the "step" method in "Runner".

Current:
If the behavior policy hits a terminal state (the base problem's terminal state), the agent will see it; the experience is (x_t, a_t, x_tp1=null, r_tp1, z_tp1).

When the "runnerEvent.step.time == maxEpisodeTimeSteps" event occurs, i.e., when the behavior policy did not reach the base problem's terminal state, the agent is forced by the "Runner::step" method to experience (x_t, a_t, x_tp1=null, r_tp1), without any additional reward or punishment. This cut-off is imposed by the programmer, not by the problem.

Changed:
I have changed the semantics of the event "runnerEvent.step.time == maxEpisodeTimeSteps" so that the agent experiences (x_t, a_t, x_tp1, r_tp1), because this event does not correspond to the base problem's terminal state. With this change, the GreedyGQ agent learns 100% of the time no matter which seed the random behavior policy uses, and Off-PAC shows significantly less variance.

The other option would be that, when the event "runnerEvent.step.time == maxEpisodeTimeSteps" occurs, the agent is additionally provided with a reward or punishment of the form r_tp1 and z_tp1 to indicate that a forced termination has occurred.
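A minimal sketch of that alternative (not the RLPark API; the class and field names are hypothetical) would carry an explicit flag so that a learner can distinguish a real terminal state from a time-limit cut-off:

// Hypothetical transition record; not part of RLPark.
final class Transition {
  final double[] x_t;       // state features at time t
  final int a_t;            // action taken at time t
  final double[] x_tp1;     // state features at time t+1 (null only at a true terminal state)
  final double r_tp1;       // reward received at time t+1
  final boolean terminal;   // the base problem reached its own terminal state
  final boolean truncated;  // the episode was cut off at maxEpisodeTimeSteps

  Transition(double[] x_t, int a_t, double[] x_tp1, double r_tp1,
             boolean terminal, boolean truncated) {
    this.x_t = x_t;
    this.a_t = a_t;
    this.x_tp1 = x_tp1;
    this.r_tp1 = r_tp1;
    this.terminal = terminal;
    this.truncated = truncated;
  }

  // Only a true terminal state should stop bootstrapping; a time-limit
  // cut-off should not be treated as the end of the underlying process.
  boolean shouldBootstrap() {
    return !terminal;
  }
}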

Please let me know the intended semantics: why is the agent forced to experience the same reward and termination signal, (x_t, a_t, x_tp1=null, r_tp1, z_tp1), as at the base problem's terminal state when the event "runnerEvent.step.time == maxEpisodeTimeSteps" occurs?


Thank you! Sam

Thomas Degris

Dec 9, 2013, 5:08:57 PM
to github...@googlegroups.com
Hi Sam,

Would you mind doing a pull request on GitHub? It would be easier for me to check out your code.

Thank you,

Thomas

Saminda Abeyruwan

Dec 9, 2013, 9:50:18 PM
to github...@googlegroups.com
Hi Thomas,

Attached herewith is rlpark.plugin.rltoys.experiments.testing.control.MountainCarOffPolicyLearning.java, which creates a behavior policy with a random seed.

For example, if you set the seed to seed=1386642900299 or seed=1386642937928 (among others), the OffPolicyControlTests::greedyGQOnMountainCarTest() test evaluates to 5000 steps, which means that the agent did not learn.

If you set seed=0, the test case evaluates to 267 steps. I think this is a lucky run.

Please see lines 35-37.

If you replace the RLPark Runner.java class from GitHub with the attached Runner.java, the off-policy algorithm (GreedyGQ) passes 100% of the time (steps < 270).

I originally observed this behavior because the C++ random number generator (rand(), used in RLLib) produces a completely different sequence of random numbers when started with seed=0 than the Java random number generator does when started with seed=0.

Please see the step method of the attached rlpark.plugin.rltoys.experiments.helpers.Runner.java. The proposed changes could be made cleaner from an OO point of view.

Thank you!

Sam





MountainCarOffPolicyLearning.java
Runner.java