Understanding OffPACPuddleWorld


Sergio Valcarcel

Jul 30, 2013, 12:27:43 PM
to github...@googlegroups.com

Hi,


I have been able to follow the tutorials and set up the environment.

I copied OffPACPuddleWorld into my own project, was able to run it, exported it as a runnable JAR file, and successfully opened it in Zephyr.


For my current problem, I want to create two almost independent agents running in parallel, each with its own environment. I just want them to communicate at the end of each step to exchange their parameters.

I am only concerned with prediction, so I do not need an actor, only the two critics learning off-policy and exchanging their estimated parameters "w_{k,i}" (where "k" denotes the agent and "i" the time step).



First of all, there are two Runners in the example (learning and evaluation). I understand that one takes actions following the behavior policy and estimates the prediction parameters using GTD, while the other computes the optimal policy for those estimated parameters. Is that right?

How exactly does this map to the code (which variables do what)?



When I inspect the variables in Zephyr, both the critic and the actor are inside the Runners, but in the declaration they seem to be different variables from "criticAdapter". Could you please explain briefly how these variables interact?


For my problem, should I duplicate both runners for each agent and make them exchange the runner.critic.offPolicyTD.w vector?


Any help will be very much appreciated!

Thanks!!!

Sergio



Thomas Degris

Jul 30, 2013, 4:19:54 PM
to github...@googlegroups.com
Hello,

Actually, both runners use the exact same instance of OffPAC but do not use it in the same way:
- learning: uses getAtp1 of the class OffPolicyAgentDirect, which calls the learn method of the OffPAC instance, which updates the parameters of the critic and the actor (see http://people.bordeaux.inria.fr/degris/public/doxygen/html/_off_policy_agent_direct_8java_source.html)
- evaluation: uses getAtp1 of the class ControlAgent, which calls the proposeAction method of the (same) OffPAC instance, which uses the parameters of the actor to sample the next action (see http://people.bordeaux.inria.fr/degris/public/doxygen/html/_control_agent_8java_source.html)
Note that, as implemented in OffPACPuddleWorld::createOffPACAgent, OffPAC is composed of a critic and an actor. 
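
To make the structure concrete, here is a stripped-down illustration of the pattern. The names below are stand-ins, not the actual RLPark classes, and the learning updates themselves are left out:

// Stand-in illustration: one shared learner, two thin wrappers around it.
// LearningWrapper plays the role of OffPolicyAgentDirect (learn, then act),
// EvaluationWrapper plays the role of ControlAgent (act only, no learning).
class SharedLearner {
  double[] criticWeights = new double[10];
  double[] actorWeights = new double[10];

  int learn(double[] observation, double reward) {
    // the critic (GTD) update and the actor update would go here
    return proposeAction(observation);
  }

  int proposeAction(double[] observation) {
    return 0; // placeholder: sample an action from the policy defined by actorWeights
  }
}

class LearningWrapper {
  final SharedLearner learner;
  LearningWrapper(SharedLearner learner) { this.learner = learner; }
  int getAtp1(double[] observation, double reward) { return learner.learn(observation, reward); }
}

class EvaluationWrapper {
  final SharedLearner learner;
  EvaluationWrapper(SharedLearner learner) { this.learner = learner; }
  int getAtp1(double[] observation, double reward) { return learner.proposeAction(observation); }
}

The point is that both wrappers hold a reference to the same learner: the learning runner changes its parameters, while the evaluation runner only reads them.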

For your problem, creating two instances of Runner would be one way to go, but there are other options. For instance, the example SarsaMountainCar does not use a Runner. Moreover, maybe the class OffPolicyAgentDirect is not suited to your needs, in which case you can reimplement the interface RLAgent (used by the class Runner) to fit your needs.

Finally, yes, I think you need at least two instances of GTD and some way to mix their weight parameters.

Cheers,

Thomas


Sergio Valcarcel

Jul 30, 2013, 9:36:31 PM
to github...@googlegroups.com
Thanks, Thomas. This info has been very useful! I now have two runners working in parallel.

In order to mix the GTD parameters, I am thinking about declaring each agent's criticAdapter as a global variable so I can combine them in run(); would that be fine?

I still have to find out where exactly "v" is (I mean the prediction parameter vector, such that Xv approximates the value function). The method criticAdapter.predictor() returns offPolicyTD, which seems to be a GTDLambda object. So criticAdapter.prediction() returns the vector "v", is that right? Which one should I use: "v" (the predictions) or "v_t" (the weights)?

Thank you very much!
Best!
Sergio

Thomas Degris

Jul 31, 2013, 9:17:41 AM
to github...@googlegroups.com
Hi Sergio,

>
> In order to mix the GTD parameters, I am thinking about declaring each agent's criticAdapter as a global variable so I can combine them in run(); would that be fine?

I suggest you do not use the criticAdapter; just use GTDLambda directly (maybe encapsulated in a class implementing RLAgent if you would like to use Runner).
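
The skeleton would be something like this. It is only a rough sketch: I am assuming that RLAgent's single method is Action getAtp1(TRStep step), and I leave out the rlpark imports and the GTD update itself, since they depend on your features and behaviour policy:

// Rough skeleton of a prediction-only agent wrapping GTDLambda.
// Assumption: RLAgent declares Action getAtp1(TRStep step);
// rlpark imports are omitted, adapt them to your project.
public class GTDCriticAgent implements RLAgent {
  private final GTDLambda gtd;
  private final Action behaviourAction; // placeholder for a real behaviour policy

  public GTDCriticAgent(GTDLambda gtd, Action behaviourAction) {
    this.gtd = gtd;
    this.behaviourAction = behaviourAction;
  }

  @Override
  public Action getAtp1(TRStep step) {
    // perform the GTD update here from the observation in step
    // (see the GTDLambda source for the exact method and signature)
    return behaviourAction;
  }

  public GTDLambda critic() {
    return gtd; // exposed so the two agents can exchange the critic parameters
  }
}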

> I still have to find out where exactly "v" is (I mean the prediction parameter vector, such that Xv approximates the value function). The method criticAdapter.predictor() returns offPolicyTD, which seems to be a GTDLambda object. So criticAdapter.prediction() returns the vector "v", is that right? Which one should I use: "v" (the predictions) or "v_t" (the weights)?

Use GTDLambda::weights() to get the weight vector. Use GTDLambda::prediction() to get the last computed prediction. See the source code at:
http://people.bordeaux.inria.fr/degris/public/doxygen/html/_g_t_d_lambda_8java_source.html
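
For mixing the two critics, something along these lines should do. I am assuming you can get at the underlying double[] of the vectors returned by weights(); check the vector class in the source for the exact accessor:

// Inside whatever class runs your experiment:
// average the two critics' weight vectors in place after each time step.
// w1 and w2 would be the raw double[] behind gtd1.weights() and gtd2.weights().
static void mixWeights(double[] w1, double[] w2) {
  for (int j = 0; j < w1.length; j++) {
    double mixed = 0.5 * (w1[j] + w2[j]);
    w1[j] = mixed;
    w2[j] = mixed;
  }
}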

Thomas

Sergio Donal

Aug 25, 2013, 11:36:28 AM
to github...@googlegroups.com
Dear Thomas,

Thank you very much for your support.

I am trying to reimplement RLAgent to make GTDLambda accessible (I understand that this is what you suggested; is that correct?).

However, I am still a bit lost with the interfaces. For example, if I modify OffPolicyAgentDirect to reimplement OffPolicyAgentEvaluable, I get the following warning:

Discouraged access: The type OffPolicyAgentEvaluable is not accessible due to restriction on required library /opt/eclipse/plugins/rlpark.plugin.rltoys_1.0.0.201308081705.jar

What does it actually mean? Have I understood your suggestion correctly?

Thanks!
Sergio

Sergio Donal

Aug 25, 2013, 4:57:40 PM
to github...@googlegroups.com
Hello,

I found the @SuppressWarnings("restriction") annotation at the top of the demo code, so I understand that I should not worry about this warning.
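
For reference, this is the kind of place where I put it (the class name here is just my own, nothing from RLPark):

// Eclipse reports "Discouraged access" when you use a class that the plug-in
// does not export as public API; the annotation only silences that warning.
@SuppressWarnings("restriction")
public class MyPredictionExperiment {
  // code using the restricted rlpark.plugin.rltoys types goes here
}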

Thanks!
Sergio