TDLambda code


Ryan Bak

Nov 11, 2013, 11:02:59 AM
to github...@googlegroups.com
Hi,

I am a student trying to use your TDLambda code for a game learning project, and I was hoping to get some clarification on what some of the variables are. In particular, I was wondering about nbFeatures and prototype in the constructor; RealVector x_t and x_tp1 in update(); and the doubles returned from initEpisode(), update(), predict(), and prediction(). If I could get some more information on what these values are, or how they correspond to the TDLambda algorithm, I would appreciate it. Also, I apologize if I asked anything that should be obvious; I am still working on my understanding of TDLambda.

Thanks,
Ryan

Thomas Degris

Nov 13, 2013, 4:57:11 PM
to github...@googlegroups.com
Hello Ryan,

The goal of TDLambda is to predict the cumulative sum of rewards with respect to the agent state. In RLPark, the agent state is represented as a vector (the RealVector class). x_t and x_tp1 are the agent state at time t and t+1, respectively. nbFeatures is the number of dimensions in x_t and x_tp1. initEpisode() and update() return the latest TD error. predict() computes a prediction for a given state. prediction() returns the last prediction computed by the update function.
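If it helps, here is a rough sketch in plain Java of how these pieces fit together in linear TD(lambda). This is not RLPark's actual code; the class, constructor, and field names are only illustrative, so please check the real signatures in the source.

// Minimal sketch of linear TD(lambda) prediction (illustrative only, not RLPark code).
public class TDLambdaSketch {
  private final double alpha, gamma, lambda; // step size, discount factor, trace decay
  private final double[] v;                  // weight vector: one weight per feature
  private final double[] e;                  // eligibility traces
  private double lastPrediction;

  public TDLambdaSketch(int nbFeatures, double alpha, double gamma, double lambda) {
    this.alpha = alpha;
    this.gamma = gamma;
    this.lambda = lambda;
    v = new double[nbFeatures]; // nbFeatures = dimension of x_t and x_tp1
    e = new double[nbFeatures];
  }

  // x_t and x_tp1 are the feature vectors of the state at time t and t+1;
  // r_tp1 is the reward received on the transition. The return value is the TD error.
  public double update(double[] x_t, double[] x_tp1, double r_tp1) {
    double delta = r_tp1 + gamma * predict(x_tp1) - predict(x_t);
    for (int i = 0; i < v.length; i++) {
      e[i] = gamma * lambda * e[i] + x_t[i]; // accumulating eligibility trace
      v[i] += alpha * delta * e[i];
    }
    lastPrediction = predict(x_tp1);
    return delta;
  }

  // Prediction for any state: dot product of the weights with the feature vector.
  public double predict(double[] x) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++)
      sum += v[i] * x[i];
    return sum;
  }

  // Last prediction computed by update(), analogous to prediction().
  public double prediction() {
    return lastPrediction;
  }
}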

The prototype argument in the constructor specifies the kind of eligibility traces you would like. You can find more information about eligibility traces and TD(lambda) in general in the book "Reinforcement Learning: An Introduction" by Sutton and Barto.
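Conceptually, the trace prototype only changes how the trace vector e is updated at each step. For instance, the two most common choices look roughly like this (again just a sketch, not the library code):

// Accumulating traces: the old trace decays and the current features are added.
e[i] = gamma * lambda * e[i] + x_t[i];

// Replacing traces: for active (non-zero) features, the trace is reset to the
// feature value instead of accumulating; inactive features only decay.
e[i] = (x_t[i] != 0.0) ? x_t[i] : gamma * lambda * e[i];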

Thomas


Ryan Bak

Nov 24, 2013, 12:04:27 PM
to github...@googlegroups.com
Thomas, 

Thanks for your help; your information was very useful. I do have a couple of follow-up questions now that I have my program semi-working. First, I'm having a problem with TDLambda where the values in the v vector grow out of control. I have a simple example in which the computer navigates an 8x8 grid and attempts to learn the fastest route to a specific position. Repeating this experiment with the TD code, or with TDLambda where lambda=0, shows that the computer very quickly learns the quickest path to the goal. However, when I increase lambda, the computer wanders around without noticeably learning, and the values in the v vector increase toward infinity at a speed partly determined by the value of lambda. Do you have any idea what could cause this behavior? I am not using an eligibility trace and am leaving that argument out of the constructor; my understanding is that traces are an improvement to the TDLambda algorithm but not strictly necessary.

Second, a clarification question about the nbFeatures argument: my end goal is to use TDLambda to play checkers. The state of my system can therefore be described as an array of values of length 64, and this is what I was using for x_t and x_tp1 (roughly the encoding sketched below). However, after watching the v vector update (even with lambda=0), I'm wondering if I misunderstood. Should nbFeatures be the number of possible states of the board? Currently, with nbFeatures=64 and lambda=0, the values in the v vector still grow to infinity over time, which appears to be because even when the board state changes, most pieces don't move, so the values of their corresponding positions in v keep increasing.
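To be concrete, my current encoding is roughly the following; the piece codes and the boardToFeatures helper are just hypothetical, to show the shape of what I'm doing:

// One feature per square, so nbFeatures = 64.
// board[row][col] holds a piece code, e.g. 0 = empty, 1/-1 = man, 2/-2 = king.
double[] boardToFeatures(int[][] board) {
  double[] x = new double[64];
  for (int row = 0; row < 8; row++)
    for (int col = 0; col < 8; col++)
      x[row * 8 + col] = board[row][col];
  return x;
}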

Lastly, I have been storing v as a bin file after every iteration of the algorithm through a game, with the idea that I could shut down the program and start it up again where the last run left off learning. However, I don't see any way to set v, since it is protected. Do I have the right idea that this would let me pick up learning where an earlier run left off, and if so, is there any way to do this without altering your code?

Thanks for all your help,
Ryan