Hi,
I've explored the Burlap tutorials, Javadocs and source code quite extensively and it seems to be a well-thought out and clearly written framework, which is great because I'm new to RL (I've read Sutton & Barto and feel like I have a reasonable grasp of the core ideas) and so the clarity of the code really makes learning the concepts much easier. I feel that Burlap is the right tool to accomplish my task, but the plethora of options available has me a little bewildered as to which route to follow (my attempts so far have not yielded great results -- details below). Hopefully the community here can point me in the right direction
The task
I'm writing an AI for a contest involving an arcade-game style world.
A quick summary of the game parameters is:
- The AI must control a ship that can move, rotate and fire at enemies (that occasionally shoot back).
- There are also stationary targets that appear from time to time
- A competitor is also in the arena, who you can also try and kill.
- The game proceeds by turns, with both competitors carrying actions simultaneously each turn.
You score points for killing enemies, your competitor, and the stationary targets.
There are a finite number of turns (250), and the highest score after the last turns wins.
If you get killed, you respawn after a short cool-down period.
The area is discretised, so the agents move around on a grid (it's can be between 40 x 20 and 20 x 40 in size).
The game mechanics are almost purely deterministic. The only random elements are
- Which enemy will fire at you
- The competitor's movements (obviously)
- The location and timing appearance of the stationary targets
My approach (to date)
So far, I've written all the game mechanics into a JointActionModel, and written a visualiser for the game so that I can see that everything works 100% when I control the game manually (using SGVisualExplorer).
At my first stab at getting some learning done (let's call this the naive approach), I implemented the full state (enemies, both players, targets and some meta-data) in an SGDomain - in total there were some half dozen Object Classes and maybe 300 "live" attributes in a given state. I also implemented the JointReward and TerminalFunctions as per the game rules. I made the opponent a RandomAgent to begin with. Sending this to an MAQLFactory with an e-Greedy policy yielded no positive results -- After about 300,000 iterations, I ran out of memory, and the AI was still doing no better than the random agent. The SGNaiveQLAgent didn't do any better,
I suspected the state space was just far too large to accomplish anything positive, or I had implemented something incorrectly (or both).
My next attempt was to use a simplified state representation, where I basically just tracked the closest enemy and any missiles headed my way. This reduced the state space size by several orders of magnitude, but still could not do any better than a random agent -- the average reward is a flat line over a few hundred thousand iterations.
What next?
I'm a little surprised at the results. Intuitively, (and the math supports it), Q-learning should work (eventually) so either, the problem is still too large, or I've missed a turn in the implementation. Since the game is very nearly deterministic, what I'd REALLY like to have is:
- Actually use a planning algorithm to look ahead a certain number of turns to exploit any mistakes by the competitor (e.g. hit a target before he can etc)
- "Initialize" the Q-value map using a competitor agnostic learning algorithm by training the AI before any competitions are actually carried out
- This also entails serialising and deserialising the Q-value map somehow -- which Burlap doesn't seem to support yet, but I'm sure I can put something together given a few pointers.
Ok, no biggy, right? :)
Following my initial efforts I have some additional questions on how to proceed:
- Is the Stochastic Game class family the right one for this task, or can it be done using Single Agents?
- If SG is the right way to go, and if so how does one incorporate the planning algorithms in the SG worlds? I've looked at the source code for SGtoSADomain, SALearningAgentFactoryForDG and related classes, but I'm not sure how one would implement say, LSPI in a SG domain for planning purposes; MAValueFunctionPlanner only has one subclass implementation at present, and it doesn't use any SA planning algorithms directly (as far as I can ascertain).
- What is a "reasonable" size for the state space that Burlap will be able to handle?
- Is using StateAbstraction the preferred way to reduce the state space? Are there any gotchas that I need to look out for when using it (e.g. I noticed that the reward function uses the full state even when the state abstraction is non-null)?
- Should I be thinking about using another state-space reduction techniques, such as LSPI or RBFs?
- I have 12GB RAM for the prelearning step, but the competition server will only have about 2GB available.
Fingers crossed, I can get some guidance here and lick this thing.
If it comes right, I'll be more than happy to release the code as an additional tutorial / case study -- a stochastic game example is currently missing from the documentation, so maybe this will be able to fill the gap.
Cheers
Cayle