Advice for RL algorithm choices for an arcade game AI

69 views
Skip to first unread message

Cayle Sharrock

unread,
Jul 23, 2015, 3:44:54 AM7/23/15
to BURLAP Discussion
Hi,

I've explored the Burlap tutorials, Javadocs and source code quite extensively and it seems to be a well-thought out and clearly written framework, which is great because I'm new to RL  (I've read Sutton & Barto and feel like I have a reasonable grasp of the core ideas) and so the clarity of the code really makes learning the concepts much easier. I feel that Burlap is the right tool to accomplish my task, but the plethora of options available has me a little bewildered as to which route to follow (my attempts so far have not yielded great results -- details below). Hopefully the community here can point me in the right direction

The task

I'm writing an AI for a contest involving an arcade-game style world. 

A quick summary of the game parameters is:
  • The AI must control a ship that can move, rotate and fire at enemies (that occasionally shoot back). 
  • There are also stationary targets that appear from time to time
  • A competitor is also in the arena, who you can also try and kill.
  • The game proceeds by turns, with both competitors carrying actions simultaneously each turn.
You score points for killing enemies, your competitor, and the stationary targets.
There are a finite number of turns (250), and the highest score after the last turns wins.
If you get killed, you respawn after a short cool-down period.

The area is discretised, so the agents move around on a grid (it's can be between 40 x 20 and 20 x 40 in size).

The game mechanics are almost purely deterministic. The only random elements are
  • Which enemy will fire at you
  • The competitor's movements (obviously)
  • The location and timing appearance of the stationary targets
My approach (to date)

So far, I've written all the game mechanics into a JointActionModel, and written a visualiser for the game so that I can see that everything works 100% when I control the game manually (using SGVisualExplorer).

At my first stab at getting some learning done (let's call this the naive approach), I implemented the full state (enemies, both players, targets and some meta-data) in an SGDomain - in total there were some half dozen Object Classes and maybe 300 "live" attributes in a given state. I also implemented the JointReward and TerminalFunctions as per the game rules. I made the opponent a RandomAgent to begin with. Sending this to an MAQLFactory with an e-Greedy policy yielded no positive results -- After about 300,000 iterations, I ran out of memory, and the AI was still doing no better than the random agent. The SGNaiveQLAgent didn't do any better,

I suspected the state space was just far too large to accomplish anything positive, or I had implemented something incorrectly (or both).

My next attempt was to use a simplified state representation, where I basically just tracked the closest enemy and any missiles headed my way. This reduced the state space size by several orders of magnitude, but still could not do any better than a random agent -- the average reward is a flat line over  a few hundred thousand iterations.

What next?

I'm a little surprised at the results. Intuitively, (and the math supports it), Q-learning should work (eventually) so either, the problem is still too large, or I've missed a turn in the implementation. Since the game is very nearly deterministic, what I'd REALLY like to have is:

  • Actually use a planning algorithm to look ahead a certain number of turns to exploit any mistakes by the competitor (e.g. hit a target before he can etc)
  • "Initialize" the Q-value map using a competitor agnostic learning algorithm by training the AI before any competitions are actually carried out
    • This also entails serialising and deserialising the Q-value map somehow -- which Burlap doesn't seem to support yet, but I'm sure I can put something together given a few pointers.
Ok, no biggy, right? :)

Following my initial efforts I have some additional questions on how to proceed:
  • Is the Stochastic Game class family the right one for this task, or can it be done using Single Agents?
    • If SG is the right way to go, and if so how does one incorporate the planning algorithms in the SG worlds? I've looked at the source code for SGtoSADomain, SALearningAgentFactoryForDG and related classes, but I'm not sure how one would implement say, LSPI in a SG domain for planning purposes; MAValueFunctionPlanner only has one subclass implementation at present, and it doesn't use any SA planning algorithms directly (as far as I can ascertain).
  • What is a "reasonable" size for the state space that Burlap will be able to handle? 
    • Is using StateAbstraction the preferred way to reduce the state space? Are there any gotchas that I need to look out for when using it (e.g. I noticed that the reward function uses the full state even when the state abstraction is non-null)?
    • Should I be thinking about using another state-space reduction techniques, such as LSPI or RBFs?
    • I have 12GB RAM for the prelearning step, but the competition server will only have about 2GB available.
Fingers crossed, I can get some guidance here and lick this thing.
If it comes right, I'll be more than happy to release the code as an additional tutorial / case study -- a stochastic game example is currently missing from the documentation, so maybe this will be able to fill the gap.

Cheers
Cayle

James MacGlashan

unread,
Jul 23, 2015, 9:02:17 AM7/23/15
to BURLAP Discussion, ca...@nimbustech.biz, ca...@nimbustech.biz
Hi Cayle,

Sounds interesting! When you frame this as a stochastic games are you have both agents be learning agents? If so, that can be challenging for standard NaiveQ learning, because the policy of the other agent is changing with time. Sometimes things work out okay, but other times it can be a problem. Other multi-agent learning algorithms like MaxQ do not suffer this problem, but explode the problem out really large by reasoning over the *joint* action space.

I think a better solution is train your agents in sets. First fix the policy of the opponent, initially to something really naive that you can hard code (for example, shoot the player or nearest target or randomly move if nothing is in range). Now train your agent against this player. After training, you can update the AI of the opponent to be the policy that you just learned and repeat the process from there (potentially initializing your player to its previous state so that it doesn't have to start from scratch).

To do this, it might be easier to code things up as a single agent domain. Although you have multiple players, you really only care about developing a good AI for one and reasoning in the full multi-agent space is often hard. Furthermore, if you want use a little bit of planning, having everything prepared in a single agent domain will make that easier. The SGToSA domain implementation itself really only affords single agent learning algorithms against other agents and not planning. You could make a more advanced conversion that enabled planning by specifying the policy of the other agents and baking that into the transition dynamics (some people here at Brown are doing things like that), but it seems like a lot more overhead then you want right now.

I also do very much encourage you to add a little bit of planning to the process. A little bit of search depth goes a long way to improve capabilities (even 1-2 steps). If you want to do that, I would look into the SparseSampling, which does a finite horizon tree search and if you give it a heuristic value for leaf nodes (perhaps a value that is learned with something like LSPI or another learning algorithm) using the setValueForLeafNodes it will work quite nicely. It might even work quite nicely if you give it a hand coded heuristic.


As far as reducing the state space complexity, in general, your state variables should be the minimum number of variables necessary to fully define the problem. After that, if you shrink it, the best ways to do that are typically through value function approximation (VFA) methods; for example using LSPI or gradient descent SARSA with a VFA approach that shrink the representation down into some set of features. There are some common methods for that in BURLAP already (RBFs, tile coding, Fourier), but for a complex game like this it would probably be best for you to define your own state features by implementing something like a FeatureDatabase. Of course if you use SparseSampling in conjunction with a learned value function as described previously, SparseSampling itself can use the full state space since you'll only look a few moves ahead, whereas the heuristic you learned might do best to use VFA.


Hope that helps!


James

Cayle Sharrock

unread,
Jul 24, 2015, 5:48:36 AM7/24/15
to BURLAP Discussion, jmacg...@gmail.com
Hi James

Thanks for the quick response. There's quite a bit of advice there and exactly the kind of guidance I was hoping for. I've started translating the game mechanics to the SA domain already and will give your suggestions a go; though I may have some follow up questions before I'm done.

To have the opponent's strategy defined in the single-agent context, am I right in saying that this gets baked into the Action subclass? 

So if I define something like OpponentStrategy:

/**
* Interface for opponent strategy
*/
public interface OpponentStrategy {
String getMove(State s);
}

Then in my Action subclass I can handle that strategy in getTransitions:

public class SingleShipAction extends Action {

public SingleShipAction(String name, Domain domain, OpponentStrategy opponent, Integer seed) {
super(name, domain, "");
this.seed = seed;
this.opponentStrategy = opponent;
}

}

@Override
public List<TransitionProbability> getTransitions(State s, String[] params) {
List<TransitionProbability> transitions = new ArrayList<>(2);
State newstate = performActionHelper(s, params);
//Update state based on opponent's move
String oppMove = opponentStrategy.getMove(s);
switch (oppMove) {
case Nothing:
case MoveLeft:
case MoveRight:
//...
}
//Generate non-deterministic states
//If enemies will shoot
return transitions;
}
Thanks and Regards
Cayle

James MacGlashan

unread,
Jul 24, 2015, 8:33:23 AM7/24/15
to Cayle Sharrock, BURLAP Discussion
Also make sure you do it in the performActionHelper (since some planning algorithms are sample-based), but otherwise yes, that looks right!

James

Cayle Sharrock

unread,
Aug 3, 2015, 3:17:21 PM8/3/15
to BURLAP Discussion, ca...@nimbustech.biz

Hi James

I've implemented a linear VFA based on a hand-rolled FeaturesDatabase that condenses the full state into some 16 features. With 7 possible actions, that gives 112 parameters for SARSA-λ to fit in its learning iterations. After a few thousand training runs, the heuristics are starting to make the "right" decisions (against a really dumb, deterministic opponent), so things are looking positive. It seems clear that crafting the heuristics and features is the real essence of building a successful AI for this sort of application. I also serialise and de-serialise the coefficients so that the agent can continue learning from where it left off, or use the VFA in a Sparse Sampling tree (as you suggested).

The learning has gone as far as it can against a passive opponent (it wins every time, but is still a far from optimal strategy). I could try and write as smart an algorithmic opponent strategy that I can and have the current implementation train to beat that, but I feel that's a 2nd prize to having the agent shift to learning by playing against itself (as you have suggested and as I had envisaged from the outset). Here I've come across a few intricacies, and I'd like to see what you think before I plough ahead.

  • By employing the same agent as its opponent, the OpponentStrategy.getMove produces an infinite recursion (because of how SAs work, I need your move to carry out mine, but you need mine to carry out yours etc.). Now I could pretend each opponent just does Nothing each turn and have some other "controller" agent return the true state after each agent has acted, but I think this introduces further wrinkles (see next).
  • In order for an agent to play itself, the field (i.e. targets, enemies etc) need to be synchronised for each agent. I can't really use the same state, since the competing agents are at opposite ends of the field and thus the objects in that field are mirror images when viewed from each agent's perspective. Again, this is resolved by employing a "controller" state/agent that handles the flipping etc for each turn. However,
By implementing such a controlling agent, I sense that I'm reimplementing the Stochastic Game classes that already exist in Burlap (in fact, while thinking about how this would go, I envisaged running each agent in its own thread so that learning can proceed wholly independently. Lo and behold, that's exactly what SingleAgentInterface does, so I really think I'm perilously close to reinventing the wheel here)

So I've come around again to thinking that it's best to go back into the SGDomain space, but avoiding the mistakes I made in the past (due to an incomplete understanding of the intent of some of the BURLAP classes). The SADomain work will not be in vain, since I'd like to use what I already have by using SingleAgentInterface (or better yet a modified version that allows for planning). So that said, do you have any further thoughts or pointers? Recall that
  • The agents are 100% competitive (So a JointRewardFunction is just a concatenation of individual RewardFunction values)
  • I'm not too worried about exploring the Joint Action Space right now. The agents can learn (and plan) independently (but the state representation needs to be maintained and synchronised for each agent)

Thanks and regards
Cayle

James MacGlashan

unread,
Sep 29, 2015, 12:00:03 PM9/29/15
to BURLAP Discussion, ca...@nimbustech.biz
Hi Cayle,

Sorry this took so long to get back to you. I was traveling a lot during the time you sent this and it fell off my radar. You've probably moved on from this but I'll provide some feedback all the same in case its useful. If you want to have two agents learning simultaneously, yes, the stochastic games branch lets you do that and you can use the single agent interface to allow standard RL algorithms to run on both ends. However, one possible limitation of the single agent interface is that you can't use any single agent planning algorithms since planning requires having an expectation of what the other agent will do. If you're comfortable just using RL algorithms you can certainly use it though.

James
Reply all
Reply to author
Forward
0 new messages