## Research update:
Hi everyone!
Last time I updated on my research, I was just finishing up investigating oscillations in IPD. That's done, and now I'm continuing toward my main goal: getting team formation to emerge. Concretely, I want to design an environment, throw 10 agents into it, and have 3 or 4 of them decide to help each other and not help agents outside their team. My superpower here is AdAlign, an opponent-shaping algorithm that can teach agents reciprocity.
I spent two weeks designing and implementing this environment. I call it the pizza environment. The name is temporary... Maybe ;)

What are you seeing here?
- This is a 2d gridworld environment.
- There are 6 agents moving around an 8x8 grid. All of these hyperparameters are easily configurable.
- The agents are marked with digits `0` to `5`.
- The green `o` things are olives. If an agent eats an olive, it gets a reward of 1 for that timestep; otherwise it gets a reward of 0.5.
- Whenever an olive is eaten, a new one respawns at a random location.
- The right edge of the grid connects to its left edge, and its top edge connects to its bottom edge. This means the world is a torus.
- If two agents are close enough together, they can attack each other. An attacked agent gets a reward of `0`, regardless of whether it just ate an olive.
- See how agent `0` is marked in green and agent `5` is marked in red? That's because `0` just attacked `5`. But aren't they quite far? Not in torus-land, baby!
- Whether the attacking agent gets a different reward is still a hyperparameter I'm playing with.
- Agents can move and attack at the same time. In each timestep, each agent chooses where to move, and which of the other agents to attack, if any. (There's a small code sketch of these mechanics right after this list.)
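To make these rules concrete, here's a minimal sketch of the wrap-around distance check and the per-step reward. Everything in it (the names, the attack radius of 2) is illustrative, not my actual implementation:

```python
import numpy as np

GRID_SIZE = 8  # the 8x8 grid described above

def torus_distance(pos_a, pos_b, size=GRID_SIZE):
    """Chebyshev distance on a grid whose edges wrap around (a torus)."""
    diff = np.abs(np.array(pos_a) - np.array(pos_b))
    wrapped = np.minimum(diff, size - diff)  # going "the other way" around may be shorter
    return int(wrapped.max())

# "Close enough to attack" radius; a placeholder value, the real one is a hyperparameter.
ATTACK_RANGE = 2

def can_attack(attacker_pos, target_pos):
    return torus_distance(attacker_pos, target_pos) <= ATTACK_RANGE

def step_reward(ate_olive, was_attacked):
    """Per-agent reward for one timestep, following the rules above."""
    if was_attacked:
        return 0.0                    # being attacked wipes out this step's reward
    return 1.0 if ate_olive else 0.5  # olive bonus on top of the 0.5 baseline
```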
Why did I choose these rules? I want the following chain of events to happen:
- At the start of training, agents random-walk around the environment.
- Agents learn to eat the olives and increase their reward.
- Some agents learn to attack some of the time. (I may need to gently incentivize this.)
- Agents learn to stay far away from agents that tend to attack.
- Agents learn that if they attack other agents, those agents stay away from them, decreasing the competition for the olives that spawn in their region.
- Thanks to AdAlign, some pairs of agents learn to attack each other in a tit-for-tat dynamic, which allows them to reciprocally lower their rate of attack.
- These pairs of agents could forage next to each other, and together attack any invader that comes into their territory.
- Some third or fourth agent may enter into that arrangement.
- In the distance, another such cluster of agents might form.
I've been working on getting the above to happen. There's been lots of technical work in making the experiments run fast and not take up a lot of hard-disk space.
I would say that I'm at around step 4 of that list, with caveats. I got the agents to learn to eat. Below is the fraction of timesteps in which each agent eats an olive. In this plot and all the plots that follow, the x axis is epochs of training.

By increasing the reward for attacking other agents, I got the agents to attack. Here's the caveat: I added a couple of options that make it so (a) only agent 0 is allowed to attack, and (b) agent 0 can't move. These restrictions help me right now, and I plan to remove them soon.
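Concretely, the restrictions are just a couple of environment options. Roughly like this, with made-up option names and a made-up attack reward value (not the actual ones):

```python
# Illustrative config for the restricted setting; option names and values are placeholders.
pizza_config = dict(
    grid_size=8,
    n_agents=6,
    attackers=[0],        # (a) only agent 0 is allowed to attack
    frozen_agents=[0],    # (b) agent 0 can't move
    attacker_reward=1.5,  # bumped up to get attacking off the ground; still a hyperparameter
)
```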
Here is the fraction of timesteps in which agent 0 attempts to attack each of the other agents:

Here is the fraction of timesteps that each of the other agents spends within a single step of agent 0:

What I'm getting from the above is that the agents learn that being close to agent 0 leads to getting attacked, so they all know to stay away.
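For what it's worth, the proximity curves come from simple per-timestep bookkeeping along these lines. This is only a sketch: it reuses the `torus_distance` helper from the earlier snippet and assumes positions get logged once per timestep.

```python
def fraction_near_agent0(episode_positions, radius=1):
    """Fraction of timesteps each agent spends within `radius` steps of agent 0.

    `episode_positions` is a list with one entry per timestep; each entry is a
    dict mapping agent id -> (row, col) position on the grid.
    """
    counts = {agent_id: 0 for agent_id in episode_positions[0] if agent_id != 0}
    for positions in episode_positions:
        for agent_id in counts:
            if torus_distance(positions[agent_id], positions[0]) <= radius:
                counts[agent_id] += 1
    return {agent_id: c / len(episode_positions) for agent_id, c in counts.items()}
```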
And here's an animation of a pizza episode with the restrictions I mentioned. Note how a field of olives gradually grows around agent 0, with no one eating them for fear of being attacked.
My job is now basically to continue down the list above while trying to remove the restrictions I added.
Here's a picture of me showing my research at a poster session at the CHAI workshop:
See you next month,
Ram.