Research update: POLA experiment analysis


Ram Rachum

Sep 30, 2024, 12:06:04 PM
to ram-rachum-res...@googlegroups.com

Hi everyone!

Retrospective on goals for last month

In last month's update I outlined a list of goals. Here's my update on these goals:

  1. Find freelance work to fund my research: 😢 Not yet

    I talked with around 30 people I know about finding freelance work, but couldn't find anything relevant for me. I do have a meeting scheduled with a potential client, so maybe that'll work out. If you know of anyone looking for a Python freelancer, let me know.

  2. Make some progress on POLA: Made progress, but still have lots of uncertainty

In the last few months, I've noticed that the "goals" format I've been using for these updates isn't working so well, mostly because my work on POLA is open-ended and hard to define as a goal. I have mixed feelings here. On one hand, I'm disappointed that I've been working on it for so long and haven't gotten publishable results yet, and I'm concerned that I may be wasting invaluable time and money on a wild goose chase; on the other hand, I do feel, at least part of the time, that I'm learning things and doing meaningful work. I'll share some of that below. For now I'll keep working on POLA, but it's possible that at some point I'll have to pivot if I don't get meaningful results.

POLA experiments

Here are some nice results from a simplified version of my POLA experiments. I temporarily replaced the neural networks that the agents use with something like a gradient bandit algorithm (Section 2.8 of Sutton and Barto). This is part of my general effort to make simpler versions of the experiment so that I can debug them more easily. While this version doesn't have a neural network, it still performs gradient ascent and still has an actor-critic architecture.
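
For anyone who wants the concrete update rule, here's a minimal sketch of the textbook gradient bandit algorithm from Section 2.8 of Sutton and Barto: gradient ascent on softmax action preferences, with a running-mean reward baseline standing in for the critic. This is just the textbook version with two actions, not my exact POLA variant (my agents also condition on history and account for the opponent's learning):

    import numpy as np

    def softmax(h):
        e = np.exp(h - h.max())
        return e / e.sum()

    class GradientBanditAgent:
        """Textbook gradient bandit (Sutton & Barto, Section 2.8).
        Two actions: 0 = Defect, 1 = Cooperate."""

        def __init__(self, n_actions=2, alpha=0.1):
            self.h = np.zeros(n_actions)   # action preferences (the "actor")
            self.baseline = 0.0            # running mean reward (a crude "critic")
            self.alpha = alpha
            self.t = 0

        def act(self, rng):
            return rng.choice(len(self.h), p=softmax(self.h))

        def update(self, action, reward):
            pi = softmax(self.h)
            advantage = reward - self.baseline
            # Gradient ascent on expected reward: push the chosen action's
            # preference up and the others down, in proportion to the advantage.
            self.h += self.alpha * advantage * (np.eye(len(self.h))[action] - pi)
            # Update the running-mean baseline.
            self.t += 1
            self.baseline += (reward - self.baseline) / self.t

The baseline only reduces variance; the sign of (reward - baseline) is what decides whether the chosen action's preference goes up or down.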

Here are the cooperation rates of two POLA agents playing iterated prisoner's dilemma against each other:

[Image: 2024-09-29 Stable extortion.png - cooperation rates of the two agents over training]

"Cooperation rate", which I shorten to "corate", means how often an agent chooses the Cooperate action rather than the Defect action. The green line is agent 0's corate and the red line is agent 1's corate.

The nice thing about this run is that the phases of learning (a.k.a. autocurriculum) can be visually discerned:

[Image: cut.jpg - the corate plot divided into the learning phases]

Here's the breakdown of the 4 phases:

  • Phase 0: The agents start the experiment knowing nothing. They discover two things: that defecting is an easy way to get lots of reward, and that behaving reciprocally causes the opponent to cooperate more often, which also gains them reward. In this phase the former is more lucrative than the latter.

  • Phase 1: The tide turns once the agents' reciprocity levels are high enough that defecting is no longer lucrative. The agents' corates skyrocket.

  • Phase 2: The agents learn to extort each other. Agent 1 is the extorter and agent 0 is the extortee. This means that agent 0 consistently cooperates more than agent 1, because it knows that agent 1 will punish it if its corate falls below a certain level.

  • Phase 3: With neural networks, the extortion was not stable, but with this simpler algorithm, it is. The agents seem to converge to corates of 0.74 and 0.51 respectively.

What's interesting about extortion is that it's a behavior that combines reciprocity with violence. Different runs result in different agents ending up on top. Yes, this is similar to the dominance hierarchies experiment, and we could probably get multiple agents to form a dominance hierarchy based on ZD extortion. However, this time my sights are set on team formation.
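
For context, "ZD extortion" refers to the zero-determinant strategies of Press and Dyson (2012). Here's a sketch of a memory-one extortion strategy for the standard prisoner's dilemma payoffs, as background on the concept; it's not the policy my agents actually learn:

    def zd_extortion_strategy(chi=3.0, phi=1/26, R=3, S=0, T=5, P=1):
        """Return (p_cc, p_cd, p_dc, p_dd): the probability of cooperating after
        each possible previous round (my move, opponent's move). The strategy
        enforces (my payoff - P) = chi * (opponent's payoff - P), so for chi > 1
        I claim chi times the opponent's surplus over mutual defection."""
        p_cc = 1 - phi * (chi - 1) * (R - P)
        p_cd = 1 + phi * ((S - P) - chi * (T - P))
        p_dc = phi * ((T - P) - chi * (S - P))
        p_dd = 0.0
        return p_cc, p_cd, p_dc, p_dd

    print(zd_extortion_strategy())   # (11/13, 1/2, 7/26, 0), Press & Dyson's chi=3 example

(phi has to be small enough for these to be valid probabilities; with these defaults it is.)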

I bought a water-cooled workstation

I've been running POLA experiments for about 6 months now, and it takes JAX between 10 minutes and 10 hours to compile each experiment, depending on its complexity. This compilation time is a big annoyance: I want to iterate, running experiments one after another, and these forced pauses make me lose concentration.

So I bought a beefy workstation to run these experiments:

[Image: 2024-09-17 13.09.33 New Melfi.jpg - the new workstation]

It has an i7-14700K processor and 64 GB of RAM. I kept costs low by not buying a GPU, since most of the slowness I have to deal with is the compilation rather than the actual running of the experiments, and the compilation happens on the CPU. However, I bought a strong power supply, which will allow me to add a GPU in the future if I need one.

See the black hoses coming into and out of the processor area? Those are the water-cooling hoses, going up to the fans at the top. I've never owned a water-cooled PC before, so I'm excited to have one.

This computer compiles my experiments 2-3x faster than my other computers, which makes it easier for me to work. When I run my tests, I can run them all in parallel, which takes only a minute when the compilations are cached.
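
In case it helps anyone else fighting JAX compile times: one standard way to reuse compilations across runs is JAX's persistent compilation cache. A minimal sketch, assuming a reasonably recent JAX version and an arbitrary cache directory (I'm not claiming this is exactly my setup):

    import jax

    # Compiled XLA executables get written to disk and reused by later runs,
    # so repeated experiments skip most of the compilation time.
    jax.config.update("jax_compilation_cache_dir", "/tmp/jax_cache")
    jax.config.update("jax_persistent_cache_min_compile_time_secs", 0)

    @jax.jit
    def step(x):
        return x * 2.0

    step(jax.numpy.ones(3))   # the first run compiles and caches; later runs hit the cache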

Dominance hierarchy talks

I gave talks about my dominance hierarchies paper at the Tufts CS colloquium (remotely) and at the IAAI 2024 conference in Israel. They were well-received, especially at IAAI. I got lots of enthusiastic questions and comments.

My goals for this month

Let's pause the goals for now. I'll just be working on POLA.

That's it for now. See you next month!

Ram.
