Research update: 12 Angry Agents, Corridor environment

Ram Rachum

Oct 20, 2025, 6:05:04 AM
to ram-rachum-res...@googlegroups.com

Hi everyone!

Early update this month.

In the last monthly update, I said that David and I were working on our next two posts in the conservative alignment sequence. These posts are now finished and published on LessWrong!

I'm especially proud of the third one. It's an attempt to design an AI with empathy, using the movie 12 Angry Men as a case study. I hope you read it.

Breakdown

I've been working on my explainability method. I'm thinking that for the paper, I should demonstrate it in two environments: one that's as simple as possible, so it's easy to understand how the method works, and one that's closer to an interesting real-world problem that people will care about. Let's talk about the simple environment today, and hopefully talk about the interesting environment next time.

The simple environment: Corridor

For the simple environment, I chose a variant of the Corridor environment from the classic RL textbook, "Reinforcement Learning: An Introduction" by Sutton and Barto. Here is how that environment is introduced in the book:

[Image: the short-corridor example as introduced in Sutton and Barto]

What's interesting in this environment is that the agent should generally learn to go right, except that because of the reversed state it should sometimes go left. The agent is blind, so it doesn't know which state it's in. The best it can do is act randomly with some optimal distribution, hoping that it chooses right in the normal states and left in the reversed state. This is a good case study for our exploration of breakdown and steerage, because that one reversed state is the only state steering the agent towards choosing left.
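To make the dynamics concrete, here is a minimal sketch of the environment logic. It's a simplified illustration rather than my actual implementation: the -1-per-step reward and the reversed second state follow the textbook example, and the corridor length is a parameter so the same sketch also covers the stretched version described below.

class Corridor:
    """A corridor of positions with a -1 reward per step. One designated state
    swaps the effect of the two actions, and the agent receives no observation,
    so it can't tell which state it's in."""

    def __init__(self, n_states=8, reversed_state=1):
        self.n_states = n_states          # the rightmost position is the goal
        self.reversed_state = reversed_state

    def reset(self):
        self.position = 0
        return None                       # the agent is blind: no observation

    def step(self, action):               # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        if self.position == self.reversed_state:
            move = -move                  # the reversed state swaps the actions
        self.position = max(0, self.position + move)
        done = self.position >= self.n_states - 1
        return None, -1, done             # observation, reward, done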

I changed the environment slightly: I stretched it from a short corridor to a long one by adding 4 simple states, for a total of 8 states:

[Image: the stretched 8-state corridor]

The agent learns the optimal behavior very quickly:

[Plot plot_00781.jpg: the agent's probability of choosing "right" over training epochs]

At around epoch 18, the agent converges to choosing right with a probability of around 83%.
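For reference, here is roughly what the training looks like in spirit: a REINFORCE-style update on a single "probability of right" parameter, since the blind agent can't condition on the state. This is a simplified stand-in rather than my actual training code, and the hyperparameters and epoch structure are illustrative. It uses the Corridor sketch above.

import math
import random

theta = 0.0                                   # logit of P(right); the blind policy has one parameter
alpha = 0.05                                  # learning rate (illustrative)

def p_right(theta):
    return 1.0 / (1.0 + math.exp(-theta))

env = Corridor(n_states=8, reversed_state=1)

for epoch in range(100):
    grads, returns = [], []
    for _ in range(50):                       # episodes per epoch
        env.reset()
        done, steps, grad = False, 0, 0.0
        while not done and steps < 1000:      # cap episode length, just in case
            p = p_right(theta)
            action = 1 if random.random() < p else 0
            grad += action - p                # d/dtheta of log pi(action) for a Bernoulli-logit policy
            _, _, done = env.step(action)
            steps += 1
        grads.append(grad)
        returns.append(-steps)                # reward is -1 per step, so return = -steps
    baseline = sum(returns) / len(returns)
    theta += alpha * sum(g * (r - baseline) for g, r in zip(grads, returns)) / len(returns)

print(f"P(right) after training: {p_right(theta):.2f}")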

I then ran a breakdown analysis, checking which states steer the agent towards choosing right:

[Plot plot_00779.jpg: the steerage of the "right" action from each of the 8 states, over training epochs]

There are 8 plot lines; each line represents the effect of one of the 8 states on the agent's behavior. The X axis shows the learning epochs: the left side of the plot is the start of training and the right side is the end. The Y axis is the steerage of the "right" action: the higher a point is, the more that state is pushing the agent to learn to choose "right".

As expected, the reversed state, represented by the red line, has consistently negative steerage. What's interesting is that even though the agent converges to its final behavior at epoch 18, the steerages remain non-zero beyond that epoch, and probably forever. The steerage analysis shows how the agent's static behavior is actually a balance between multiple forces steering it in opposite directions: the reversed state steering the agent towards "left" is precisely balanced against the other states steering it towards "right".
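I won't define breakdown and steerage formally in this update, but to give a flavor of the kind of per-state quantity plotted above, here is a naive toy decomposition: attribute each timestep's policy-gradient signal for the "right" action to the state the agent was in at that timestep. This is just one possible way to compute such a decomposition, not the method as it will appear in the paper. It builds on the Corridor and p_right sketches above.

from collections import defaultdict
import random

def per_state_contributions(env, theta, episodes=50):
    """For each state, sum that state's share of the policy-gradient signal for
    the "right" action. A positive value means that the timesteps spent in that
    state pushed the policy towards choosing "right"."""
    records = []                                   # (per-state grad-log-probs, return) per episode
    for _ in range(episodes):
        env.reset()
        done, steps, per_state_grad = False, 0, defaultdict(float)
        while not done and steps < 1000:
            state = env.position                   # the analyst sees the state; the agent doesn't
            p = p_right(theta)
            action = 1 if random.random() < p else 0
            per_state_grad[state] += action - p    # this timestep's grad-log-prob
            _, _, done = env.step(action)
            steps += 1
        records.append((per_state_grad, -steps))   # return = sum of -1 step rewards
    baseline = sum(ret for _, ret in records) / episodes
    contributions = defaultdict(float)
    for per_state_grad, ret in records:
        for state, grad in per_state_grad.items():
            contributions[state] += (ret - baseline) * grad / episodes
    return dict(contributions)                     # positive = steers the policy towards "right"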

Trick question

When I initially analyzed the environment, I made an interesting mistake. I wrote code that answers the question: "Out of all the timesteps that the agent experienced, which steered it the most towards choosing the left action?" I expected the answer to be the timesteps in which the agent was in the reversed state, but that turned out not to be the case. Feel free to email me your guess as to what the real answer was, and why.


See you next month, Ram.
