Oh my God, I have been through so much in the last month that I barely remember what was the last thing I sent you.
I'm now sprinting towards preparing my paper for RLC 2026. I started the sprint on January 18th, and the deadline is March 5th. This means that 2 weeks have gone by out of 7 weeks. I have 5 weeks left. I paused most of my freelance work for EquiStamp, METR and Redwood Research. I've worked so hard and done so much over these past two weeks that it's all become a blur.
Here are the main things I've accomplished in these two weeks:
I started talking with people from the XRL (Explainable Reinforcement Learning) community who can advise me on the project.
I completely reframed the project to be 50% about the problem and 50% about the solution. (More details about that below.)
I changed my plans for the RL environments I intend to demonstrate my technique on.
I wrote a title, an abstract, and the first 5 pages of the paper (out of 12).
I originally developed the Experience Breakdown method because my agents weren't behaving reciprocally in a multi-agent setup and I wanted to know why, e.g. whether the environment simply wasn't incentivizing them to be nice to each other, or whether some other factor was at play. I developed EB in total ignorance of existing XRL methods, and it worked. When I started planning to spin it into its own paper, I thought the main thing I'd need to do would be to demonstrate it on an impressive environment like RLHF of LLMs.
I was also worried that someone might have already come up with this technique and I wouldn't get any credit for the work. When I started reading XRL surveys and comparing EB to existing techniques, I realized how wrong I was: not only has no one developed this technique, no one has even expressed the problem the way I did. This can be both a good thing and a bad thing for my paper.
Let me explain: The goal of XRL is to explain the behavior of an agent trained with an RL algorithm, i.e. to explain why the model does what it does. So far, this aligns with my vision. However, the field defines "explain the behavior" a little differently than I do. Before I add more detail, here's the bottom line: no XRL method is able to explain a pattern of behavior exhibited by a model. Not only that, XRL doesn't even provide a definition of "behavior" that the user might want explained!
Here are the 6 things that XRL algorithms can explain, or at least attempt to explain:
1. The model's weights and why they have the values that they do.
2. The model's objectives.
3. Why the model took a specific action in a specific episode.
4. Why the model took an entire sequence of actions in a specific episode.
5. The model's entire policy and its decision-making process.
6. The change in the model's entire policy from one epoch to the next.
Let's call these "explanation targets". You can think of them as six different problem formulations, and each XRL method picks one or more of them as the target it explains. However, these six options don't cover what I think is a very important use case: you have a model and it's doing something that doesn't make sense to you. For example, you have a robot that's supposed to walk straight, but it's limping. You want to know why it's limping.
Right off the bat we can rule out explanation targets 1, 2 and 6 because the weights, objectives and policy change are not what's perplexing to us right now.
Let's consider explanation target 3: "Why the model took a specific action in a specific episode." While we could point the algorithm at a particular action that was part of the limping motion, that would be ineffective, for three reasons:
Limping involves many physical actions over a period of time, so any specific action might have a good reason behind it; it's the actions as a whole that are the problem, and that's what we want explained;
The robot is limping in many different scenarios and conditions. If we ask for the limping to be explained in one specific instance, the explanation might not be representative of all instances of limping; and
When we point at a particular action related to limping, we are not making it clear what it is about the action that surprises us.
The last point is deep. This is something we take for granted when we communicate with other people, but we can't assume that machines implicitly understand the same things people do. If Person A punches Person B and we ask Person A "why did you do that?", it's clear to both us and Person A what we're asking. Person A could answer "because I didn't have enough maneuvering space to kick them." While that would be a badass answer, it clearly evades the real focus of the question, which is "why did you assault Person B?", not the particular technique.
When we explain people's behavior, we take for granted that we can refer to their behavior in a way that isn't tied down to any particular action they took. None of the six XRL explanation targets provides this kind of explanation; and because no XRL paper even provides a definition of "behavior", this isn't even recognized as a problem formulation within the field. That's why I rebranded the paper as BXRL: Behavior-Explainable Reinforcement Learning.
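To make the contrast concrete, here's a rough sketch of what a typical target-3 explanation looks like in practice. This is not my EB method and not from my paper; it's a generic gradient-saliency attribution, and the toy Q-network, parameters, and observation are all made up for illustration. The point is that the output is tied to one action in one state, so by construction it can't speak to a pattern like limping.

```python
# Generic illustration of explanation target 3 (not EB): gradient saliency
# attributes ONE action in ONE state to the input features. Everything here
# (the toy linear Q-function, the random observation) is made up.
import jax
import jax.numpy as jnp

def q_network(params, obs):
    # Toy linear Q-function over 4 discrete actions; stands in for a trained model.
    return obs @ params["w"] + params["b"]

def explain_single_action(params, obs, action):
    # "Why this action, here?": gradient of the chosen action's Q-value
    # with respect to the observation features.
    q_of_action = lambda o: q_network(params, o)[action]
    return jax.grad(q_of_action)(obs)

key_w, key_obs = jax.random.split(jax.random.PRNGKey(0))
params = {"w": jax.random.normal(key_w, (8, 4)), "b": jnp.zeros(4)}
obs = jax.random.normal(key_obs, (8,))            # one observation from one episode
action = int(jnp.argmax(q_network(params, obs)))  # the action the "policy" picked
print(explain_single_action(params, obs, action)) # per-feature saliency for that one action
```

Whatever numbers come out, they answer "why this action, in this state," not "why is the robot limping."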
I've got a lot more to say about this, but I need to get back to work. Right now I'm focusing on porting HighwayEnv to JAX so I can demonstrate EB on it.
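For a sense of what "porting to JAX" means in practice (this is a generic illustration, not my actual HighwayEnv port; the point-mass dynamics and the numbers are made up): the step becomes a pure, jit-able function over an explicit state, so many rollouts can be batched with vmap or scanned on an accelerator.

```python
# Generic sketch of the JAX-native environment pattern (not the real HighwayEnv
# dynamics): a pure step function over an explicit state, jit-ed and vmap-ed.
import jax
import jax.numpy as jnp
from typing import NamedTuple

class CarState(NamedTuple):
    x: jnp.ndarray  # position along the lane
    v: jnp.ndarray  # speed

DT = 0.1  # simulation timestep in seconds

@jax.jit
def step(state, accel):
    # Pure function: the next state and reward depend only on the inputs,
    # which is what makes jit/vmap/scan possible.
    v = jnp.clip(state.v + accel * DT, 0.0, 30.0)
    x = state.x + v * DT
    reward = v / 30.0  # toy reward: prefer driving fast
    return CarState(x=x, v=v), reward

# vmap runs a whole batch of rollouts in one device call.
batched_step = jax.vmap(step)
states = CarState(x=jnp.zeros(4), v=jnp.full(4, 10.0))
next_states, rewards = batched_step(states, jnp.array([1.0, 0.0, -1.0, 2.0]))
print(rewards)
```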
See you next month, Ram.