Research update: Pivoting to explainability


Ram Rachum

Aug 26, 2025, 2:19:13 PM
to ram-rachum-res...@googlegroups.com

Hi everyone,

I have a momentous update: I've been feeling frustrated about the lack of progress in my research. I just can't get the agents to form groups with each other. I've tried lots of different things, but it's just not working. I've felt for a while that I needed to pivot away from this somehow, but I was very resistant to doing so because I wanted to keep trying.

Finally I'm ready to admit that I need to step back from this problem. It's difficult, but it's probably the right thing to do.

I'm lucky that I found a good opportunity to pivot to. When I was working on getting the agents to do my bidding, I came up with a few explainability techniques for RL: methods for understanding why an RL agent learns the particular behavior that it does. I've been using them to try to get the agents to be more social, but they could be useful in any kind of RL setting. I'm now working on spinning them off into their own project, which is still within the realm of AI Safety.
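
To give a flavor of what "explainability for RL" can mean in general (not my specific techniques, which I'll describe next month), here's a minimal sketch of a classic perturbation-based probe: nudge each observation feature and see how much the policy's action distribution shifts. The policy, feature count, and action count below are made-up stand-ins.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def toy_policy(obs):
    # Stand-in policy: fixed random weights mapping a 5-feature observation
    # to probabilities over 4 actions. A real probe would wrap a trained agent.
    weights = np.random.default_rng(0).normal(size=(obs.shape[0], 4))
    return softmax(obs @ weights)

def feature_saliency(policy, obs, eps=0.1):
    # Perturb each feature by eps and measure how much the action
    # distribution moves (L1 distance). Bigger = more influential feature.
    base = policy(obs)
    scores = np.zeros(obs.shape[0])
    for i in range(obs.shape[0]):
        nudged = obs.copy()
        nudged[i] += eps
        scores[i] = np.abs(policy(nudged) - base).sum()
    return scores

obs = np.array([0.2, -1.3, 0.7, 0.0, 2.1])
print(feature_saliency(toy_policy, obs))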

Cam and I asked a bunch of researchers in three different labs about these techniques, and they all said they're interesting, novel, and worthy of a paper. We also asked two high-caliber RL researchers, and they confirmed that they hadn't heard of anything like them. I really hope that this new project will be successful. And I hope that when I get back to the social behavior of RL agents, I'll have fresh eyes to look at the problem with.

Next month I'll update with more details about this project. The next steps are to plan the paper, strategize about which environment to demonstrate the techniques on, and port the code to that environment. I hope to have a paper ready in time for the IASEAI 2026 deadline on October 8th.


A Conservative Vision For AI Alignment

Here's a little side project I've been working on and haven't mentioned here yet: David Manheim and I wrote a LessWrong post titled A Conservative Vision For AI Alignment. Here's a Claude summary of it:

The authors argue that current approaches to AI alignment are rooted in liberal philosophy, and therefore aim to eliminate conflict and pain through optimization, potentially eroding the boundaries and tensions that give human life meaning. Drawing parallels between AI alignment and parent-child relationships, the post suggests that AGI should help communities navigate and channel disagreement rather than erase it, preserving the institutional structures and value conflicts that define human society. This conservative vision proposes that alignment isn't about convergence toward frictionless utopias, but about creating AI systems that respect and sustain the meaningful constraints and productive tensions inherent in human values.

I hope you'll find it interesting.

Flying back

I'll be flying back to Israel in 10 days. I'm a little sad that I haven't accomplished as much here as I wanted. I had a good time hanging out with people at CHAI, and I wish I had that kind of social environment back home. I hope to get accepted to similar programs, like MATS and Constellation, so I can have that kind of experience again.


See you next month, Ram.

Bonus meme
