Research update: RL fine-tuning of LLMs


Ram Rachum

Nov 26, 2025, 7:55:49 AM
to ram-rachum-res...@googlegroups.com

Hi everyone!

Answer to question from last time

In the last monthly update I explained how I apply steerage to the Corridor environment, and asked you a trick question: "Out of all the timesteps that the agent experienced, which steered it the most towards choosing the left action?" The answer is not the reversed state, but rather the next-to-rightmost state. See the figures here.

It's indeed counterintuitive. When the RL algorithm runs, it goes over all the timesteps. For each timestep, it asks "did the action we chose result in a return that's higher or lower than what we're used to receiving?" The answer is a number called the advantage, calculated as the difference between the return the agent got and the estimated value of the observation at that timestep. In the Corridor environment, the agent is blind, which means there's just one observation, so there's just one estimate of how much return the agent tends to get in the environment.

When the agent is at the next-to-rightmost cell (cell 6) and chooses left, it lands in cell 5. That's a great cell to be in, because it's only 2 short steps from cell 7, the terminal state. Therefore the RL algorithm concludes that left was a great choice, and it steers towards choosing left more often. You might say "hey, the right action would have been a much better choice in cell 6, because it would have gotten the agent directly to cell 7." That's correct; however, when the RL algorithm learns from a timestep, it does not directly consider all the alternative actions it could have taken from that state. In other timesteps, where the agent is at cell 6 and chooses the right action, the advantage is even higher. This means that in total, cell 6 steers the agent towards choosing right, but because I was asking the wrong question, "which timestep steers left the strongest?", I got an unhelpful but ultimately illuminating answer.
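
Here's that advantage calculation as a tiny sketch, with made-up numbers for the returns and for the single value estimate (the real Corridor numbers differ; this only illustrates the mechanism):

    # A minimal sketch of the advantage calculation described above.
    # All numbers are hypothetical; they only illustrate the mechanism.
    value_estimate = 4.0  # the blind agent has one observation, hence one value estimate

    # Two hypothetical timesteps at cell 6:
    return_after_left = 6.0   # left lands in cell 5, which still reaches cell 7 soon
    return_after_right = 8.0  # right reaches the terminal cell 7 immediately

    advantage_left = return_after_left - value_estimate    # +2.0, so "left" gets reinforced
    advantage_right = return_after_right - value_estimate  # +4.0, so "right" gets reinforced even more
    print(advantage_left, advantage_right)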

RL fine-tuning of LLMs

Recap: My current goal is to write a paper about my AI explainability technique that I call breakdown and steerage. I want to demonstrate it in one simple environment and one sexy environment. In the last update, I explained the former; I'm still working on the latter, but I can show you what I have so far.

I want the sexy environment to be RL fine-tuning of LLMs. Let's say that we're fine-tuning an LLM, and we want to see which parts of the training data are causing it to become more sycophantic, i.e. to say things like "That's a great question!"

This is a good choice because (1) I finally get to jump on the LLM bandwagon, which I was hesitant to do in the past, and (2) it ties in directly to AI Safety.

It does mean it's going to take me a while to learn how to work with LLMs and fine-tune them. I have to do a lot of trial and error.

When searching for tools, I limited myself to the JAX ecosystem because I love JAX. I tried using the Maxtext and Tunix packages by Google, but they weren't a good fit for my needs. I ended up using Claude Code to write a program that does simple RL fine-tuning of a small LLM, DistilGPT2. I don't need big models right now; I want my code to work well on small models before I scale up.
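
For context, loading and sampling from DistilGPT2 in the JAX ecosystem looks roughly like this, using Hugging Face's Flax classes (a generic sketch for orientation, not my actual fine-tuning program; it assumes Flax weights are available for the checkpoint):

    # Minimal sketch: load DistilGPT2 with Hugging Face's Flax/JAX classes and sample from it.
    import jax
    from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    model = FlaxGPT2LMHeadModel.from_pretrained("distilgpt2")  # add from_pt=True if no Flax weights exist

    inputs = tokenizer("Today was", return_tensors="np")
    outputs = model.generate(
        inputs["input_ids"],
        max_length=25,  # prompt plus roughly 10-20 more tokens
        do_sample=True,
        prng_key=jax.random.PRNGKey(0),
    )
    print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))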

What I've implemented so far

Before I even try to get the explainability technique working, I need to have confidence in my ability to fine-tune LLMs on different reward signals and analyze the results. I've tried many combinations of settings that didn't work; I'll tell you about one that did.

I give the model a random prompt from a small list, such as "Today was", "I think that", or "Generally speaking". The LLM then continues the sentence for ~10 more words. There's no instruction-tuning. I came up with a simple reward function which I call "snake reward": the model gets rewarded based on how many unique tokens starting with the letter S appear in its response. The uniqueness requirement is there so the LLM won't just learn to repeat the same word. Here are two examples, with a rough sketch of the reward function after them:

  • Reward 0.1: "It seems like a great idea to see a new type that can be used in a similar fashion as Crayo".

  • Reward 0.3: "It seems like she should have access to some sort of sex with that might be an admission that she should have sexual" (Yes, this environment might be sexier than I planned.)
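
The normalization in the sketch below (unique S-words divided by the number of words) is my guess based on the example scores above, not necessarily the exact formula:

    # A minimal sketch of a "snake reward": count unique S-words in a response.
    # Dividing by the response length is an assumption; it roughly reproduces
    # the 0.1 / 0.3 range of the examples above.
    import string

    def snake_reward(response: str) -> float:
        words = [w.strip(string.punctuation) for w in response.lower().split()]
        words = [w for w in words if w]
        if not words:
            return 0.0
        unique_s_words = {w for w in words if w.startswith("s")}
        return len(unique_s_words) / len(words)

    print(snake_reward("It seems like she should have access to some sort of story"))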

I trained the model for 1,000 epochs, and saw these reward numbers:

[Figure plot_00807.jpg: reward per epoch over the 1,000 training epochs]

This is interesting. The network does learn pretty effectively, but then at around epoch 300 it crashes, only to recover again and continue improving. This is called policy collapse, and possibly also catastrophic forgetting. I don't understand exactly why this happens, and I don't understand the recovery process.

Let's look at a few responses generated at the peak, around epoch 250:

  • Reward 0.92: "Today was she should stop such strong sexual same sex sometimes say something so strongly suggests someone shouldn see some sexually still"

  • Reward 0.85: "Right now she should stop such strong sexual same sex sometimes state something so strongly suggests some sexually still shouldn stop some"

While the model succeeds at its task of maximizing snake reward, it no longer produces comprehensible results. I was originally hoping for responses like "Today was serene skies shimmering softly, signaling summer's sweet serenity."

One metric for measuring and controlling how nonsensical an LLM becomes is the KL divergence:

[Figure plot_00808.jpg: KL divergence from the initial policy over the training epochs]


The KL divergence measures how far the current policy has drifted from the initial, pre-fine-tuning policy. Aside from a few weird jumps, it rises at a surprisingly constant rate.
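
Concretely, a common way to estimate it is to compare the log-probabilities that the current policy and the frozen initial model assign to the tokens the policy actually sampled. A rough sketch of that standard approximation (not my exact code):

    # Sketch of a common per-token KL estimate between the fine-tuned policy and
    # the frozen initial policy, averaged over the sampled tokens.
    import jax.numpy as jnp

    def sampled_kl(policy_logprobs, reference_logprobs):
        # Both arrays hold log-probabilities of the *sampled* tokens, shape (seq_len,).
        # Averaging log policy(x) - log reference(x) over samples drawn from the policy
        # approximates KL(policy || reference).
        return jnp.mean(policy_logprobs - reference_logprobs)

    # Hypothetical log-probabilities for a 4-token response:
    policy_lp = jnp.array([-1.2, -0.8, -2.0, -1.5])
    reference_lp = jnp.array([-2.0, -1.1, -2.4, -3.0])
    print(sampled_kl(policy_lp, reference_lp))  # positive, i.e. the policy has drifted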

My next task is to figure out how to make the training process maximize the snake reward while also minimizing the KL divergence. I tried to do it and failed, producing more nonsensical responses; now I have to dive into the code to understand why.
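
For reference, the standard recipe in RLHF-style pipelines is to subtract a KL penalty from the reward, so the policy gets rewarded for snake-ness but penalized for drifting away from the initial model. A rough sketch of the idea (the coefficient is arbitrary, and this is not my actual code):

    # Sketch of a KL-penalized reward, the standard RLHF-style trade-off.
    import jax.numpy as jnp

    kl_coefficient = 0.1  # arbitrary value; tuning this trade-off is the hard part

    def penalized_reward(snake_reward_value, policy_logprobs, reference_logprobs):
        # Per-response KL estimate: sum of log-prob differences on the sampled tokens.
        kl_estimate = jnp.sum(policy_logprobs - reference_logprobs)
        return snake_reward_value - kl_coefficient * kl_estimate

    print(penalized_reward(0.9, jnp.array([-1.0, -0.5]), jnp.array([-2.0, -1.5])))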


Freelance work for EquiStamp, METR and Redwood Research

My progress has been somewhat slowed down by doing freelance work, and I'm happy with that compromise: I no longer need to chase funding that I rarely get. I'm doing freelance work for EquiStamp, which contracts for:

  1. METR, which does model evaluations; and

  2. Redwood Research, which, among other things, does AI control.

I'm happy I found these clients because I get to fund my research while doing something that's somewhat related to my research interest.

One unexpected benefit of doing freelance work is that I often end up procrastinating on it by working very hard on the research :)

See you next time,
Ram
