I'm excited! After many months of working on my POLA project, feeling stuck and going in circles, I finally feel like I'm making great progress.
What changed things was that I decided to finally dive into a part of the algorithm that I've been treating as a black box: the objective function. For the last 6 months that I've been playing with POLA, I've been afraid of it. It uses something called "stochastic nodes", which sounds mysterious and complex. Here's the code if you'd like to take a gander.
I was afraid because I knew that in order to understand it, I'd have to read the paper about Loaded DiCE, which is a technique used in that code. But before I could read that paper, I'd need to read the paper about the original DiCE estimator that Loaded DiCE is based on. And before I could read that, I'd need to understand stochastic computation graphs, the framework in which DiCE is defined. So I'd need to read the paper about stochastic computation graphs too, but first I'd need to go back to Sutton and Barto and reread the chapters my memory was fuzzy about.
I'd been dreading that for a while... but at the beginning of October I decided to dive in. I had to make multiple passes over each paper, and I understood maybe 50% of what I read, but the result was dramatic. I was able to go into the Loaded DiCE code I was using and start playing with it, and I found myself making real progress in adapting it to my needs.
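For anyone curious, here's a minimal sketch of what the core DiCE trick looks like in PyTorch. This is my own illustration of the plain DiCE objective, not the Loaded DiCE code I linked above: the magic-box term evaluates to 1 in the forward pass, but differentiating the surrogate gives the usual policy-gradient terms, and higher-order derivatives come out right too, which is what opponent shaping needs.

```python
import torch

def dice_objective(logps, rewards):
    """Plain-DiCE surrogate for a single trajectory (a sketch, not Loaded DiCE).

    logps:   (T,) log-probabilities of the actions actually taken
    rewards: (T,) rewards received at each step
    """
    # Each reward r_t depends on the stochastic nodes (actions) up to time t,
    # so the magic-box term for step t uses the cumulative sum of log-probs.
    cum_logps = torch.cumsum(logps, dim=0)
    magic_box = torch.exp(cum_logps - cum_logps.detach())  # equals 1 in the forward pass
    return (magic_box * rewards).sum()
```

The forward value is just the return, but gradients of any order with respect to the policy parameters are correct score-function estimates. Loaded DiCE builds on this by working with advantage estimates rather than raw rewards, which cuts the variance.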
The two major products of this advance are:
1. I developed a new opponent shaping algorithm that I call Viola.
2. I'm working on an interpretability method that will help me understand why my agents learn the behaviors they do.
It's quite possible that Viola is going to be the focus of my next paper. It's an opponent shaping algorithm, which means its basic principle of operation is the same as POLA's, but it's a little different. I won't reveal exactly what's different because I want to leave that for the paper, but I am getting the agents to be more reciprocal with less training.
I've run my algorithm on 3 agents playing IPD against each other. This means that on each turn, each agent plays prisoner's dilemma separately against each of the two other agents. Here are the rewards for the 3 agents:
Right off the bat I love this plot. One of my challenges throughout this research has been to define what social behavior is. This challenge is ongoing: I'm able to define specific social behaviors, like dominance hierarchies, but there are always more behaviors that I only have a vague intuition for. I don't have a definition yet, but my intuition tells me that the way these agents move is probably what I'm looking for. It has a balance of stability and instability. For example, look at the period shortly after epoch 150. The rewards plateau, meaning that the agents found a position of stability, which they maintain until they all break from it together. One image it evokes is of cats fighting: they can stare menacingly at each other for minutes without moving, then suddenly break off and fight, moving very fast, until they settle into a new stable position.
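As a quick aside on the mechanics of this setup: each turn, every agent plays a separate prisoner's dilemma round against each other agent, and its reward for the turn is the sum over its two pairwise games. Here's a minimal sketch of that bookkeeping; the specific payoff values are an assumption for illustration, not necessarily the ones used in the run above.

```python
import numpy as np

# Row player's PD payoffs, indexed [my_action, their_action], with 0 = cooperate, 1 = defect.
# These particular values are an assumption for illustration.
PAYOFF = np.array([[-1.0, -3.0],
                   [ 0.0, -2.0]])

def pairwise_ipd_rewards(actions):
    """actions[i][j] is agent i's action against agent j on this turn.
    Each agent's reward is the sum of its payoffs from its two pairwise games."""
    n = len(actions)
    rewards = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                rewards[i] += PAYOFF[actions[i][j], actions[j][i]]
    return rewards

# Example turn: agent 0 cooperates with both opponents, while agents 1 and 2
# defect against agent 0 but cooperate with each other.
actions = {0: {1: 0, 2: 0}, 1: {0: 1, 2: 0}, 2: {0: 1, 1: 0}}
print(pairwise_ipd_rewards(actions))  # [-6. -1. -1.]
```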
Let's look at the cooperation rates for this run, focusing only on the rates between agents 0 (green) and 1 (red):
It's interesting that, on one hand, the agents do seem to be responsive to each other most of the time, e.g. between epochs 80 and 200. This is highly reciprocal behavior, which makes sense to us. On the other hand, sometimes they aren't responsive: between epochs 30 and 65 especially, agent 0 keeps cooperating with agent 1, even though agent 1 keeps defecting back. Why does agent 0 behave this way?
This leads me to my second topic: my new interpretability method. I need to be able to answer questions like the one I just posed. I need to be able to take a magnifying glass to any interaction between agents and explain, at least roughly, why each agent learned the particular behavior that it did.
After some trial-and-error, I came up with a method that does that. I won't reveal the working mechanism because I might be writing a paper about that method, but roughly speaking, it looks at an agent's learning process and finds, for each epoch, in which states the agent's behavior changed in the most lucrative way, and which action was responsible for the biggest gain in reward.
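To give a flavor of the kind of output I mean, here's a naive stand-in (emphatically not the actual method): given the policy's action probabilities before and after an epoch, plus per-state estimates of each action's value, you could score every state by how much the policy shift there increased expected value, and report the state and action behind the biggest gain.

```python
import numpy as np

def most_lucrative_change(policy_before, policy_after, action_values):
    """Naive stand-in for the real method. All arguments are arrays of
    shape (n_states, n_actions): action probabilities before and after
    one epoch of training, and estimated action values per state."""
    shift = policy_after - policy_before        # change in action probabilities
    gains = shift * action_values               # per-(state, action) change in expected value
    state = int(np.argmax(gains.sum(axis=1)))   # state whose policy shift gained the most
    action = int(np.argmax(gains[state]))       # action contributing the largest gain there
    return state, action
```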
I tested this method on the short-corridor environment from Sutton and Barto, which has 3 easy environment states and one confusing environment state. The algorithm was able to find the confusing state and use it to explain the agent's learning, even when I increased the number of easy states from 3 to 10.
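For reference, here's a minimal sketch of a short-corridor environment in the spirit of the one I used, with a configurable number of easy states and a single confusing state whose actions are reversed. The exact layout and the -1-per-step reward are assumptions based on the Sutton and Barto example.

```python
class ShortCorridor:
    """Corridor walk in the spirit of Sutton & Barto's short-corridor example:
    the agent tries to reach the goal on the right, but in one "confusing"
    state the left/right actions are reversed."""

    def __init__(self, n_easy_states=3, confusing_state=1):
        self.n_states = n_easy_states + 1       # non-terminal states: the easy ones plus the confusing one
        self.confusing_state = confusing_state  # index of the state where actions are flipped
        self.reset()

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                     # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        if self.pos == self.confusing_state:
            move = -move                        # the confusing state reverses the action
        self.pos = max(0, self.pos + move)      # can't walk off the left edge
        done = self.pos >= self.n_states        # walking past the last state reaches the goal
        return self.pos, -1.0, done             # -1 per step until termination
```

Increasing n_easy_states stretches the corridor without moving the single confusing state, which is roughly the 3-to-10 variation described above.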
The next challenge is to make this technique usable on Viola. That's harder because now we need an answer not only in terms of the focal agent's actions, but also the opponents'. This is my next task.
See you next month,
Ram