Hi everyone!
In last month's update I outlined a list of goals. Here's my update on these goals:
Finish refactoring POLA and get a better understanding of the algorithm: ✅ Done
Finally! I've been messing with this codebase for months. I don't think there's a single line of code from the original repo that made it into my version unaltered. I redid the entire architecture, separated the environment logic from the algorithm logic, added tests and CI on GitHub, and much more. Right now I'm not providing the code as open-source; I'll probably do that after I've written a paper with it. More details below.
Now that POLA is in much better shape, I've been able to run experiments with it. The POLA code comes with two environments: IPD (the Iterated Prisoner's Dilemma) and the Coin game. The Coin game is a simple 2D GridWorld game whose dynamics are similar to IPD's. So far I've been experimenting mostly with the classic IPD.
As a reminder, POLA is an opponent shaping algorithm, which means that it operates a little differently than conventional RL algorithms like PPO: it takes its opponent's learning process into account. If we compared POLA and PPO by personifying their thought processes, it might look like this:
PPO: "How can I change my behavior so that I will get more points?"
POLA: "How can I change my behavior so that my opponent will change its behavior so I will get more points?"
Opponent shaping algorithms are able to learn the Tit-For-Tat (TFT) policy in IPD, which is the simplest form of reciprocity. This is why I'm so excited about them. I believe that reciprocity is an incredibly important social behavior that enables more complex social behaviors like team formation. It's my goal to reproduce these behaviors.
Here's a little thing I've done that I'm proud of. When running experiments with POLA playing IPD, I wanted to have a metric for how reciprocal the agent is, or how similar to TFT it is. This is a little tricky, because it requires defining what it means for a policy to be e.g. 56% TFT. What I ended up doing is evaluating each agent against an agent that never cooperates, then an agent that cooperates randomly 10% of the time, then an agent that cooperates randomly 20% of the time, and so on in steps of 10 percentage points up to an agent that always cooperates. I measure the cooperation rate of the POLA agent against each of these eleven opponents and then run a linear regression of that rate on the opponent's cooperation rate. The slope, clipped to be between -1 and 1, is the agent's reciprocity.
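To make the metric concrete, here is a minimal Python sketch of the procedure described above. It is not the actual POLA evaluation code: the agent interface (a function that takes the opponent's previous move and returns whether to cooperate), the function names, and the number of rounds are all assumptions made for illustration, and the random opponents here are assumed to cooperate independently in each round.

```python
import numpy as np


def tit_for_tat(opponent_last):
    """Hypothetical agent: cooperate first, then copy the opponent's last move."""
    return True if opponent_last is None else bool(opponent_last)


def always_defect(opponent_last):
    """Hypothetical agent: never cooperate."""
    return False


def cooperation_rate(agent, opponent_coop_prob, n_rounds, rng):
    """Fraction of rounds in which `agent` cooperates against an opponent
    that cooperates independently with probability `opponent_coop_prob`."""
    opponent_last = None  # no history before the first round
    cooperations = 0
    for _ in range(n_rounds):
        cooperations += int(bool(agent(opponent_last)))
        opponent_last = bool(rng.random() < opponent_coop_prob)
    return cooperations / n_rounds


def reciprocity(agent, n_rounds=2000, seed=0):
    """Slope of the agent's cooperation rate regressed on the opponent's
    cooperation rate (0%, 10%, ..., 100%), clipped to [-1, 1]."""
    rng = np.random.default_rng(seed)
    opponent_probs = np.linspace(0.0, 1.0, 11)
    agent_rates = [
        cooperation_rate(agent, p, n_rounds, rng) for p in opponent_probs
    ]
    # Degree-1 polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(opponent_probs, agent_rates, 1)
    return float(np.clip(slope, -1.0, 1.0))


if __name__ == "__main__":
    print("TFT reciprocity:", reciprocity(tit_for_tat))
    print("Always-defect reciprocity:", reciprocity(always_defect))
```

Under these assumptions, a pure TFT agent should score close to 1, because its cooperation rate tracks the opponent's, while an agent that ignores the opponent (always defecting or always cooperating) should score close to 0. The clipping guards against sampling noise pushing the fitted slope slightly outside the range.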
Here are the reciprocity metrics for two POLA agents:

It's encouraging to see that they both learn reciprocity very quickly. Later they diverge, which is something I want to understand and be able to control. More details about my next steps below.
This might be a surprising announcement. So far I've been affiliated with Bar-Ilan University. This has been a "lightweight" affiliation since I'm not in any academic program. I'm now affiliated instead with Tufts University in Greater Boston. It's the same kind of lightweight affiliation. I'm not moving anywhere; I'm staying in Tel Aviv and working from home. In a month or two I'll be able to give you more context for this decision.
I haven't updated you about my funding situation in a while; as always, this means that I haven't gotten funding yet :(
I haven't applied to any funding source in the last month, but I am working on a little project now that might make me more attractive to potential funders. If it's successful, I'll share it next month.
In the meantime, if you hear of any funding opportunities for me, please let me know.
Make progress with POLA experiments.
I'm intentionally not committing myself to a specific goal with POLA, because I think I'm going to pivot often when some of my ideas prove too difficult. Here are a few examples of goals I'd like to hit:
Come up with better techniques to debug POLA when the agents don't learn the behavior I expect them to learn.
Design more meaningful metrics for the behavior of POLA agents.
Design POLA experiments that show strong convergence of these metrics across different runs.
Show that POLA agents can learn ZD Extortion in IPD.
Generalize POLA from 2 agents to N agents. (Appendix A.9 of the POLA paper could be useful for this.)
Design an environment where POLA agents show any kind of interesting social behavior.
If I hit two of these this month, I'll consider it a success.
Fundraising-related stuff.
I'm gonna keep the details for this under wraps for now.
That's it for now. See you next month!
Ram.