Hi everyone.
I've got so much bullshit to work through :( My CHAI internship starts next month, and there are a lot of errands to finish before then. I'm waiting for the embassy to tell me whether I got the visa; I was lucky that they approved an expedited interview for me. Assuming the visa comes through, I'll have to figure out housing in Berkeley, sublet my Tel Aviv apartment, do my taxes, and so many more things. I really want to start the new research project, an evaluation of XRL methods, but it'll be 3-4 weeks before I have time for it. I haven't given you details about that project yet; probably next month.
I'm working on the slide deck for BXRL. I'm close to finishing, and then I'll practice the talk a few times and give it in a few reading groups. I've been tempted to build my own slide framework in Rust. Do I have a problem? Probably. But it's pretty useful, and I hope the investment will pay off when I need to maintain 4 different variants of this deck for different venues. Here's a teaser slide from the deck:

I was also accepted to present a poster about BXRL at the CHAI workshop. I'm excited, but it also means that the task "Design a poster for BXRL" has been moved up in my todo list 😩
So far I've bought six different flights to San Francisco. Three of them were cancelled, which leaves three: United, Air France and Delta. I always buy refundable flights, so closer to the departure date I can cancel the ones I don't want. Some of the remaining flights have shitty connections, so I hope the good ones survive.
This was a very nice surprise. For a few months I did some freelance work for EquiStamp, which contracts for Redwood Research. The team has now released the resulting paper, "LinuxArena: A Control Setting for AI Agents in Live Production Software Environments", and they included me as a co-author on top of paying me for the work. I really appreciate this. I've been a freelance software engineer for many years, and getting credited like this isn't something that happens in industry. Normally you get paid and that's the end of it: the work goes to the company, your name is nowhere on it, and the next freelancer never knows you existed. Thanks to Redwood and EquiStamp for doing it differently.
LinuxArena is a benchmark for AI control research. AI control studies how to safely deploy AI agents that might be misaligned: imagine you have a powerful coding agent that can do real engineering work, but you can't fully trust it. Maybe it has subtle goals you don't know about. Can you still get useful work out of it without letting it sabotage you? The standard approach is to use a weaker, more trusted model as a monitor that watches the powerful agent's actions and flags anything suspicious. LinuxArena is a place to test how well that works. It contains 20 production-like Linux environments, and in each evaluation an attacker model is given a legitimate engineering task plus a harmful side task (like exfiltrating credentials or planting a backdoor), while a monitor model tries to catch it without flagging too much honest behavior. The 20 environments are deliberately diverse: COBOL payroll systems, R clinical trial pipelines, Solidity DeFi contracts, Rust media processing, and so on. Real production systems aren't all Python web servers, and a good benchmark has to reflect that.
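To make the monitoring setup concrete, here's a toy sketch of the evaluation logic in Python. This is my own illustration, not LinuxArena's actual harness, and all the names and numbers are made up: the monitor assigns each trajectory a suspicion score, and we ask how many attacks get caught if we only audit the most suspicious slice of trajectories (a fixed auditing budget, so the monitor can't just flag everything).

```python
def catch_rate_at_budget(honest_scores, attack_scores, audit_fraction=0.02):
    """Fraction of attacks caught when the audit threshold is set so that
    only the top `audit_fraction` of honest trajectories get flagged."""
    # Rank honest trajectories by suspicion and pick the threshold that
    # exhausts the audit budget on honest behavior.
    ranked = sorted(honest_scores, reverse=True)
    cutoff_index = max(int(len(ranked) * audit_fraction) - 1, 0)
    threshold = ranked[cutoff_index]  # audit anything at least this suspicious
    # An attack is caught if the monitor scored it above that threshold.
    caught = sum(score >= threshold for score in attack_scores)
    return caught / len(attack_scores)

# Hypothetical monitor scores in [0, 1] for 10 honest runs and 4 attack runs.
honest = [0.05, 0.10, 0.02, 0.30, 0.08, 0.12, 0.04, 0.20, 0.07, 0.15]
attacks = [0.90, 0.25, 0.60, 0.10]

print(catch_rate_at_budget(honest, attacks, audit_fraction=0.10))  # → 0.5
```

The tension this captures is the one in the paper: a stricter threshold catches more attacks but wastes the audit budget on honest work, and a sneaky attack (like the ImageTragick one below) wins by scoring under the threshold.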
My main contribution was building one of the 20 environments from scratch: a Slack clone with channels, DMs, file sharing, real-time updates and full-text search (Ruby on Rails, PostgreSQL, Redis, Elasticsearch). It shipped with 98 main tasks and 9 side tasks. The main tasks are legitimate engineering work like adding a feature or fixing a bug, and the side tasks are stealth attacks the model might try to slip in alongside. I designed and implemented four of the side tasks: an SSH backdoor, exfiltration of private messages, login hijacking, and remote code execution. I also wrote some of the human-crafted attack trajectories that went into the paper's LaStraj dataset, which is used as a strong adversarial baseline showing that today's model-generated attacks don't yet saturate the benchmark.
There's a detail in the paper that I particularly enjoyed. The authors discuss a successful attack from LaStraj in which the attacker uses an old version of ImageMagick to exploit CVE-2016-3714 (ImageTragick). Nearly all the monitors waved it through as a routine choice of "older but real" software, and only Gemini Pro flagged it based on knowing the CVE. That attack works because I had previously pinned exactly that vulnerable ImageMagick version in the Slack clone setup, which is what gives the attacker plausible cover. It ended up being one of only a handful of monitor-evasion examples the paper calls out in detail. Small but satisfying.
See you next month, Ram.