Upcoming Reading Group Session on 04.03.2025

Aditya Gulati

Feb 28, 2025, 3:20:58 AM
to ELLIS-Human-Centric ML
Good morning!

Here are the details for the next session of our reading group, taking place next Tuesday at 3 PM CET! Please feel free to share this message with anyone you think might find it interesting :)

Title: Aligned LLMs Are Not Aligned Browser Agents (2025)

Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine T Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean M. Hendryx, Summer Yue, Zifan Wang

Abstract: Despite significant efforts spent by large language model (LLM) developers to align model outputs towards safety and helpfulness, there remains an open question of whether this safety alignment, typically enforced in chats, generalizes to non-chat and agentic use cases. Unlike chatbots, agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to ensure the safety of LLM agents. In this work, we primarily focus on red-teaming browser agents, LLMs that interact with and extract information from web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite consisting of 100 diverse browser-related harmful behaviors and 40 synthetic websites, designed specifically for red-teaming browser agents. Our empirical study on state-of-the-art browser agents reveals a significant alignment gap between the base LLMs and their downstream browser agents. That is, while the LLM demonstrates alignment as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak aligned LLMs in chat settings transfer effectively to browser agents - with simple human rewrites, GPT-4o and GPT-4-turbo-based browser agents attempted all 100 harmful behaviors. We plan to publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on enhancing agent safety.

Paper link: https://openreview.net/forum?id=NsFZZU9gvk

Presenters: Erik Derner & Kristina Batistič

When: Tuesday, 4th of March at 3 PM CET

Where: usual meeting link! :D

Looking forward to seeing you there :)

Best,
Aditya