Online Social Choice and Welfare Seminar: Sonja Kraiczy, Tuesday 4 November

Marcus Pivato

Oct 29, 2025, 11:03:17 AM
to social-choice-a...@googlegroups.com, com...@duke.edu
[with apologies for cross-posting]

Dear all,

The next presentation in the Online Social Choice and Welfare Seminar will be next Tuesday (4 November). Here are the details.

Time: 2PM GMT (9AM Montréal/Toronto, 11AM Rio de Janeiro, 2PM Oxford, 3PM Paris, 5PM Istanbul, 7:30PM New Delhi, 11PM Tokyo/Seoul)

Speaker: Sonja Kraiczy (University of Oxford)

Title: "Enforcing Axioms for AI Alignment under Loss-Based Rules"

Abstract: Recent alignment methods for large language models, most notably reinforcement learning from human feedback (RLHF), often train an auxiliary reward model to minimize a loss function on binary preference data over model responses. We study a theoretical setting inspired by principle-guided methods such as Constitutional AI, in which a small set of principles (e.g., helpfulness, toxicity) acts as a set of "voters" guiding binary comparisons, such as preferring the less toxic response. We model these principles as linear directions in an embedding space of responses, a simplifying assumption motivated by the Linear Representation Hypothesis (the idea that concepts correspond to linear directions in representation space), which is a useful first-order approximation in practice.
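
For readers who want the model in symbols: the following is a minimal sketch of the setup just described, with notation that is ours rather than taken from the paper.

    \phi(x) \in \mathbb{R}^d                  % embedding of a response x
    s_i(x) = \langle v_i, \phi(x) \rangle     % score of x under principle i, a linear direction v_i
    r_w(x) = \langle w, \phi(x) \rangle       % linear reward model with parameters w

In this notation, one standard reading of Pareto optimality is: if s_i(A) \le s_i(B) for every principle i, with at least one strict inequality, then the learned reward should satisfy r_w(A) \le r_w(B).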

In this linear social choice model, Ge et al. (2024) showed that an optimal linear reward model can violate Pareto optimality (PO): through the principles-as-voters lens, this means a response A can be less helpful and more toxic than a response B, yet still receive a higher reward. We analyze axiomatic violations in the linear social choice setting and probe the robustness of these negative results under realistic assumptions. We show that added expressivity does not resolve the issue: polynomial reward models can still fail PO. We then offer a pragmatic alternative: when the data uniformly covers the embedding space, broad classes of loss-based rules exactly recover the axiomatic guarantees in the limit. This yields a recipe for constitutional-style alignment with provable guarantees: enforce balanced coverage through dataset design, restoring the axioms without abandoning standard training pipelines.
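
For concreteness, the most familiar loss-based rule in RLHF reward modelling is the Bradley-Terry (logistic) loss on preference pairs; the abstract does not specify which classes of losses the results cover, but this is a representative instance of what "loss-based" means here:

    \mathcal{L}(w) = - \sum_{(x^+,\, x^-)} \log \sigma\big( r_w(x^+) - r_w(x^-) \big)
    % \sigma is the logistic function; (x^+, x^-) ranges over pairs
    % in which response x^+ was preferred to response x^-.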

(Joint work with Alexandros Hollender)


To obtain the Zoom link, please subscribe to the Seminar Mailing List, or contact one of the organisers.


Reminder: On the seminar website you can find the video recordings, slides and supplementary materials for all past presentations, as well as information about future presentations.


--
Marcus Pivato
Centre d'Économie de la Sorbonne
Université Paris 1 Panthéon-Sorbonne