Online Social Choice and Welfare Seminar: Sonja Kraiczy, Tuesday 4 November
Marcus Pivato
Oct 29, 2025, 11:03:17 AM
to social-choice-a...@googlegroups.com, com...@duke.edu
[with apologies for cross-posting]
Dear all,
The next presentation in the Online Social Choice and Welfare Seminar will be next Tuesday (4 November). Here are the details.
Speaker: Sonja Kraiczy
Time: 2PM GMT (9AM Montréal/Toronto, 11AM Rio de Janeiro, 2PM Oxford, 3PM Paris, 5PM Istanbul, 7:30PM New Delhi, 11PM Tokyo/Seoul)
Title: "Enforcing Axioms for AI Alignment under Loss-Based Rules"
Abstract:
Recent alignment methods for large language models, most notably reinforcement learning from human feedback (RLHF), often train an auxiliary reward model to minimize a loss function on binary preference data over model responses. We study a theoretical setting inspired by principle-guided methods such as Constitutional AI, in which a small set of principles (e.g., helpfulness, toxicity) acts as “voters” that guide binary comparisons, such as preferring the less toxic response. We model these principles as linear directions in an embedding space of responses, a simplifying assumption motivated by the Linear Representation Hypothesis (concepts are linear directions in representation space), which is a useful first-order approximation in practice. In this linear social choice model, Ge et al. (2024) showed that an optimal linear reward model can violate Pareto optimality (PO): from the principles-as-voters lens, this means that a response A can be less helpful and more toxic than a response B, yet still receive a higher reward. We analyze axiomatic violations in the linear social choice setting and probe the robustness of negative results under realistic assumptions. We show that added expressivity does not resolve the issue: polynomial reward models can still fail PO. We then offer a pragmatic alternative: when the data uniformly covers the embedding space, broad classes of loss-based rules exactly recover the axiomatic guarantees in the limit. This yields a recipe for constitutional-style alignment with provable guarantees: enforce balanced coverage via dataset design, without abandoning standard training pipelines.
To obtain the Zoom link, please subscribe to the Seminar Mailing List, or contact one of the organisers.
Reminder: On the seminar website you can find the video recordings, slides and supplementary materials for all past presentations, as well as information about future presentations.