Hi Ben,
Congrats on the ludicrously compelling and meticulous work. It is a monumental contribution, and likely to be seminal. I have some initial questions/thoughts that cropped up while reading; I hope you don't mind my sharing them. None of it is withering criticism, I promise.
- does "prompt-based personalization" matter for some issues but not others? Or did it vary by initial belief strength? I'm admittedly surprised that there wasn't more of an effect there, since I think under a roughly Bayesian model of belief-updating then tailoring an informational argument to someone's priors would help. one worry is explanatorily-helpful details becoming lost in the scale of your stimuli, that is, I'd imagine many people didn't have well-formed opinions about some of the political topics discussed (especially because the issues were chosen to ensure most people had moderate support), so perhaps in those cases they didn't have much of a prior belief to target? the evidence you provide is very strong, but I do still find it hard to believe that tailoring an argument's informational content to someone's strong priors wouldn't help? And perhaps relatedly, I wonder if a stronger form of prior-tailoring, like based on someone's larger causal model of politics and behavior, would have more meaningful returns.
- Semi-relatedly, it might be interesting to look at how the format of the DV question shapes these results. How did DV questions vary in structure, rather than content? Were some very specific and others quite broad? Did some turn on essentially factual disputes ("increasing housing supply will reduce homelessness") while others were moral or politically strategic? I'm not sure "political issues" is a coherent category. The scale of your data should allow this kind of moderation analysis (sketch 2 below shows roughly the model I mean), and it could produce some very interesting results.
- How did the distribution of persuasive effects change with model scale, fine-tuning/RM, etc.?
- Also, it's incredibly cool to see that both ML approaches improved persuasion. How are you thinking about the ambiguity between "people who are more likely to change their minds liked particular kinds of messages" (which you then trained on) and "certain messages are more persuasive for everyone"? Put another way, is it possible that the ML approaches mostly squeezed extra juice out of ready-to-be-persuaded people? (Sketch 3 below is the toy version of the confound I'm worried about.)
- Can you not do supervised fine-tuning for proprietary models? I was under the impression that, e.g., the OpenAI platform allows for this (sketch 4 below is roughly what I had in mind).
- You say " Second, we used 56,283 additional conversations (covering 707 political issues) with GPT-4o to fine-tune a reward model (RM; a version of GPT-4o) that predicted belief change at each turn of the conversation, conditioned on the existing dialogue history." -- but I think this language makes it sound like you collected turn-by-turn belief change scores to do RM with (which you didn't, I think?). This is really minor but I got super excited about the prospect of a turn-by-turn dataset.
So much to dig into! But these were my first impressions.
Cheers,
Tom