Prompt from new AI reviewer paper

Thomas Costello

May 21, 2026, 3:42:53 PM
to Human Cooperation Lab
Just saw this working paper showing that GPT-5.2 reaches expert level in peer review, based on 45 scientists spending 469 hours evaluating human and AI reviews of 82 papers. "Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..."

Dug into the method to find the prompt, which is here. You run it with a coding agent that has access to the repository (paper text, data, code, whatever). Probably worth doing this before submitting papers? 

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig
With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.