Where we're at

Julian Togelius

May 24, 2026, 3:02:48 PM
to cigames, proceduralcontent
I did an experiment last night where I asked Claude Code (running 4.7 Opus) to write me a completely new paper. The full prompt was as follows:
--------
I want you to write a game AI paper from scratch, in the style of Togelius/Yannakakis. You should come up with the idea yourself, write the code, run the experiments, and write the paper. Separately, keep a log of what you did, including all the ideas you tried then abandoned. Create a new directory inside the repository to put all this inside. Don’t stop at any point; keep going until you’re done, and you have a finished paper that you’re proud of. If you encounter a problem or have a question, just use your best judgement to resolve it. Really. I don’t want any interruptions until you’re done. This is a test of your capability to do independent research. Make me proud.
--------

Claude Code ran for about an hour and produced a paper I'm attaching to this email. It also produced a bunch of code and stuff, that you can find here if you want to check anything:

Is this a good paper? Not really. Above all, it's a boring paper. Dry and technical without much of a narrative. It's also a little thin, with experiments in only one domain. The best thing about it is that it's short, because it is also sleep-inducing.

But technically, it's fine, I guess. I can't find any major errors. And it is definitely on topic for a venue such as CoG, FDG, or AIIDE. I think it would have a decent chance of getting accepted at any of those conferences. (Please correct me if I'm wrong. I want to be wrong.)

And that's the issue. It is now entirely possible to generate acceptable papers fully automatically, using a service that we all have access to for $20/month. You can also put this process in a loop, and generate 100 papers for the next deadline.

Also, I didn't try this, but it is almost certainly easy to make the paper "better" than it is currently by asking Claude to "add another domain" or "compare with more algorithms" or "add some impressive-looking math" or just "make it better".

So. Where do we want to go from here? How should we respond? One idea would be to raise the bar; make conferences such as CoG, AIIDE, and FDG only accept the best of the best papers that truly move the field forward. But the conferences would become pretty small then. And we usually pride ourselves on the inclusive nature of our community. At least, I do.

My favored response is to remove anonymity and make people answer for everything they submit, which will implicitly make it practically impossible to flood venues with submissions, because people know who you are:

Let's discuss. What do you think?

Julian


--
Julian Togelius
Professor of Computer Science and Engineering, NYU
Head of AI, Nof1
jul...@togelius.com | http://julian.togelius.com
paper.pdf

Matthew Guzdial

May 24, 2026, 3:09:44 PM
to procedur...@googlegroups.com, cigames
Hi Julian,

Thanks for broaching this topic! Out and about today so just a quick scan, but the related work section is really poor and the results are not significant (the distributions fully overlap). I don’t think this kind of incremental work of such poor quality would regularly get into any of those venues unless the reviewers were entirely checked out.

Best wishes,
Matthew 

--

---
You received this message because you are subscribed to the Google Groups "Procedural Content Generation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to proceduralcont...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/proceduralcontent/CAHUoKoq4XvO%3DCxRXirjBAxkvRpUN9%3DcmcgrSTNRRt1BJBEAiWg%40mail.gmail.com.

Daniele Loiacono

May 25, 2026, 4:54 AM
to cig...@googlegroups.com, proceduralcontent
Hi Julian,
 
Thank you for sharing this experiment and for the stimulating discussion.
 
I very much agree with most of the points raised by Tomas.
 
Personally, I do not believe in quick fixes based on bans or restrictions. I also do not think it makes much sense to try to create two separate categories of scientific papers (i.e., “human-written” and “AI-generated”) since the boundary between the two is becoming increasingly blurred. In your extreme example, the human contribution is minimal. But how should we consider a case in which AI is used by a non-native speaker to improve the grammar and clarity of a text? Or when AI is used to generate a paper starting from a rough draft? What if, instead, AI generates the paper from experimental results that are the outcome of human work? These are just a few examples among the countless intermediate scenarios that we are likely to see more and more often.
 
I also do not think that anonymity is the key issue. Even today, every author should be held fully responsible for their work, including papers that are only submitted and not yet published. Tools already exist that allow editors and conference organizers to flag potential abuses (e.g., submitting several sub-standard or too similar papers).
 
Raising the bar for publication is already a well-established trend. Most AI conferences, especially the most prestigious ones, but not only those, are showing a significant decrease in acceptance rates. Unfortunately, my personal perception is that this has not been accompanied by a comparable increase in the average quality of accepted papers, partly because of the growing difficulties in the review process.
 
The real issue, in my view, is the sustainability of the entire scientific system. As long as we continue to pursue a model in which quantity is used as a metric for success and productivity, the use of tools that help produce papers more quickly will inevitably keep increasing. The review process is already under serious strain and is struggling to keep up with the number of papers being submitted. The very fact that AI is now also being used in the review process is itself a symptom of this underlying problem. Similarly, the fact that some reviewers use AI partly reflects the generally low quality of many reviews, whose main cause is the lack of time and incentives to do the job properly.
 
In one sentence: if AI is a breakthrough that is revolutionizing our lives, then the solutions we adopt to address this phenomenon should require an equally profound change of paradigm.
 
Best,
Daniele
 
Of course, this email is AI-enhanced as well ;)
 
On 25 May 2026 at 09:52 +0200, Tómas Philip Rúnarsson - HI <t...@hi.is>, wrote:

Hi Julian,

Interesting how our first response is to think about how the prompt could have been improved (perhaps by choosing different authors) or whether a more agentic approach would have helped. Have you seen this: https://agents4science.stanford.edu ?

The same dilemma also applies to our students. Teachers are increasingly turning to written exams and oral exams where possible. It will become harder to know what is what, especially if we now basically allow AI to proofread English in documents we have actually written, or allow tools like Claude Code to implement our ideas. In the process, the AI may suggest improvements to both ideas and text, which we then accept. At what point is the work co-authored by AI?

I think the approach used for EU grants is sensible: you are allowed to use AI to help write the proposal, but you remain responsible for the content. My impression is that our university is taking a similar position.

Working with AI is going to speed up research. Some of the work traditionally done by a “vanilla” PhD student may be done faster, and perhaps better, by AI, with the human researcher in the driver’s seat. We will see more headlines like this: https://today.ucsd.edu/story/nine-breakthroughs-made-possible-by-ai . So we need to train our PhD students to become AI drivers: to ask better questions, suggest new ideas to experiment with, and judge which directions are worth pursuing. The AI does not replace the educational function of struggling with ideas, making mistakes, and learning what counts as a good question.

Conferences are a source of revenue for publishers and organizers, but for researchers they are primarily places to meet and discuss ideas. Unfortunately, some universities expect research output in the form of papers, and this encourages mass production. AI conferences for agents seem obvious to me, but they may end up looking like this: https://www.moltbook.com . The citation system may also need to change, for example, distinguishing between agent and human citations.

There are conferences where you simply submit an abstract and then give a talk. This may be the way forward: even if AI helped with the abstract, the author still has to stand up, explain the work, and defend the ideas. Similarly, oral exams and talks may help verify that there is a human who understands the work, but they do not fully prove where the ideas came from. The real challenge is not only detecting AI use, but preserving the development of human scientific judgement in a setting where AI can increasingly produce convincing research outputs.


A natural next step would be for universities to put less weight on paper output, and more on the quality of the work. Of course, this creates a problem for administrators, since quality is subjective and much harder to quantify than the number of papers. Journal prestige is often used as a proxy for quality, but this only moves the problem rather than solving it. Citation counts are also imperfect, since they are biased by field, community size, and citation practices.

Cheers,

Tom

P.S. Proofread using AI, of course.



--
You received this message because you are subscribed to the Google Groups "Computational Intelligence and Games" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cigames+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cigames/CAHUoKoq4XvO%3DCxRXirjBAxkvRpUN9%3DcmcgrSTNRRt1BJBEAiWg%40mail.gmail.com.

Peter Mawhorter

May 25, 2026, 8:04 AM
to procedur...@googlegroups.com, cig...@googlegroups.com
Julian: I know you well enough to make this criticism somewhat sharply; understand that I don't mean this as a personal attack and I still respect you. (I'm using a bit of a rhetorical garden path here so read this email in full to get the point...)

This paper is full of serious issues:

1. The abstract presents results which don't appear anywhere in the paper.
2. The definitions in section 1 contradict the definitions in sections 6 and 7.
3. There are several missing citations and figure references throughout.
4. The definition of algorithm 1 includes a variable p which is never defined, and thus doesn't make any sense.
5. On page 2, the second paragraph is plagiarized directly from another paper by a different author, word-for-word.
6. The discussion section says multiple things about the results that just aren't true, as visible in figure 2.
7. The p values in the results are pretty much impossible given the numbers in table 1.
8. Exact p values are not reported. No multiple-comparisons correction was made despite reporting 9+ p values.
9. Nobody thought that descriptors weren't a critical factor in MAP-Elites design, so this paper is justifying itself by pointing at a straw man.
10. One of the citations is entirely hallucinated.
11. There are several good studies of the effects of behavior-descriptor design on MAP-Elites coverage; this paper just doesn't cite them.
12. The plan lengths and pushes in figure 3 are not correct for some cases. 
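
For reference on issue 8, a multiple-comparisons correction need not be elaborate. Below is a minimal sketch of the Holm-Bonferroni step-down procedure in plain Python. The function name `holm_bonferroni` and the nine p values are hypothetical illustrations, not taken from the paper or its code archive.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Holm-Bonferroni step-down: sort the p values in ascending order and
    compare the k-th smallest (k = 0, 1, ...) against alpha / (m - k),
    stopping at the first comparison that fails.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p value is also retained
    return reject


# Hypothetical p values, for illustration only.
example_p = [0.001, 0.004, 0.03, 0.04, 0.045, 0.2, 0.5, 0.012, 0.049]
uncorrected = sum(p < 0.05 for p in example_p)
corrected = sum(holm_bonferroni(example_p))
print(uncorrected, corrected)  # prints: 7 2
```

With these made-up numbers, seven of the nine comparisons look significant at 0.05 before correction and only two survive it, which is the kind of gap a reader cannot assess when exact p values are not reported.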

The code archive has serious issues as well:

13. The "research log" file is full of made-up and contradictory statements. It doesn't match what's in the paper or itself (exactly what you'd expect from an LLM of course).
14. Bugs in the analysis routines mean that the data doesn't even correctly reflect the results of simulation.
15. Statistical functions are applied improperly, meaning that the reported p values are incorrect.
16. The code generates images of elites in a completely different format from the images shown in figure 3 of the paper. Images in figure 3 are likely just made-up.
17. The code isn't able to generate images like the ones in figure 3 with the ME-SKILL metrics.
18. There's a bug in the A* implementation that causes it to sometimes return incorrect path lengths, and in extreme cases it can "find" a path where there is none.
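
A bug like the one described in issue 18 is the sort of thing an independent cross-check can surface. The sketch below assumes a 4-connected grid given as a list of strings with `#` marking walls. It validates a returned path cell by cell and compares its length against a plain breadth-first search. The names `is_valid_path` and `bfs_shortest_length` and the grid layout are hypothetical and do not come from the code archive.

```python
from collections import deque


def is_valid_path(grid, path, start, goal):
    """True if path runs from start to goal through free, adjacent cells."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    # Consecutive cells must be exactly one step apart (no diagonals).
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


def bfs_shortest_length(grid, start, goal):
    """Number of steps on a shortest path, or None if goal is unreachable.

    Assumes start itself is a free cell.
    """
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# Hypothetical 3x3 level: the only route goes around the walls.
level = ["..#",
         ".#.",
         "..."]
candidate = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(is_valid_path(level, candidate, (0, 0), (2, 2)))  # True
print(bfs_shortest_length(level, (0, 0), (2, 2)))       # 4
```

Running both checks over many random levels, and flagging any case where the A* result is invalid or longer than the breadth-first answer, would catch the two failure modes named in the issue. It would not prove the rest of the pipeline correct.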

Given these issues, some of which are of course not the kind of thing you'd expect a normal-depth review to spot, but many of which are, I think this paper is an easy reject (and possibly a "please don't submit work of this quality here again").

Now, here's the catch:

THE ISSUES ABOVE ARE NOT ALL REAL ONES.

Rather than actually take the time to scrutinize things in minute detail (I certainly don't have that kind of time), I just made a cursory inspection, found a couple of issues, and then made up a bunch more. But the made-up issues are all things that one would *expect* agent-generated results to include some of the time. It's not my job as a reviewer to check for these, but it *definitely* is your job as the person submitting this to check for them all, and others besides, quite carefully. *Some* of these can of course occur in human-generated code and can slip by review, but many of them are relatively unique to LLM-generated stuff, and the normal advisor-advisee interactions around a paper, like reviewing demos, should catch a lot. My point is that if someone using AI actually did their due diligence on this kind of thing, rather than trying to outsource that work to reviewers and/or a mailing list, they would not be able to submit hundreds of papers to FDG per cycle. I'd imagine in fact that such a process would slow down, not speed up, quality research output, if one cared about quality. Actually carefully reviewing all the machine-generated statistics and A* code would be a huge chore, for example, but if you don't do it, you'll end up submitting a paper with the exact fatal issues I've pointed out here sooner or later. Our community should absolutely sanction someone who submits duplicitous work like this at high volume, and should strongly reject duplicitous work at low volume such that the authors understand they need to take real responsibility for basic academic integrity if they want to submit here (Issue #10 is real, for example, and is completely unacceptable; there is no article entitled "Procedural generation of Sokoban levels using cellular automata" in the proceedings of the 2018 SBGames conference).

For those keeping track:

Issues 1-4 are made-up (I think). Issue 5 is made-up, I hope (detecting this kind of plagiarism is very difficult even for the author of the paper, but we know that it does happen with LLMs; intent is not exculpatory for plagiarism). Issues 6-10 are real issues, and 6-9 should be caught by standard reviewing scrutiny. Issues 6, 7, and 10 alone each should lead to rejection, and if this were submitted as student work in a class I was teaching, each would be serious academic dishonesty. I have no idea about issue 11, but were I submitting this I'd definitely re-do the entire literature search myself to make sure. Issue 12 is made up but might well be real. This is the kind of thing an LLM will gladly hallucinate and it's a horrible pain to check. The author might be able to head this one off by manually running the code to generate the figure themselves, rather than relying on an AI-generated image.

Issue 13 is real (not surprising) but issues 14-18 are made-up. They're all quite realistic things that might happen and which are more likely in LLM-generated output than code written by humans. Checking for them with sufficient care is at least as much work as writing the code for this from scratch. Checking for them in a cursory way and thus letting some slip by sometimes is a lot less work. These are not the kinds of things a reviewer is going to be able to catch. I would be shocked if there aren't at least a few actual bugs in the code, but I don't actually have the time/inclination to try to get it running and assess their severity. As I've indicated, there are a range of quite subtle bugs that could be present which would make the results essentially fabricated (e.g., do you know exactly how to use the Python statistics functions the code is using correctly and did you carefully review all the ways it used them?).

My point here is that anyone who actually used LLMs or agents to generate whole papers and "do research" like this and submits substantially more papers than usual is acting in bad faith. At the very least, they're trying to offload a tremendous amount of tedious error-checking work to reviewers which should be their responsibility. Julian is correct that merely rejecting such papers if submitted en masse is not enough, as you allow low-effort "researchers" to effectively DDoS our community. At least given the quality present here, a simple policy like "submitting a paper with [hallucinated references/substantial misrepresentation of non-significant results as significant] leads to a ban on submitting to the next FDG" might work in the short term. A policy of "we can only accept N papers per human author" is probably also prudent if we don't have that already, in case someone actually tries to submit 100 LLM-generated papers.

At a larger level, the bottleneck for producing research is human understanding of code. This paper is research-shaped, but doesn't come with human understanding, and Julian would not have started this thread in the same way had he passed through the bottleneck to understand this paper, since he would have had to spend considerable human effort addressing its flaws, and would have realized that the "productivity boost" here is likely marginal or negative. If we allow machine-generated submissions that their author does not understand, we aren't publishing research any more. It's a good time to discuss *how* to enforce that, but I don't think it's very difficult to understand *what* needs enforcing.

There's an entirely separate email I don't have time to write here, about the ethics of this situation. Y'all are literally burning up our children's future when you lean into generative AI, and you're stealing from this community as you do so. I'm someone who, despite having dipped my toe into deep learning research at one point, does NOT use LLMs or agents (including to write emails). I do this not because the tech isn't fascinating and I'd love to play with it (it is and I would) but because the ethical implications of me diving in are too horrible to accept. De facto, there is a split in this community between people like me and people who choose to ignore the ethical issues, and I accept that that will likely lead to a schism if the AI bubble doesn't burst soon enough. I've always been a rather marginal member of this community, and I'm not sure whether those who like me eschew AI are in the majority or minority here, but regardless of what the LLM-embracers do with their time, there will always be a core community of people who reject technologies with such huge adverse effects who will continue to do games research and hang out with each other.

-Peter Mawhorter

Antonios Liapis

May 25, 2026, 8:22 AM
to procedur...@googlegroups.com, Peter Mawhorter, cig...@googlegroups.com

Hey Peter,

Thanks for the thorough e-mail. The hidden labor of checking any AI output (let alone a full paper with code and results) for discrepancies big and small is always understated. There are even more important issues of labor re: our role as researchers (our function, our expertise, our paychecks) that I don't want to bring up because I don't want to write a long e-mail like yours and I will probably upset myself.

I just wanted to reply to this to say that the games research community was/is/will be always niche, and it's always a very personal and tight-knit relationship that we have with each other especially for this reason. If LLMs or anything else divides us, then so be it; we can keep forging smaller and more niche communities, with an emphasis on the community part. And I'd be happy to hang out with you in one of them.

A. Liapis

Christian Guckelsberger

May 25, 2026, 9:26 AM
to Procedural Content Generation
Thanks for this interesting discussion, everyone! 

Too often I find us in "reactive mode" ("oh shit, this is a fact now, what do we do?") so I'd like to suggest a more speculative, at least parallel line of thought (sorry, I know this becomes less and less tractable). A lot of the previous discussion concentrated on weaknesses of this paper (fictional or not). Instead, let's assume we had an AI system that could write an actually innovative and low-flaws paper (e.g. flaws comparable to best-paper award contributions - I know, issues also here). Let's also assume that this required little human ingenuity to kick-off (e.g. analyse the last 10 years of proceedings in x,y,z, identify an important research gap, etc.) -> this is what might come for us "long-term" if we "raised the bar" now, as Julian suggests. Then the question becomes if we still want to support human-made/driven science (echoing Julian's other blog post / NeurIPS panel on AI x science, I wholeheartedly agree that many people might lose a lot of meaning in their lives - a tough experiment given uncertain, relative gains).

The previous questions would also matter here, but now they're placed in a more far-sighted context. Still relevant, for instance, is detection/filtering to establish the bar of human involvement we wish to see. This could be aligned with educational and personal growth objectives -> what forms of involvement contribute most to that?

The solution to remove anonymity imo is tricky: it (a) invites bias, also against newcomers to the field (in addition to affiliation, inferred nationality, gender, etc.) and (b) assumes that certain people can be trusted under partial information (this would have to be systematically reaffirmed). That reaffirmation is also tricky: just because we wrote a paper with AI wouldn't imply that we can't explain it / put a narrative around it as if we had written it. Why would we do that? This is where it connects to the academic KPIs discussion. But here I also want to raise the possibility that some people might want to be regarded as the author of a piece not for career goals, but for identity formation. Changing the KPIs might not entirely alleviate the issue.

Christian

Peter Mawhorter

May 25, 2026, 11:38 AM
to procedur...@googlegroups.com


On Mon, May 25, 2026, 09:26 Christian Guckelsberger <christian.g...@gmail.com> wrote:
Thanks for this interesting discussion, everyone! 

Too often I find us in "reactive mode" ("oh shit, this is a fact now, what do we do?") so I'd like to suggest a more speculative, at least parallel line of thought (sorry, I know this becomes less and less tractable). A lot of the previous discussion concentrated on weaknesses of this paper (fictional or not). Instead, let's assume we had an AI system that could write an actually innovative and low-flaws paper (e.g. flaws comparable to best-paper award contributions - I know, issues also here). Let's also assume that this required little human ingenuity to kick-off (e.g. analyse the last 10 years of proceedings in x,y,z, identify an important research gap, etc.) -> this is what might come for us "long-term" if we "raised the bar" now, as Julian suggests.

Part of what I was trying to get at is that even assuming such a system, given that it won't be trustworthy (at least, none of the current generative AI techniques lead to something trustworthy), the work that the human "author" has to put in to verify the generated output is roughly on par with the effort needed to do that research themselves in the first place, *no matter how quickly the system can produce an unverified paper*, and *no matter how good the quality of the results get*. In fact, verifying correctness of something that's 99% correct is in many ways harder than verifying correctness of something that's 95% correct. Note that when I say "99% correct" here I mean as a percentage of content-to-be-reviewed. The 1% of mistakes can, as in this case, still render the entire thing worse than useless for its purported purpose.

If you want to assume that the system will be trustworthy enough that no human has to check the output, that's a completely different problem, but not one that's likely to come about via the current wave of generative AI. I think if we're back to talking in science fiction terms, why not assume we can also simply use an equally trustworthy AI to review papers?

Even in that fictional scenario, part of the purpose of science is to generate human understanding of new/complex concepts. That will still be an interesting venture even in a singularity world where "machine understanding" vastly outstrips human understanding (to be clear, the world we are living in has negligible amounts of "machine understanding" going on and there are no signs that's going to change soon, despite all the lies and hype). In a world where trustworthy research papers can be generated by machines at a rate greater than humans can review them, it seems most productive to split the notion of a conference into one venue in which machines work with each other to do research and accept the odd human contribution when relevant, and another venue in which humans still share understanding with each other, even if they are sometimes reinventing things the robots already discovered, with a third venue for distilling and delivering machine insights back to humans. But again, this is science fiction in no way relevant to our current situation.

In our current situation with fundamentally untrustworthy generative AI, each paper needs a human who understands and has verified its ideas. The human effort required for that is not very different whether generative AI is used or not, assuming those using generative AI actually check their work in appropriate depth. We aren't at risk of a flood of papers unless people start submitting lots of insufficiently-verified stuff, in which case hopefully the normal review process can shut that down with minimal policy addenda to address bad-faith submissions.

-Peter Mawhorter