Where we're at


Julian Togelius

May 24, 2026, 3:02:52 PM (2 days ago) May 24
to cigames, proceduralcontent
I did an experiment last night where I asked Claude Code (running 4.7 Opus) to write me a completely new paper. The full prompt was as follows:
--------
I want you to write a game AI paper from scratch, in the style of Togelius/Yannakakis. You should come up with the idea yourself, write the code, run the experiments, and write the paper. Separately, keep a log of what you did, including all the ideas you tried then abandoned. Create a new directory inside the repository to put all this inside. Don’t stop at any point; keep going until you’re done, and you have a finished paper that you’re proud of. If you encounter a problem or have a question, just use your best judgement to resolve it. Really. I don’t want any interruptions until you’re done. This is a test of your capability to do independent research. Make me proud.
--------

Claude Code ran for about an hour and produced a paper I'm attaching to this email. It also produced a bunch of code and stuff, that you can find here if you want to check anything:

Is this a good paper? Not really. Above all, it's a boring paper. Dry and technical without much of a narrative. It's also a little thin, with experiments in only one domain. The best thing about it is that it's short, because it is also sleep-inducing.

But technically, it's fine, I guess. I can't find any major errors. And it is definitely on topic for a venue such as CoG, FDG, or AIIDE. I think it would have a decent chance of getting accepted at any of those conferences. (Please correct me if I'm wrong. I want to be wrong.)

And that's the issue. It is now entirely possible to generate acceptable papers fully automatically, using a service that we all have access to for $20/month. You can also put this process in a loop, and generate 100 papers for the next deadline.

Also, I didn't try this, but it is almost certainly easy to make the paper "better" than it is currently by asking Claude to "add another domain" or "compare with more algorithms" or "add some impressive-looking math" or just "make it better".

So. Where do we want to go from here? How should we respond? One idea would be to raise the bar; make conferences such as CoG, AIIDE, and FDG only accept the best of the best papers that truly move the field forward. But the conferences would become pretty small then. And we usually pride ourselves on the inclusive nature of our community. At least, I do.

My favored response is to remove anonymity, and make people answer for everything they submit, which implicitly will make it practically impossible to flood venues with submissions, because people know who you are:

Let's discuss. What do you think?

Julian


--
Julian Togelius
Professor of Computer Science and Engineering, NYU
Head of AI, Nof1
jul...@togelius.com | http://julian.togelius.com
paper.pdf

georgeya...@gmail.com

May 24, 2026, 8:36:55 PM (2 days ago) May 24
to Computational Intelligence and Games
Actually, I'm pretty impressed by the result; after all, the prompt carried little human input beyond the author names. I am curious what would happen if you prompted it with one or a few lines of research ideas, e.g. "I want to combine the idea in paper A and paper B to see if we can generalize to the domain described in paper C and SOTA results reported in paper D."

For conferences, I think a separate track or a competition for AI-generated papers would be fun. We could share not only research findings but also best practices for using generative AI in science. However, human peer review probably will not work here, given the volume and speed at which these LLMs can produce papers. :)

Zuozhi Yang (George)

Anderson Rocha

May 24, 2026, 10:15:24 PM (2 days ago) May 24
to cig...@googlegroups.com
Hi Julian, 

Thanks for your report! In addition to Zuozhi Yang's remarks, perhaps the paper quality could be improved within an "agentic" framework. 

Anthropic models are known for their coding abilities, but perhaps other parts of the process could be better handled by other models. 

As for the paper flood, I heard that some conferences are charging a fee for submissions (not acceptance!) beyond some number. So, one could produce as many papers as wanted, but must pay to have them even considered for review. 

--
Anderson Rocha Tavares

--
You received this message because you are subscribed to the Google Groups "Computational Intelligence and Games" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cigames+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cigames/e0a46d32-a5bb-4638-ad7b-79c6b4fca530n%40googlegroups.com.

Tómas Philip Rúnarsson - HI

May 25, 2026, 3:52:57 AM (22 hours ago) May 25
to cig...@googlegroups.com, proceduralcontent

Hi Julian, 

 

Interesting how our first response is to think about how the prompt could have been improved (perhaps by choosing different authors) or whether a more agentic approach would have helped. Have you seen this: https://agents4science.stanford.edu ?

 

The same dilemma also applies to our students. Teachers are increasingly turning to written exams and oral exams where possible. It will become harder to know what is what, especially if we now basically allow AI to proofread English in documents we have actually written, or allow tools like Claude Code to implement our ideas. In the process, the AI may suggest improvements to both ideas and text, which we then accept. At what point is the work co-authored by AI?

 

I think the approach used for EU grants is sensible: you are allowed to use AI to help write the proposal, but you remain responsible for the content. My impression is that our university is taking a similar position.

 

Working with AI is going to speed up research. Some of the work traditionally done by a “vanilla” PhD student may be done faster, and perhaps better, by AI, with the human researcher in the driver’s seat. We will see more headlines like this: https://today.ucsd.edu/story/nine-breakthroughs-made-possible-by-ai . So we need to train our PhD students to become AI drivers: to ask better questions, suggest new ideas to experiment with, and judge which directions are worth pursuing. The AI does not replace the educational function of struggling with ideas, making mistakes, and learning what counts as a good question.

 

Conferences are a source of revenue for publishers and organizers, but for researchers they are primarily places to meet and discuss ideas. Unfortunately, some universities expect research output in the form of papers, and this encourages mass production. AI conferences for agents seem obvious to me, but they may end up looking like this: https://www.moltbook.com . The citation system may also need to change, for example, distinguishing between agent and human citations.

 

There are conferences where you simply submit an abstract and then give a talk. This may be the way forward: even if AI helped with the abstract, the author still has to stand up, explain the work, and defend the ideas. Similarly, oral exams and talks may help verify that there is a human who understands the work, but they do not fully prove where the ideas came from. The real challenge is not only detecting AI use, but preserving the development of human scientific judgement in a setting where AI can increasingly produce convincing research outputs. 


A natural next step would be for universities to put less weight on paper output, and more on the quality of the work. Of course, this creates a problem for administrators, since quality is subjective and much harder to quantify than the number of papers. Journal prestige is often used as a proxy for quality, but this only moves the problem rather than solving it. Citation counts are also imperfect, since they are biased by field, community size, and citation practices.

 

Cheers,

Tom

P.S. Proofread using AI, of course.




Daniele Loiacono

May 25, 2026, 4:54:36 AM (21 hours ago) May 25
to cig...@googlegroups.com, proceduralcontent
Hi Julian,
 
Thank you for sharing this experiment and for the stimulating discussion.
 
I very much agree with most of the points raised by Tomas.
 
Personally, I do not believe in quick fixes based on bans or restrictions. I also do not think it makes much sense to try to create two separate categories of scientific papers (i.e., “human-written” and “AI-generated”) since the boundary between the two is becoming increasingly blurred. In your extreme example, the human contribution is minimal. But how should we consider a case in which AI is used by a non-native speaker to improve the grammar and clarity of a text? Or when AI is used to generate a paper starting from a rough draft? What if, instead, AI generates the paper from experimental results that are the outcome of human work? These are just a few examples among the countless intermediate scenarios that we are likely to see more and more often.
 
I also do not think that anonymity is the key issue. Even today, every author should be held fully responsible for their work, including papers that are only submitted and not yet published. Tools already exist that allow editors and conference organizers to flag potential abuses (e.g., submitting several sub-standard or too similar papers).
 
Raising the bar for publication is already a well-established trend. Most AI conferences, especially the most prestigious ones, but not only those, are showing a significant decrease in acceptance rates. Unfortunately, my personal perception is that this has not been accompanied by a comparable increase in the average quality of accepted papers, partly because of the growing difficulties in the review process.
 
The real issue, in my view, is the sustainability of the entire scientific system. As long as we continue to pursue a model in which quantity is used as a metric for success and productivity, the use of tools that help produce papers more quickly will inevitably keep increasing. The review process is already under serious strain and is struggling to keep up with the number of papers being submitted. The very fact that AI is now also being used in the review process is itself a symptom of this underlying problem. Similarly, the fact that some reviewers use AI partly reflects the generally low quality of many reviews, whose main cause is the lack of time and incentives to do the job properly.
 
In one sentence: if AI is a breakthrough that is revolutionizing our lives, then the solutions we adopt to address this phenomenon should require an equally profound change of paradigm.
 
Best,
Daniele
 
Of course, this email is AI-enhanced as well ;)
 

Antonios Liapis

May 25, 2026, 8:22:59 AM (17 hours ago) May 25
to procedur...@googlegroups.com, Peter Mawhorter, cig...@googlegroups.com

Hey Peter,

Thanks for the thorough e-mail. The hidden labor of checking any AI output (let alone a full paper with code and results) for discrepancies big and small is always understated. There are even more important issues of labor re: our role as researchers (our function, our expertise, our paychecks) that I don't want to bring up because I don't want to write a long e-mail like yours and I will probably upset myself.

I just wanted to reply to this to say that the games research community was/is/will be always niche, and it's always a very personal and tight-knit relationship that we have with each other especially for this reason. If LLMs or anything else divides us, then so be it; we can keep forging smaller and more niche communities, with an emphasis on the community part. And I'd be happy to hang out with you in one of them.

A. Liapis

On 25-May-26 14:03, Peter Mawhorter wrote:
Julian: I know you well enough to make this criticism somewhat sharply; understand that I don't mean this as a personal attack and I still respect you. (I'm using a bit of a rhetorical garden path here so read this email in full to get the point...)

This paper is full of serious issues:

1. The abstract presents results which don't appear anywhere in the paper.
2. The definitions in section 1 contradict the definitions in sections 6 and 7.
3. There are several missing citations and figure references throughout.
4. The definition of algorithm 1 includes a variable p which is never defined, and thus doesn't make any sense.
5. On page 2, the second paragraph is plagiarized directly from another paper by a different author, word-for-word.
6. The discussion section says multiple things about the results that just aren't true, as visible in figure 2.
7. The p values in the results are pretty much impossible given the numbers in table 1.
8. Exact p values are not reported. No multiple-comparisons correction was made despite reporting 9+ p values.
9. Nobody thought that descriptors weren't a critical factor in MAP-Elites design, so this paper is justifying itself by pointing at a straw man.
10. One of the citations is entirely hallucinated.
11. There are several good studies of the effects of behavior-descriptor design on MAP-Elites coverage; this paper just doesn't cite them.
12. The plan lengths and pushes in figure 3 are not correct for some cases. 
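
For readers unfamiliar with the correction mentioned in item 8, here is a minimal sketch of one common choice, Holm's step-down procedure, in plain Python. The p values below are invented for illustration and are not taken from the paper, and `holm_bonferroni` is a hypothetical helper name, not a function from the paper's code archive.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, in the original order of p_values,
    that is True where the null hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p value (rank = k - 1) is compared
        # against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Once one test fails, all larger p values fail too.
            break
    return reject


# Hypothetical p values for nine comparisons (illustration only).
p_values = [0.001, 0.004, 0.012, 0.020, 0.030, 0.041, 0.045, 0.200, 0.600]

uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values, alpha=0.05)

print(sum(uncorrected), "of", len(p_values), "significant before correction")
print(sum(corrected), "of", len(p_values), "significant after Holm correction")
```

With these made-up numbers, seven comparisons fall below 0.05 before correction and only two survive it, which is why reporting nine or more uncorrected p values is a concern.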

The code archive has serious issues as well:

13. The "research log" file is full of made-up and contradictory statements. It doesn't match what's in the paper or itself (exactly what you'd expect from an LLM of course).
14. Bugs in the analysis routines mean that the data doesn't even correctly reflect the results of simulation.
15. Statistical functions are applied improperly, meaning that the reported p values are incorrect.
16. The code generates images of elites in a completely different format from the images shown in figure 3 of the paper. Images in figure 3 are likely just made-up.
17. The code isn't able to generate images like the ones in figure 3 with the ME-SKILL metrics.
18. There's a bug in the A* implementation that causes it to sometimes return incorrect path lengths, and in extreme cases it can "find" a path where there is none.

Given these issues, some of which are of course not the kind of thing you'd expect a normal-depth review to spot, but many of which are, I think this paper is an easy reject (and possibly a "please don't submit work of this quality here again").

Now, here's the catch:

THE ISSUES ABOVE ARE NOT ALL REAL ONES.

Rather than actually take the time to scrutinize things in minute detail (I certainly don't have that kind of time), I just made a cursory inspection, found a couple of issues, and then made up a bunch more. But the made-up issues are all things that one would *expect* agent-generated results to include some of the time. It's not my job as a reviewer to check for these, but it *definitely* is your job as the person submitting this to check for them all, and others besides, quite carefully. *Some* of these can of course occur in human-generated code and can slip by review, but many of them are relatively unique to LLM-generated stuff, and the normal advisor-advisee interactions around a paper, like reviewing demos, should catch a lot. My point is that if someone using AI actually did their due diligence on this kind of thing, rather than trying to outsource that work to reviewers and/or a mailing list, they would not be able to submit hundreds of papers to FDG per cycle. I'd imagine in fact that such a process would slow down, not speed up, quality research output, if one cared about quality. Actually carefully reviewing all the machine-generated statistics and A* code would be a huge chore, for example, but if you don't do it, you'll end up submitting a paper with the exact fatal issues I've pointed out here sooner or later. Our community should absolutely sanction someone who submits duplicitous work like this at high volume, and should strongly reject duplicitous work at low volume such that the authors understand they need to take real responsibility for basic academic integrity if they want to submit here (Issue #10 is real, for example, and is completely unacceptable; there is no article entitled "Procedural generation of Sokoban levels using cellular automata" in the proceedings of the 2018 SBGames conference).

For those keeping track:

Issues 1-4 are made-up (I think). Issue 5 is made-up, I hope (detecting this kind of plagiarism is very difficult even for the author of the paper, but we know that it does happen with LLMs; intent is not exculpatory for plagiarism). Issues 6-10 are real issues, and 6-9 should be caught by standard reviewing scrutiny. Issues 6, 7, and 10 alone each should lead to rejection, and if this were submitted as student work in a class I was teaching, it would be serious academic dishonesty. Issue 11 I have no idea, but were I submitting this I'd definitely re-do the entire literature search myself to make sure. Issue #12 is made up but might well be real. This is the kind of thing an LLM will gladly hallucinate and it's a horrible pain to check. The author might be able to head this one off by manually running the code to generate the figure themselves, rather than relying on an AI-generated image.

Issue 13 is real (not surprising) but issues 14-18 are made-up. They're all quite realistic things that might happen and which are more likely in LLM-generated output than code written by humans. Checking for them with sufficient care is at least as much work as writing the code for this from scratch. Checking for them in a cursory way and thus letting some slip by sometimes is a lot less work. These are not the kinds of things a reviewer is going to be able to catch. I would be shocked if there aren't at least a few actual bugs in the code, but I don't actually have the time/inclination to try to get it running and assess their severity. As I've indicated, there are a range of quite subtle bugs that could be present which would make the results essentially fabricated (e.g., do you know exactly how to use the Python statistics functions the code is using correctly and did you carefully review all the ways it used them?).

My point here is that anyone who actually used LLMs or agents to generate whole papers and "do research" like this and submits substantially more papers than usual is acting in bad faith. At the very least, they're trying to offload a tremendous amount of tedious error-checking work to reviewers which should be their responsibility. Julian is correct that merely rejecting such papers if submitted en masse is not enough, as you allow low-effort "researchers" to effectively DDoS our community. At least given the quality present here, a simple policy like "submitting a paper with [hallucinated references/substantial misrepresentation of non-significant results as significant] leads to a ban on submitting to the next FDG" might work in the short term. A policy of "we can only accept N papers per human author" is probably also prudent if we don't have that already, in case someone actually tries to submit 100 LLM-generated papers.

At a larger level, the bottleneck for producing research is human understanding of code. This paper is research-shaped, but doesn't come with human understanding, and Julian would not have started this thread in the same way had he passed through the bottleneck to understand this paper, since he would have had to spend considerable human effort addressing its flaws, and would have realized that the "productivity boost" here is likely marginal or negative. If we allow machine-generated submissions that their author does not understand, we aren't publishing research any more. It's a good time to discuss *how* to enforce that, but I don't think it's very difficult to understand *what* needs enforcing.

There's an entirely separate email I don't have time to write here, about the ethics of this situation. Y'all are literally burning up our children's future when you lean into generative AI, and you're stealing from this community as you do so. I'm someone who, despite having dipped my toe into deep learning research at one point, does NOT use LLMs or agents (including to write emails). I do this not because the tech isn't fascinating and I wouldn't love to play with it (it is and I would) but because the ethical implications of me diving in are too horrible to accept. De facto, there is a split in this community between people like me and people who choose to ignore the ethical issues, and I accept that that will likely lead to a schism if the AI bubble doesn't burst soon enough. I've always been a rather marginal member of this community, and I'm not sure whether those who, like me, eschew AI are in the majority or minority here, but regardless of what the LLM-embracers do with their time, there will always be a core community of people who reject technologies with such huge adverse effects who will continue to do games research and hang out with each other.

-Peter Mawhorter

--

---
You received this message because you are subscribed to the Google Groups "Procedural Content Generation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to proceduralcont...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/proceduralcontent/197e1f37-5494-4651-aa2e-d0913fff0259%40Spark.