--
You received this message because you are subscribed to the Google Groups "Computational Intelligence and Games" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cigames+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cigames/e0a46d32-a5bb-4638-ad7b-79c6b4fca530n%40googlegroups.com.
Hi Julian,
Interesting how our first response is to think about how the prompt could have been improved (perhaps by choosing different authors) or whether a more agentic approach would have helped. Have you seen this: https://agents4science.stanford.edu ?
The same dilemma also applies to our students. Teachers are increasingly turning to written exams and oral exams where possible. It will become harder to know what is what, especially if we now basically allow AI to proofread English in documents we have actually written, or allow tools like Claude Code to implement our ideas. In the process, the AI may suggest improvements to both ideas and text, which we then accept. At what point is the work co-authored by AI?
I think the approach used for EU grants is sensible: you are allowed to use AI to help write the proposal, but you remain responsible for the content. My impression is that our university is taking a similar position.
Working with AI is going to speed up research. Some of the work traditionally done by a “vanilla” PhD student may be done faster, and perhaps better, by AI, with the human researcher in the driver’s seat. We will see more headlines like this: https://today.ucsd.edu/story/nine-breakthroughs-made-possible-by-ai . So we need to train our PhD students to become AI drivers: to ask better questions, suggest new ideas to experiment with, and judge which directions are worth pursuing. The AI does not replace the educational function of struggling with ideas, making mistakes, and learning what counts as a good question.
Conferences are a source of revenue for publishers and organizers, but for researchers they are primarily places to meet and discuss ideas. Unfortunately, some universities expect research output in the form of papers, and this encourages mass production. AI conferences for agents seem obvious to me, but they may end up looking like this: https://www.moltbook.com . The citation system may also need to change, for example, distinguishing between agent and human citations.
There are conferences where you simply submit an abstract and then give a talk. This may be the way forward: even if AI helped with the abstract, the author still has to stand up, explain the work, and defend the ideas. Similarly, oral exams and talks may help verify that there is a human who understands the work, but they do not fully prove where the ideas came from. The real challenge is not only detecting AI use, but preserving the development of human scientific judgement in a setting where AI can increasingly produce convincing research outputs.
A natural next step would be for universities to put less weight on paper output, and more on the quality of the work. Of course, this creates a problem for administrators, since quality is subjective and much harder to quantify than the number of papers. Journal prestige is often used as a proxy for quality, but this only moves the problem rather than solving it. Citation counts are also imperfect, since they are biased by field, community size, and citation practices.
Cheers,
Tom
P.S. Proofread using AI, of course.
Hey Peter,
Thanks for the thorough e-mail. The hidden labor of checking any AI output (let alone a full paper with code and results) for discrepancies big and small is always understated. There are even more important issues of labor re: our role as researchers (our function, our expertise, our paychecks) that I don't want to bring up because I don't want to write a long e-mail like yours and I will probably upset myself.
I just wanted to reply to this to say that the games research community was/is/will be always niche, and it's always a very personal and tight-knit relationship that we have with each other especially for this reason. If LLMs or anything else divides us, then so be it; we can keep forging smaller and more niche communities, with an emphasis on the community part. And I'd be happy to hang out with you in one of them.
A. Liapis
Julian: I know you well enough to make this criticism somewhat sharply; understand that I don't mean this as a personal attack and I still respect you. (I'm using a bit of a rhetorical garden path here so read this email in full to get the point...)
This paper is full of serious issues:
1. The abstract presents results which don't appear anywhere in the paper.
2. The definitions in section 1 contradict the definitions in sections 6 and 7.
3. There are several missing citations and figure references throughout.
4. The definition of algorithm 1 includes a variable p which is never defined, and thus doesn't make any sense.
5. On page 2, the second paragraph is plagiarized directly from another paper by a different author, word-for-word.
6. The discussion section says multiple things about the results that just aren't true, as visible in figure 2.
7. The p values in the results are pretty much impossible given the numbers in table 1.
8. Exact p values are not reported. No multiple-comparisons correction was made despite reporting 9+ p values.
9. Nobody thought that descriptors weren't a critical factor in MAP-Elites design, so this paper is justifying itself by pointing at a straw man.
10. One of the citations is entirely hallucinated.
11. There are several good studies of the effects of behavior-descriptor design on MAP-Elites coverage; this paper just doesn't cite them.
12. The plan lengths and pushes in figure 3 are not correct for some cases.
The code archive has serious issues as well:
13. The "research log" file is full of made-up and contradictory statements. It doesn't match what's in the paper or itself (exactly what you'd expect from an LLM of course).
14. Bugs in the analysis routines mean that the data doesn't even correctly reflect the results of simulation.
15. Statistical functions are applied improperly, meaning that the reported p values are incorrect.
16. The code generates images of elites in a completely different format from the images shown in figure 3 of the paper. Images in figure 3 are likely just made-up.
17. The code isn't able to generate images like the ones in figure 3 with the ME-SKILL metrics.
18. There's a bug in the A* implementation that causes it to sometimes return incorrect path lengths, and in extreme cases it can "find" a path where there is none.
Given these issues, some of which are of course not the kind of thing you'd expect a normal-depth review to spot, but many of which are, I think this paper is an easy reject (and possibly a "please don't submit work of this quality here again").
Now, here's the catch:
THE ISSUES ABOVE ARE NOT ALL REAL ONES.
Rather than actually take the time to scrutinize things in minute detail (I certainly don't have that kind of time), I just made a cursory inspection, found a couple of issues, and then made up a bunch more. But the made-up issues are all things that one would *expect* agent-generated results to include some of the time. It's not my job as a reviewer to check for these, but it *definitely* is your job as the person submitting this to check for them all, and others besides, quite carefully. *Some* of these can of course occur in human-generated code and can slip by review, but many of them are relatively unique to LLM-generated stuff, and the normal advisor-advisee interactions around a paper like reviewing demos should catch a lot. My point is that if someone using AI actually did their due diligence on this kind of thing, rather than trying to outsource that work to reviewers and/or a mailing list, they would not be able to submit hundreds of papers to FDG per cycle. I'd imagine in fact that such a process would slow down, not speed up, quality research output, if one cared about quality. Actually carefully reviewing all the machine-generated statistics and A* code would be a huge chore, for example, but if you don't do it, you'll end up submitting a paper with the exact fatal issues I've pointed out here sooner or later. Our community should absolutely sanction someone who submits duplicitous work like this at high volume, and should strongly reject duplicitous work at low volume such that the authors understand they need to take real responsibility for basic academic integrity if they want to submit here (Issue #10 is real, for example, and is completely unacceptable; there is no article entitled "Procedural generation of Sokoban levels using cellular automata" in the proceedings of the 2018 SBGames conference).
For those keeping track:
Issues 1-4 are made-up (I think). Issue 5 is made-up, I hope (detecting this kind of plagiarism is very difficult even for the author of the paper, but we know that it does happen with LLMs; intent is not exculpatory for plagiarism). Issues 6-10 are real issues, and 6-9 should be caught by standard reviewing scrutiny. Issues 6, 7, and 10 should each lead to rejection on their own, and if this were submitted as student work in a class I was teaching, each would count as serious academic dishonesty. Issue 11 I have no idea, but were I submitting this I'd definitely re-do the entire literature search myself to make sure. Issue #12 is made-up but might well be real. This is the kind of thing an LLM will gladly hallucinate and it's a horrible pain to check. The author might be able to head this one off by manually running the code to generate the figure themselves, rather than relying on an AI-generated image.
Issue 13 is real (not surprising) but issues 14-18 are made-up. They're all quite realistic things that might happen and which are more likely in LLM-generated output than code written by humans. Checking for them with sufficient care is at least as much work as writing the code for this from scratch. Checking for them in a cursory way and thus letting some slip by sometimes is a lot less work. These are not the kinds of things a reviewer is going to be able to catch. I would be shocked if there aren't at least a few actual bugs in the code, but I don't actually have the time/inclination to try to get it running and assess their severity. As I've indicated, there are a range of quite subtle bugs that could be present which would make the results essentially fabricated (e.g., do you know exactly how to use the Python statistics functions the code is using correctly and did you carefully review all the ways it used them?).
My point here is that anyone who actually uses LLMs or agents to generate whole papers and "do research" like this and submits substantially more papers than usual is acting in bad faith. At the very least, they're trying to offload a tremendous amount of tedious error-checking work to reviewers which should be their responsibility. Julian is correct that merely rejecting such papers if submitted en masse is not enough, since that still allows low-effort "researchers" to effectively DDoS our community. At least given the quality present here, a simple policy like "submitting a paper with [hallucinated references/substantial misrepresentation of non-significant results as significant] leads to a ban on submitting to the next FDG" might work in the short term. A policy of "we can only accept N papers per human author" is probably also prudent if we don't have that already, in case someone actually tries to submit 100 LLM-generated papers.
At a larger level, the bottleneck for producing research is human understanding of code. This paper is research-shaped, but doesn't come with human understanding, and Julian would not have started this thread in the same way had he passed through the bottleneck to understand this paper, since he would have had to spend considerable human effort addressing its flaws, and would have realized that the "productivity boost" here is likely marginal or negative. If we allow machine-generated submissions that their author does not understand, we aren't publishing research any more. It's a good time to discuss *how* to enforce that, but I don't think it's very difficult to understand *what* needs enforcing.
There's an entirely separate email I don't have time to write here, about the ethics of this situation. Y'all are literally burning up our children's future when you lean into generative AI, and you're stealing from this community as you do so. I'm someone who, despite having dipped my toe into deep learning research at one point, does NOT use LLMs or agents (including to write emails). I do this not because the tech isn't fascinating and I'd love to play with it (it is and I would) but because the ethical implications of me diving in are too horrible to accept. De facto, there is a split in this community between people like me and people who choose to ignore the ethical issues, and I accept that that will likely lead to a schism if the AI bubble doesn't burst soon enough. I've always been a rather marginal member of this community, and I'm not sure whether those who like me eschew AI are in the majority or minority here, but regardless of what the LLM-embracers do with their time, there will always be a core community of people who reject technologies with such huge adverse effects who will continue to do games research and hang out with each other.
-Peter Mawhorter
----
---
You received this message because you are subscribed to the Google Groups "Procedural Content Generation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to proceduralcont...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/proceduralcontent/197e1f37-5494-4651-aa2e-d0913fff0259%40Spark.