In June 2022, I bet a commenter $100 that AI would master image compositionality by June 2025. DALL-E2 had just come out, showcasing the potential of AI art. But it couldn’t follow complex instructions; its images only matched the “vibe” of the prompt. For example, here were some of its attempts at “a red sphere on a blue cube, with a yellow pyramid on the right, all on top of a green table”. At the time, I wrote:
Commenters objected that this was overly optimistic. AI was just a pattern-matching “stochastic parrot”. It would take a deep understanding of grammar to get a prompt exactly right, and that would require some entirely new paradigm beyond LLMs. For example, from Vitor:
So we made a bet!
Some image models of the time refused to draw humans, so we agreed that robots could stand in for humans in pictures that required them. In September 2022, I got some good results from Google Imagen and announced I had won the three-year bet in three months. Commenters yelled at me, saying that Imagen still hadn’t gotten them quite right and my victory declaration was premature. The argument blew up enough that Edwin Chen of Surge, an “RLHF and human LLM evaluation platform”, stepped in and asked his professional AI data labelling team. Their verdict was clear: the AI was bad and I was wrong. Rather than embarrass myself further, I agreed to wait out the full length of the bet and re-evaluate in June 2025. The bet is now over, and official judge Gwern agrees I’ve won. Before I gloat, let’s look at the images that got us here. AI Compositionality: A Three Year RetrospectiveImage Set 1: June 2022 When we first made the bet in June 2022, the best that an AI model could do on the five prompts was: You can see why people would be skeptical! In most images, the pieces are all there: astronauts, foxes, lipstick. But they’re combined in whatever way seems most “plausible” or “realistic”, rather than the way indicated by the prompt - so for example, the astronaut is wearing the lipstick, rather than the fox. Other times there are unrelated inexplicable failures, like the half-fox, half-astronaut abomination in panel #1. Here we get 0/5. Image Set 2: June 2022 Three months later, I declared premature victory when Google Imagen produced the following: I said it got the cat, llama, and basketball exactly right, meeting the necessary 3/5. Edwin and his evaluators disagreed. They granted success on the cat. But the llama didn’t really have a clear bell on its tail (the closest, #4, was more of a globe). And the final robot wasn’t much of a farmer, wasn’t in much of a cathedral, and the basketball was more orange than red. They granted me 1/5. Fine. Image Set 3: January 2024 One of the questions on the 2023 - 2024 ACX prediction contest was whether any AI would win the bet by the end of 2023. In order to resolve the question, Edwin and his Surge team returned to the image mines in January 2024. They checked DALL-E3 and Midjourney; I’m including only the pictures from DALL-E3, which did better. Here they are: These are of higher artistic quality, and they can finally generate humans (instead of just robots). But they still don’t win the bet. This time Edwin granted the cat and the farmer. But the stupid llama still didn’t have the bell on its tail, the #$%&ing raven still didn’t have the key in its mouth, and although the fox had lipstick in one picture (#2), the astronaut wasn’t exactly holding it. 2/5, one short of victory. On prediction markets, where users had given 62% probability that Edwin would grant me the win that year, reactions were outraged. “Are you kidding me?” asked one commenter. “Is Edwin Chen an asshole? Clearly he is,” said another. Image Set 4: September - December 2024 User askwho on the Bayesian Conspiracy Discord claimed that Google Imagen passed the test in September 2024 (he said Imagen 2, but based on the timing it may have been Imagen 3). But he didn’t post it publicly and couldn’t remember all details, so I’ll evaluate this related claim, also about Imagen 3, from December:
I would give this 3/5. We keep the top-hatted cat and the basketball-holding farmer, and the bell is finally on the llama’s tail. But the raven picture isn’t stained glass, and the fox still doesn’t have lipstick. I tried to contact Edwin for confirmation, without success. I wondered what had happened to him, and a quick search found that his AI data-labeling company did very well and he’s now probably a billionaire. I hope he’s relaxing on a yacht somewhere, far away from angry prediction market commenters. In the absence of a grader, I figured I would let the bet run out the clock. Image Set 5: May - June 2025 These are using ChatGPT 4o, released in May 2025, all images generated June 1 (thanks a reader):
Not only is this 5/5, but it’s an obvious step up in matching the styles, and these were all produced on the first try. In retrospect, it feels like judges were right to dismiss former models, which were sort of blundering about and getting some of them right by coincidence. 4o just works. Edwin is presumably still on his yacht, but original contest judge Gwern gave it his seal of approval, saying:
In Memoriam: Your Last Set Of Goalposts, Gone But Not ForgottenIt’s probably bad form to write a whole blog post gloating that you won a bet. I’m doing it anyway, because we’re still having the same debate - whether AI is a “stochastic parrot” that will never be able to go beyond “mere pattern-matching” into the realm of “real understanding”. My position has always been that there’s no fundamental difference: you just move from matching shallow patterns to deeper patterns, and when the patterns are as deep as the ones humans can match, we call that “real understanding”. This isn’t quite right - there’s a certain form of mental agency that humans still do much better than AIs - but again, it’s a (large) difference in degree rather than in kind. I think this thesis has done well so far. So far, every time people have claimed there’s something an AI can never do without “real understanding”, the AI has accomplished it with better pattern-matching. This was true back in 2020 when GPT-2 failed to add 2+1 and Gary Marcus declared that scaling had failed and it was time to “consider investing in different approaches” (according to Terence Tao, working with AIs is now “on par with trying to advise a mediocre, but not completely incompetent, static simulation of a graduate student”). I think progress in AI art tells the same story. There is still one discordant note in this story. When I give 4o a really hard prompt…
…it still can’t get it quite right: But a smart human can complete an arbitrarily complicated prompt. So is there still some sense in which the AI is “just pattern matching”, but the human is “really understanding”? Maybe AIs get better at pattern-matching as they scale up, and eventually they’ll get good enough for every conceivable reasonably task, but they still won’t be infinitely good in the same way humans are? I think there’s something going on here where the AI is doing the equivalent of a human trying to keep a prompt in working memory after hearing it once - something we can’t do arbitrarily well. I admit I can’t prove this, and it’s not necessarily intuitive - the AI does have a scratchpad, not to mention it has the prompt in front of it the whole time. It’s just what makes sense to me based on an analogy with math problems, where AIs often break down at the same point humans do (eg they can multiply two-digit numbers “in their head”, but not three-digit numbers). I think this will be solved when we solve agency well enough that the AI can generate plans like drawing part of the picture at a time, then checking the prompt, then doing the rest of it. This may require new skills, like self-reference and planning, which might be added in by hand, emerge naturally from the scaling and training process, or some combination of both. If you disagree, let me know - maybe we can bet on it! Thanks to everyone who helped operationalize, judge, and generate images for this bet. Vitor, you owe me $100, email me at sc...@slatestarcodex.com. You're currently a free subscriber to Astral Codex Ten. For the full experience, upgrade your subscription. |