Not enough detail for any conclusions. According to the blog post on arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough),
“OpenAI’s new o3 system—trained on the ARC-AGI-1 Public Training set—has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.”
This excerpt is central to the discussion. The blog announces what it calls a “breakthrough” result, attributing the performance on an evaluation set to the new “o3 system.” The mention of the “$10k compute limit” probably refers to a budget constraint for training and/or inference on the public leaderboard. Additionally, there is a statement that when the system is scaled up dramatically (172 times more compute), it manages to score 87.5%. The difference between the 75.7% and 87.5% results is thus explained by a large disparity in the computational budget used for training or inference.
More significantly, as I’ve highlighted in bold, the model was explicitly trained on the very same data (or a substantial subset of it) against which it was later tested. The text itself says: “trained on the ARC-AGI-1 Public Training set” and then, in the next phrase, reports scores on “the Semi-Private Evaluation set,” which is presumably meant to be the test portion of that same overall dataset (or at least closely related). While it is possible in machine learning to maintain a strictly separate portion of data for testing, the mention of “Semi-Private” invites questions about how distinct or withheld that portion really is. If the “Semi-Private Evaluation set” is derived from the same overall data distribution used in training, or if it contains overlapping examples, then the resulting 75.7% or 87.5% scores might reflect overfitting/memorization more than genuine progress toward robust, generalizable intelligence.
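To make that worry concrete, here is a minimal sketch, assuming local copies of ARC-style JSON task files in two directories (the paths and the hashing scheme are my own illustrative choices, not anything OpenAI or the ARC Prize team have described), that flags exact duplicates between a training set and an evaluation set:

# Minimal sketch: flag evaluation tasks that are identical (up to JSON key
# order) to a task in the training set. Directory paths are hypothetical.
import json
import hashlib
from pathlib import Path

def task_fingerprint(path: Path) -> str:
    task = json.loads(path.read_text())
    canonical = json.dumps(task, sort_keys=True)  # key-order-independent
    return hashlib.sha256(canonical.encode()).hexdigest()

def exact_overlap(train_dir: str, eval_dir: str) -> set[str]:
    train_hashes = {task_fingerprint(p) for p in Path(train_dir).glob("*.json")}
    return {p.name for p in Path(eval_dir).glob("*.json")
            if task_fingerprint(p) in train_hashes}

# print(exact_overlap("arc/training", "arc/evaluation"))  # hypothetical paths

Of course, a check like this only catches verbatim duplicates; the deeper concern raised above, that the evaluation tasks come from the same distribution the model was tuned on, cannot be detected by any hash comparison.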
The separation of training data from test or evaluation data is critical to ensure that performance metrics capture generalization, rather than the model having “seen” or memorized the answers during training. When a blog post highlights a “breakthrough” while simultaneously acknowledging that the data used to measure said breakthrough was closely related to the training set, skeptics like yours truly naturally question whether this milestone is more about tuning to a known distribution than a leap in fundamental capabilities. Memorization is not reasoning, as I’ve stated many times before. But this falls on deaf ears here all the time.
Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” there is an implied process of repeated tuning or hyperparameter searches. If the developers iterated many times over that dataset—adjusting parameters, architecture decisions, or training strategies to maximize performance on that very distribution—then the reported results are likely inflated. In other words, repeated attempts to boost the benchmark score can lead to an overly optimistic portrayal of performance. Everybody knows this as “benchmark gaming,” or, in milder terms, accidental overfitting to the validation/test data that was supposed to be isolated.
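The inflation effect is easy to demonstrate with a toy simulation (an illustration of the general phenomenon, not a claim about how o3 was actually tuned): give 200 equally mediocre “configurations” a fixed 100-task validation set, keep the best one, and compare its selected score to a fresh set.

# Toy illustration of validation-set overfitting via repeated selection.
# All numbers are made up for the demonstration.
import random

random.seed(0)
N_TASKS = 100        # tasks in the fixed validation set
N_CONFIGS = 200      # hyperparameter settings tried against it
TRUE_SKILL = 0.5     # every configuration really solves 50% of tasks

def score(n_tasks: int, p: float) -> float:
    return sum(random.random() < p for _ in range(n_tasks)) / n_tasks

best_on_val = max(score(N_TASKS, TRUE_SKILL) for _ in range(N_CONFIGS))
on_fresh_set = score(N_TASKS, TRUE_SKILL)  # the chosen config, unseen tasks
print(f"best validation score: {best_on_val:.2f}")   # typically ~0.60 or higher
print(f"same skill, fresh set: {on_fresh_set:.2f}")  # hovers around 0.50

The “winner” typically looks ten points or more better than it really is, purely because it was selected on the same set it is reported on.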
The blog post mentions two different performance figures: one achieved under the publicly stated $10k compute limit, and another, much higher score (87.5%) when the system was scaled up 172 times in compute expenditure. Consider the cost of such experiments for so small a performance bump! That’s not a positive sign for optimists on this question.
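To put rough numbers on that (a back-of-the-envelope sketch only, assuming the low-compute run actually used the full $10k budget, which the blog does not state):

# Back-of-the-envelope arithmetic; the $10k-per-run assumption is mine.
base_cost = 10_000             # USD: the stated public leaderboard limit
scale_factor = 172             # the "high-compute (172x)" configuration
base_score, high_score = 75.7, 87.5

high_cost = base_cost * scale_factor           # ~$1.72M under that assumption
gain = high_score - base_score                 # 11.8 percentage points
print(f"~${high_cost:,.0f} for {gain:.1f} extra points")
print(f"~${(high_cost - base_cost) / gain:,.0f} per additional point")

On those assumptions you are paying roughly $145k per additional percentage point, which is the kind of costliness I mean.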
AI models can indeed be improved—sometimes dramatically—by throwing more compute at them. However, from the perspective of practicality or genuine progress, it is less impressive if the improvement depends purely on scaling up hardware resources by a large factor rather than on a new or more efficient approach to learning and reasoning. Is the expense warranted if the tasks the model is solving do not require advanced reasoning skills, and a much smaller system (or a human child) can handle them in a simpler way?
If a large-scale AI system with a massive compute budget is merely matching or modestly exceeding the performance that a human child can achieve, it undercuts the notion of a major “breakthrough.” Additionally, children’s ability to adapt to novel tasks and generalize without being artificially “trained” on the same data is a key part of the skepticism: the kind of intelligence the AI system is demonstrating might be narrower or more brittle compared to natural, human-like intelligence.
All of this underscores how these “breakthrough” claims can be misleading if not accompanied by rigorous methodology (e.g., truly held-out data, minimal overlap, reproducible results under consistent compute budgets). While the raw numbers of 75.7% and 87.5% might look impressive at face value, the context provided—including the fact that the ARC-AGI-1 dataset was also used for training—casts doubt on the significance of those scores as an indicator of robust progress in AI or alignment research.
I leave you with a quote from the blog:
Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
And even with that, the memory vs. reasoning problem doesn’t vanish. Either you believe your interlocutor is generally intelligent or you don’t. But I’ve repeated this so many times that it’s getting too time-consuming to keep responding in detail. Thanks to John anyway for posting past all the narcissism-needing-therapy spam here. It’s getting tedious; I have to agree with Quentin. Fewer and fewer posts combine the big picture with a good level of nuance.
> there is a statement that when the system is scaled up dramatically (172 times more compute resources), it manages to score 87.5%. The difference between the 75.7% result and 87.5% result is thus explained by a large disparity in the computational budget used for training or inference.
> the model was explicitly trained on the very same data (or a substantial subset of it) against which it was later tested. The text itself says: “trained on the ARC-AGI-1 Public Training set”
> Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” there is an implied process of repeated tuning or hyperparameter searches.
> children’s ability to adapt to novel tasks and generalize without being artificially “trained” on the same data is a key part of the skepticism:
> a quote from the blog:
"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet."
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3,
@Brent. Shut up, you woke communist! In case you don't know, you are a straight white male. If that woke feminazi had won the election, you would have been the first to be exterminated. Be glad that Trump won!
"indicating fundamental differences with human intelligence."
Dude! Human intelligence is the property of consciousness that allows it to bring new ideas into existence out of nothing.
You cannot simulate such a thing.
Omg... so many children! When will you people ever grow up?
> While it’s true that scaling a model’s compute often improves performance—e.g., O3 going from 75.7% to 87.5%—that alone doesn’t prove that we’ve achieved “fundamental AGI.”
> In machine learning history, many benchmarks have been surpassed by throwing more resources at them, yet models often fail when faced with novel tasks.
> I’ve personally seen kids under age ten handle about 80 to upwards of 90% of the daily “play” tasks (besides acing the 6 problems on the landing page) on the ARC site once they grasp the basic rule of finding the rule,
> As for the claim that François Chollet “moved the goalpost” once AI systems approached the 75% mark, it’s common in AI research for benchmarks to evolve precisely because scoring high on an older test doesn’t necessarily reflect deep, generalizable reasoning.
> simulating what appears to be reasoning or problem-solving.
> For instance, an LLM solving a riddle or answering a complex question does so by leveraging patterns that mimic logical steps or dependencies, even though it lacks true understanding
> It feels different and "more intelligent" because this functional selection imparts a structured response that aligns with human expectations of reasoning.
> this is far from genuine intelligence or reasoning. LLMs are bound by their probabilistic nature and lack the ability to generalize beyond their training data,
> or generate higher-order abstractions.
> You see what I'm getting at. It’s true that from a purely outcome-based perspective—either a problem is solved, or it isn’t—it can seem irrelevant whether an AI is “really” reasoning or simply following patterns. This I'll gladly concede to John's argument. If Einstein’s “real reasoning” and an AI’s “simulated reasoning” both yield a correct solution, the difference might appear purely philosophical.
> However, when we look at how that solution is derived—whether there’s a coherent, reusable framework or just a brute-force pattern assembly within a large but narrowly defined training distribution—we begin to see distinctions that matter. Achievements like superhuman board-game play and accurate protein folding often leverage enormous coverage in training or simulation data.
> This can conceal limited adaptability when confronted with radically unfamiliar inputs, and even if a large model appears to have used “higher-order abstraction,” it may not have produced a stable or generalizable theory—only an ad hoc solution that might fail under new conditions.
> in high-risk domains like brain surgery, we risk encountering unforeseen circumstances well beyond any training distribution.
> AI’s specialized “intelligence” is inefficient compared to a competent human team
> Other domains off the top in which this robust kind of generalization is critical are Self-Driving Vehicles in unmapped terrains or extreme weather:
> Another one would be financial trading during market crises: extreme market shifts can quickly invalidate learned patterns, and “intuitive” AI decisions might fail without robust abstraction and lose billions.
> How about nuclear power plant operations? Rare emergency scenarios, if not seen in training, could lead to catastrophic outcomes
I’d note first that your analogy with chess and engine moves ignores an asymmetry: a grandmaster can often sense the “non-human” quality of an engine’s play, whereas engines are not equipped to detect distinctly human patterns unless explicitly trained for that. Even then, the very subtleties grandmasters notice—intuitive plausibility, certain psychological hallmarks—are not easily reduced to a static dataset of “human moves.” That’s why a GM plus an engine can typically spot purely artificial play in a way the engine itself cannot reciprocate, particularly over longer sequences of moves, though sometimes a single move suffices.
Chess self-play is btw trivial to scale because it operates in a closed domain with clear rules. You can’t replicate that level of synthetic data generation in, say, urban traffic, nuclear power plants, surgery, or modern warfare.
On self-driving cars, relying on “training” alone to handle out-of-distribution anomalies can lead to catastrophic, repetitive failures in the face of curbside anomalies of various kinds (unusual geometries, rock formations, obstacles not in the training data, etc.). It is effin disturbing to see a vehicle costing x thousand dollars repeatedly slam into an obstacle, change direction, only to do it again and again and again…
Humans, even dumb ones like yours truly, by contrast tend to halt or re-evaluate instantly after one near miss. Equating “education” for humans with “training data” for AI ignores how real-world safety demands robust adaptation to new, possibly singular events.
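The behavioural difference I’m pointing at is almost embarrassingly simple to write down (a caricature in a few lines of Python, not a real planner):

# Caricature only: contrast blind retry with "one near miss and I stop".
def naive_policy(failed_attempts: int) -> str:
    # keeps re-issuing the same manoeuvre as long as it scores highest
    return "retry_same_path"

def cautious_policy(failed_attempts: int) -> str:
    # after a single failed attempt, halt and re-evaluate the situation
    return "halt_and_reassess" if failed_attempts >= 1 else "retry_same_path"

for attempt in range(3):
    print(attempt, naive_policy(attempt), cautious_policy(attempt))
# naive: retries forever; cautious: one retry at most, then stops

The hard part, obviously, is not writing the halt rule but detecting that “this situation is not one I was trained for” in the first place, which is exactly where current perception stacks struggle.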
Now you will again perform one of the following rhetorical strategies to obfuscate:
1. Equate AI with human/Löbian learning through whataboutism or symmetry arguments
2. Undermine the uniqueness and gravity of AI limitations by highlighting human failures/mistakes, thereby diluting AI risks
3. Cherry-pick examples of dramatic-seeming AI advantages
4. Employ historical allusions to show parallels between AI and scientific discoveries. E.g., the references to Planck’s “quantized atom” and the slow, iterative acceptance of quantum mechanics cast doubt on the suggestion that AI’s ad hoc solutions are a unique liability. The argument is that scientists, too, often start with incomplete or approximate theories and refine them over time. While this is valid to a point, it conflates two different processes: humans can revise and expand theories with continuing insight, whereas most current AI systems cannot autonomously re-architect their own approach without extensive retraining or reprogramming.
As well as others I don't want to bother with.
But this is irrelevant, because in my last post, which wasn’t meant for you, I prepared a trap regardless, as I will only engage good-faith arguments that push the discussion in new directions (I will engage more deeply with Brent’s reply for this reason). And, being your usual modest self, you took the bait, which ends my conversation with you on this subject for now, as I will illustrate in the concluding section of this post. Even if I appreciate your input to this list, your posts hyping AI advances are at this point becoming almost all spam/advertising for ideological reasons.
The bait was to lure you into clarifying your stance that “a problem is solved or it isn’t” while simultaneously implying that AI failure is more tolerable than human failure. That is self-contradictory. If correctness is correctness, then the consequences of AI mistakes should weigh at least as heavily as those of human missteps—yet you are clearly content to overlook AI’s more rigid blind spots.
That’s rigid ideology/personal belief validation masking itself as discussion. It does not respect the sincere and open discussion that makes this list a resource, imho. Employ your usual rhetorical tricks and split hairs as you wish in reply. Your horse in this race is revealed. You are free to believe in perfect AIs and let them run your monetary affairs, surgery, life, etc. I simply don’t care and wish you all the best with that.
I think that LLM thinking is not like reasoning. It reminds me of my grandfather, who was a cattleman in Texas. I used to go to auction with him, where he would buy calves to raise and where he would auction off ones he had raised. He could do calculations of what to pay, gain and loss, expected prices, cost of feed, all in his head almost instantly. But he couldn't explain how he did it. He could do it with pencil and paper and explain that, but not how he did it in his head. So although the arithmetic would be of the same kind, he couldn't figure insurance rates and payouts, or medical expenses, or home construction costs in his head. The difference with LLMs is that they have absorbed so many examples on every subject, as my grandfather had of auctioning cattle, that the LLMs don't have reasoning, they have finely developed intuition, and they have it about every subject. Humans don't have the capacity to develop that level of intuition about more than one or two subjects; beyond that they have to rely on slow, formal reasoning.
Nail on the head. That’s an interesting story, and it got me thinking (too much again). It reminds me of the distinction in Kahneman’s 2011 bestseller (Thinking, Fast and Slow) between “type 1” and “type 2” thinking (aka “system 1” and “system 2”). And although there are problems/controversies with his ideas/evidence, as is almost standard with psychological proposals of this kind, we can, if you indulge me, take his ideas less literally, as describing loose cognitive styles of reasoning, as follows:
LLMs today (end of 2024), with their vast memory-like parameter sets, are closer to system 1 “intuitive reasoning styles” that appear startlingly broad and context-sensitive within their area(s) of expertise—even “finely developed”, as you say, and somewhat like your grandfather’s style of reasoning. Yet, in my view, LLMs (not wishing to imply anything about your grandfather) are not performing the slower, more deliberative, rule-based thinking we often associate with system 2.
Take the example of Magnus Carlsen, arguably one of the best-performing chess players in recent history, who relies increasingly on instinctive, “type 1” moves after thousands of hours of deliberate “type 2” calculation (when he was young and ascending the throne, he was better known for his calculation skills). This illustrates how extensive practice can shift certain skills from a laborious, step-by-step process toward fast, experience-based pattern recognition. I’ve observed many times, especially in recent years when he has too little time to calculate his way through a position, how he looks at the clock in a time-crunch situation and makes decisive, critical, computeresque moves based purely on a gut feeling more precise and finely tuned than his opponent’s. In streams where he comments on other GMs’ play, he often instantly blurts out statements like “that knight just belongs on e6” or “the position is screaming for bishop f8” and similar, without a second of thought or a look at the AI/engine evaluation.
A similar thing happens, perhaps not at a world-class level (but even Magnus makes mistakes, just less often than others), when a person learns to drive: at first, every procedure (shifting gears, 360-degree checks, mirror vigilance, steering) is deliberate and conscious, whereas a practiced driver performs multiple operations fluently and simultaneously without thinking in a type 2 manner. In most advanced tasks—like playing chess, doing math, or solving puzzles—humans merge both system 1 and system 2, but they may rely more on one depending on experience and context.
I see current LLMs as occupying a distinct spot: the “intuition” they rely on is not gained through slow, personal experience in one domain, but rather from the vast breadth of their pretraining across nearly every field for which text data or code exists. Again, it’s not just static memory, and I do recognize the nuance that they are not merely memorizing answers to questions or content. AGI advocates scream: “LLMs interpolate, so it’s reasoning!” Unfortunately, they don’t often specify what that means.
What they primarily memorize, if I understand correctly, is functions/programs, in a certain sense. Programs do generalize to some extent by mathematical definition. When John questions an LLM via his keyboard, he is essentially querying a point in program space, where we can think of the LLM in some domain as a manifold with each point encoding a function/program. Yes, they do interpolate across these manifolds to combine/compose programs, which implies an infinite number of possible programs that John can choose from through his keyboard. That is why they appear to reason richly and organically, unlike in earlier years, and can help debug code in sophisticated ways with human assistance (having been trained against compositions that yield false results).
So what we’re doing with LLMs is training them as rich, flexible models to predict the next token. If we had infinite memory capacity, we could simply have them learn a sort of lookup table. But the reality is more modest: LLMs only have some billions or trillions of parameters. That’s why they screw up basic grade-school math problems not in their training set, or fail to remove a single small element John doesn’t want in the complex image he’s generated with his favorite image generator. A model cannot learn a lookup table for every possible input sequence related to its training data; it is forced to compress. So what these programs “learn” are predictive functions, which take the form of vector functions, because the LLM is ultimately a curve, and the only thing we can encode with a curve is a set of vector functions: they take elements of the input sequence as inputs and output elements of what follows.
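A quick bit of arithmetic shows why the lookup table is out of the question and compression is forced (the vocabulary size is a made-up round number, not any particular model’s):

# Why memorization-as-lookup-table cannot work: the context space dwarfs
# any realistic parameter count. Numbers are illustrative only.
vocab_size = 50_000          # hypothetical token vocabulary
context_len = 8              # even a very short context window
params = 10**12              # a "trillion-parameter" model, as above

possible_contexts = vocab_size ** context_len   # ~3.9e37 distinct rows
print(f"lookup-table rows needed: {possible_contexts:.1e}")
print(f"parameters available:     {params:.1e}")
# The table is ~25 orders of magnitude too big to store, so the model has
# to fit a compressed function that generalizes across unseen contexts.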
Say John feeds it the works of Oscar Wilde for the first time, and the LLM has already “learned” a model of the English language. The text that John has input is slightly different and yet still the English language. That’s why it’s so good at emulating linguistic styles and elegance (or the lack thereof): it’s possible to model Oscar Wilde by reusing a lot of the functions learned in modeling English in general. It therefore becomes trivial to model Wilde’s style by simply deriving a style-transfer function that goes from the model of English to the Wilde-style texts. That’s how and why people are amazed at its linguistic dexterity in this sense.
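Here’s a toy analogy for that “reuse the general English model, then add a small style shift” idea (the two corpora are placeholder strings; a real model does this over far richer function spaces than bigram counts):

# Toy analogy: a big "general English" model plus a tiny style sample,
# blended by interpolation. Corpora are placeholders, not real data.
from collections import Counter, defaultdict

def bigram_counts(text: str) -> dict:
    counts = defaultdict(Counter)
    words = text.lower().split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

general_english = "the man walked to the door and the man spoke softly"
wilde_sample = "the man spoke in epigrams and the man smiled wickedly"

base = bigram_counts(general_english)    # stands in for the general model
style = bigram_counts(wilde_sample)      # a small amount of "Wilde"

def next_word_scores(prev: str, mix: float = 0.3) -> Counter:
    # mostly the general model, nudged toward the style sample
    scores = Counter()
    for w, c in base[prev].items():
        scores[w] += (1 - mix) * c
    for w, c in style[prev].items():
        scores[w] += mix * c
    return scores

print(next_word_scores("man").most_common(3))
# the style sample only needs to supply the delta, because the bulk of
# "knowing English" is already in the base counts

Fine-tuning a pretrained LLM on a Wilde corpus is, loosely, the grown-up version of this: the heavy lifting was already done in pretraining, and only a comparatively small adjustment is needed to land on the style.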
That’s why they appear to have an “intuitive” command of everything—law, biology, programming, philosophy—despite lacking the capacity to handle tasks that need deeper, stepwise reasoning or dynamic re-planning in new contexts. This resonates with your grandfather’s “cattle auction” case: he could handle his specialized domain through near-instant intuition but needed pencil and paper for general arithmetic outside of it, if I understand correctly. LLMs, similarly, may seem “fluent” in many subjects, yet they cannot truly engage in what we would label “system 2 program synthesis”, unless of course the training distribution already covers a close analogue.
In short, the reason LLM outputs feel like “type 1 reasoning on steroids” is that these models have memorized so many examples that their combined intuition extends across nearly all known textual domains. But when a problem truly demands formal reasoning steps absent from their training data, LLMs lack a real “type 2” counterpart—no robust self-critique, no internal program writing, and no persistent memory to refine their logic. We can therefore liken them to formidable intuition machines without the same embedded capacity for system 2, top-down reasoning or architectural self-modification that we see in real human skill acquisition. My big mistake is of course that John would never read Oscar Wilde.
As a short note to Russell: yes, they appear to be competent at such assistance for similar reasons. I don't see advancements in the coding use case as a function purely of scaling and compute. The more advanced models can assist in such ways because they have longer histories of having suppressed "wrong" function chains/interpolation. Thx for the inspo guys.
> I’d note first that your analogy with chess and engine moves ignores an asymmetry: a grandmaster can often sense the “non-human” quality of an engine’s play,
> whereas engines are not equipped to detect distinctly human patterns unless explicitly trained for that. That’s why a GM plus an engine can typically spot purely artificial play in a way the engine itself cannot reciprocate.
> Even then, the very subtleties grandmasters notice—intuitive plausibility, certain psychological hallmarks—are not easily reduced to a static dataset of “human moves.”
> Chess self-play is btw trivial to scale because it operates in a closed domain with clear rules.
> You can’t replicate that level of synthetic data generation in, say, urban traffic,
> nuclear power plants, surgery, or modern warfare.
> The bait was luring you to clarify your stance that “a problem is solved or it isn’t” while simultaneously implying that AI failure is more tolerable than human failure. That is self-contradictory.
> they [AIs] screw up basic grade school math problems
Diagnostics that mostly involve info that has been encoded in a machine-readable way will be available for machine use. However, it will take some time to encode the age of clothes, smell, choice of clothes, and the gut feeling that I can't quantify as anything other than "sick person".
Bad data will be a big problem for AI diagnostics for a long time. You can count on one thing, and that is the extremely shoddy data in all historical databases.
It is really bad, so bad that it is a problem for human interpretation of the data even when you have the patient with you to "correct" errors in the medical records.
When those problems have been corrected, we are no doubt in for a bright future.
For suitable diagnostics, however, the question is already not whether AI is beneficial, but rather how you train your AI to fit your patient specifics.
/Henrik