Apple Just Pulled the Plug on the AI Hype - apologies to those that got two copies of this!


Robert Lewis

Aug 31, 2025, 8:28:43 PM

Apple Just Pulled the Plug on the AI Hype. Here’s What Their Shocking Study Found

New research reveals that today’s “reasoning” models aren’t thinking at all. They’re just sophisticated pattern-matchers that completely break down when things get tough


We’re living in an era of incredible AI hype. Every week, a new model is announced that promises to “reason,” “think,” and “plan” better than the last. But what if that “reasoning” is largely an illusion? That’s the bombshell conclusion from a quiet, systematic study published by a team of researchers at Apple. They didn’t rely on hype or flashy demos. Instead, they put these so-called “Large Reasoning Models” (LRMs) to the test in a controlled environment, and what they found shatters the entire narrative.

In this article, I’m going to break down their findings for you, without the dense academic jargon. Because what they discovered isn’t just an incremental finding: it’s a fundamental reality check for the entire AI industry.

Why We’ve Been Fooled by AI “Reasoning”

First, you have to ask: how do we even test if an AI can “reason”?

Usually, companies point to benchmarks like complex math problems (MATH-500) or coding challenges. And sure, models like Claude 3.7 and DeepSeek-R1 are getting better at these. But the Apple researchers point out a massive flaw in this approach: data contamination.

In simple terms, these models have been trained on a huge chunk of the internet, which almost certainly includes the solutions to many of these benchmark problems. A high score might reflect memorization rather than genuine reasoning.

This is why the researchers threw out the standard benchmarks. Instead, they built a more rigorous proving ground.

The AI Proving Ground: Puzzles, Not Problems

To truly test reasoning, you need a task that is:

  1. Controllable: You can make it slightly harder or easier.

  2. Uncontaminated: The model has almost certainly never seen the exact solution.

  3. Logical: It follows clear, unbreakable rules.

So, the researchers turned to classic logic puzzles: Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping.

These puzzles are perfect. You can’t “fudge” the answer. Either you follow the rules and solve it, or you don’t. By simply increasing the number of disks in Tower of Hanoi or blocks in Blocks World, they could precisely crank up the complexity and watch how the AI responded.
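To see why this makes such a clean difficulty dial, note a well-known fact about Tower of Hanoi: the shortest solution for n disks is exactly 2^n - 1 moves, so every extra disk roughly doubles the amount of correct work required. A quick back-of-the-envelope check (my own illustration, not code from the study):

```python
# Minimal sketch (plain Python, not from the paper): the optimal Tower of Hanoi
# solution for n disks takes exactly 2**n - 1 moves, so each added disk roughly
# doubles the length of a correct answer.

for n in (3, 5, 7, 10, 15):
    print(f"{n} disks -> {2**n - 1} moves in the shortest possible solution")

# prints 7, 31, 127, 1023 and 32767 moves respectively
```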


This is where the illusion of thinking began to crumble.

When they ran the tests, a clear and disturbing pattern emerged. The performance of these advanced reasoning models didn’t just decline as problems got harder — it fell off a cliff.

The researchers identified three distinct regimes of performance:

  • Low-Complexity Tasks: Here’s the first surprise. On simple puzzles, standard models (like the regular Claude 3.7 Sonnet) actually outperformed their “thinking” counterparts. They were faster, more accurate, and used far fewer computational resources. The extra “thinking” was just inefficient overhead.

  • Medium-Complexity Tasks: This is the sweet spot where the reasoning models finally showed an advantage. The extra “thinking” time and chain-of-thought processing helped them solve problems that the standard models couldn’t. This is the zone that AI companies love to demo. It looks like real progress.

  • High-Complexity Tasks: And this is where it all goes wrong. Beyond a certain complexity threshold, both model types experienced a complete and total collapse. Their accuracy plummeted to zero. Not 10%. Not 5%. Zero.

This isn’t a graceful degradation. It’s a fundamental failure. The models that could solve a 7-disk Tower of Hanoi puzzle were utterly incapable of solving a 10-disk one, even though the underlying logic is identical. This finding alone destroys the narrative that these models have developed generalizable reasoning skills.
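To make “the underlying logic is identical” concrete, here is the textbook recursive solution, sketched in Python (my illustration, not anything from the paper). The function that solves a 7-disk puzzle is exactly the function that solves a 10-disk one; the only thing that grows is the list of moves it produces.

```python
# My own sketch, not the study's code: the recursive procedure that solves
# Tower of Hanoi is the same for 7 disks as for 10; only the output gets longer.

def hanoi(n, source, target, spare, moves):
    """Append the moves that shift n disks from `source` to `target` onto `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # re-stack the smaller disks on top of it

for n in (7, 10):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(f"{n} disks: {len(moves)} moves")     # 127 moves, then 1023 moves
```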

Even Weirder: When the Going Gets Tough, AI Gives Up

This is where the study gets truly bizarre. You would assume that when a problem gets harder, a “thinking” model would, well, think harder. It would use more of its allocated processing power and token budget to work through the more complex steps.

But the Apple researchers found the exact opposite.

As the puzzles crossed into the hardest regime, the models actually used fewer reasoning tokens, not more, even though they had plenty of token budget left to spend.

Let that sink in.

Faced with a harder challenge, the AI’s reasoning effort decreased. It’s like a marathon runner who, upon seeing a steep hill at mile 20, decides to start walking slower instead of digging deeper, even though they have plenty of energy left. It’s a counter-intuitive and deeply illogical behavior that suggests the model “knows” it’s out of its depth and simply gives up.

Inside the AI’s “Mind”: A Tale of Overthinking and Underthinking

The researchers didn’t stop at just measuring final accuracy. They went deeper, analyzing the “thought” process of the models step-by-step to see how they were failing.

What they found was a story of profound inefficiency.

  • On easy problems, models “overthink.” They would often find the correct solution very early in their thought process. But instead of stopping and giving the answer, they would continue to explore dozens of incorrect paths, wasting massive amounts of computation.

  • On hard problems, models “underthink.” This is the flip side of the collapse. When the complexity was high, the models failed to find any correct intermediate solutions. Their thought process was just a jumble of failed attempts from the very beginning. They never even got on the right track.

The Final Nail in the Coffin: The “Cheat Sheet” Test

If there was any lingering doubt about whether these models were truly reasoning, the researchers designed one final, damning experiment.

They took the Tower of Hanoi puzzle, a task with a well-known recursive algorithm, and literally gave the AI the answer key. They provided the model with a perfect, step-by-step pseudocode algorithm for solving the puzzle. The model’s only job was to execute the instructions. It didn’t have to invent a strategy; it just had to follow the recipe.

The result?

The models still failed at the exact same complexity level.

This is the most crucial finding in the entire paper. It proves that the limitation isn’t in problem-solving or high-level planning. The limitation is in the model’s inability to consistently follow a chain of logical steps. If an AI can’t even follow explicit instructions for a simple, rule-based task, then it is not “reasoning” in any meaningful human sense.
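To underline how mechanical that task is, here is a rough sketch (mine, not the study’s evaluation harness) of what “follow the chain of steps” amounts to for Tower of Hanoi: replaying and checking a proposed move list is a simple loop, with no planning or insight involved.

```python
# Rough illustration (not the researchers' code): verifying a Tower of Hanoi
# move list against the rules is purely mechanical.

def replay(n_disks, moves):
    """Return True if `moves` legally transfers every disk from peg 'A' to peg 'C'."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # largest disk at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                             # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                             # a larger disk may never sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))  # did everything end up on the target peg?

# Fed the move list produced by the recursive solver sketched earlier,
# replay(10, moves) returns True; one illegal move anywhere flips it to False.
```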


So, What Are We Actually Witnessing?

The Apple study, titled “The Illusion of Thinking,” forces us to confront an uncomfortable truth. The “reasoning” we’re seeing in today’s most advanced AI models is not a budding form of general intelligence.

It mimics reasoning by pattern-matching against solutions it has seen before, and that mimicry falls apart the moment a problem drifts too far from familiar patterns.

The bottom line from Apple’s research is stark: we’re not witnessing the birth of AI reasoning. We’re seeing the limits of very expensive autocomplete that breaks when it matters most.

The AGI timeline didn’t just get a reality check. It might have been reset entirely.

So the next time you hear about a new AI that can “reason,” ask yourself: is it actually thinking? Or is it just running the most expensive and convincing magic trick in history?


