Be Reasonable: ‘Reasoning’ Models Don’t Really Reason, Apple Scientists Report

Insider Brief

  • A new Apple study finds that advanced AI reasoning models struggle with complex problem-solving, often failing completely as task difficulty increases.
  • The research shows that Large Reasoning Models (LRMs) reduce reasoning effort as puzzles grow harder, even when given the algorithm or enough compute.
  • Despite generating detailed thought sequences, LRMs do not demonstrate consistent improvements over standard models and often rely on shallow strategies that collapse under pressure.

New research from scientists at Apple casts doubt on the problem-solving abilities of today’s most advanced artificial intelligence models, revealing fundamental limitations in their capacity to reason through complex challenges, even when given ample computational resources.

The study, titled “The Illusion of Thinking,” explores how so-called Large Reasoning Models (LRMs) — a new class of AI designed to “think” through a problem step by step — handle puzzles of increasing difficulty. Unlike standard large language models (LLMs), which output answers directly, LRMs are optimized to generate detailed reasoning traces before reaching conclusions. The researchers set out to examine whether this approach truly enhances reasoning or merely gives the appearance of deeper understanding.

Their findings suggest that these models can hit a cognitive wall of sorts. When faced with increasingly complex puzzles, such as multi-step planning tasks in Tower of Hanoi or River Crossing scenarios, their accuracy drops sharply, often to zero. Surprisingly, the study shows that these models begin to reduce their reasoning effort as tasks get harder, even when they still have room to generate more tokens. This counterintuitive pattern raises questions about whether LRMs genuinely engage in deeper reasoning, or whether they simply exhaust their learned strategies early and give up.

The work was conducted by researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar, and was published by Apple. Rather than relying on standard math and coding benchmarks — many of which are vulnerable to data contamination — the team used four controlled puzzle environments with adjustable complexity: Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World. Each puzzle required sequential reasoning and precise rule-following, allowing the researchers to isolate the models’ true reasoning capacities.
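To make the setup concrete, a puzzle environment of this kind can be sketched in a few lines of Python for Tower of Hanoi, with the number of disks serving as the complexity dial and a deterministic checker validating any proposed move sequence. The function name and interface below are illustrative assumptions, not the study’s actual test harness.

```python
# Minimal sketch of a controllable puzzle environment (Tower of Hanoi).
# The number of disks `n` is the complexity knob; a deterministic simulator
# checks whether a proposed move sequence actually solves the puzzle.
# Interface and names are illustrative, not the study's actual harness.

def is_valid_solution(n, moves):
    """Return True if `moves` (a list of (src, dst) peg indices 0-2)
    legally transfers all `n` disks from peg 0 to peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1 (top = smallest)
    for src, dst in moves:
        if not pegs[src]:
            return False                     # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # all disks stacked on the goal peg


# Example: the optimal 3-disk solution (7 moves) passes the check.
solution_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(is_valid_solution(3, solution_3))      # True
```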

In simple puzzles, conventional LLMs — without any added reasoning steps — often outperformed their “thinking” counterparts, and did so more efficiently. As puzzles grew moderately more difficult, LRMs gained an advantage, showcasing the potential benefit of reasoning traces. But when pushed into higher complexity — longer planning paths, deeper compositional logic — both LRMs and LLMs broke down completely.

This “three-regime” pattern repeated across all puzzle types. In the low-complexity regime, standard models excelled. In the mid-complexity range, LRMs proved useful. But at high complexity, neither class of model could reliably solve problems.

Undermined Claims

One of the study’s most unexpected findings was the LRMs’ failure to benefit from being given the solution algorithm upfront. For example, even when researchers provided the correct steps for solving the Tower of Hanoi puzzle, the models failed to execute them accurately once the number of disks exceeded a certain threshold. This undermines claims that LRMs can perform symbolic manipulation or execute deterministic processes reliably, both considered key markers of true reasoning ability.
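The procedure in question is the classic recursive Tower of Hanoi algorithm, which mechanically spells out every move for any number of disks; executing it requires no search or insight, only faithful bookkeeping. The sketch below is the textbook version of that recursion, not necessarily the exact formulation the researchers supplied to the models.

```python
# Classic recursive Tower of Hanoi: executing it is purely mechanical,
# producing exactly 2**n - 1 moves for n disks. This is the textbook
# algorithm, not necessarily the exact prompt used in the study.

def hanoi_moves(n, src=0, aux=1, dst=2):
    """Yield (src, dst) peg moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)  # park n-1 disks on the spare peg
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, aux, src, dst)  # re-stack the n-1 disks on top

print(len(list(hanoi_moves(8))))  # 255 moves for 8 disks
```

Because the move count grows as 2^n - 1, each additional disk doubles the length of the solution a model must carry out without a single illegal move, which is where the reported breakdowns appear.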

Another major insight came from analyzing the models’ “thoughts” — the sequences of intermediate steps produced before arriving at a final answer. At low complexity, LRMs often found correct solutions early in their traces but kept reasoning anyway, sometimes veering into incorrect territory, an inefficiency dubbed the “overthinking” phenomenon. As complexity increased, the models tended to start with incorrect approaches and only reached the right answer after much effort. But at high complexity, the models failed to land on any correct solution at all, and often shortened their reasoning traces, possibly reflecting a learned futility in trying.
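One way to quantify this “overthinking” pattern is to replay each candidate solution that appears inside a trace through the puzzle simulator and record where the first correct one occurs. The snippet below sketches that idea; the `extract_candidates` parser and the position metric are illustrative assumptions rather than the paper’s actual tooling.

```python
# Illustrative sketch: locate the first correct candidate inside a reasoning
# trace by replaying each one in the deterministic simulator. The
# `extract_candidates` parser and the position metric are assumptions for
# illustration, not the paper's actual analysis code.

def first_correct_position(candidates, n, checker):
    """Return the relative position (0.0-1.0) of the first correct candidate
    in the trace, or None if no candidate solves the n-disk puzzle."""
    for i, moves in enumerate(candidates):
        if checker(n, moves):
            return i / max(len(candidates) - 1, 1)
    return None

# Usage (assuming `is_valid_solution` from the environment sketch above):
# candidates = extract_candidates(model_trace)   # hypothetical trace parser
# pos = first_correct_position(candidates, n=3, checker=is_valid_solution)
# A small `pos` on easy puzzles, with the trace continuing long afterwards,
# matches the "overthinking" pattern; `None` matches the high-complexity collapse.
```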

The authors say these behaviors reveal a compute scaling limitation: as the cognitive demand of a problem grows, the model’s internal strategy breaks down, even when more tokens or compute are available. This limitation suggests that current LRMs have not yet developed generalizable reasoning strategies but rely on surface-level heuristics that fail under pressure.

Reinforcement Learning May Not Lead to Robust Reasoning

The study also examined whether training methods like reinforcement learning actually help models learn new reasoning patterns. The authors’ analysis suggests that while these techniques may improve performance on curated benchmarks, they do not lead to fundamentally more robust reasoning. In fact, when models were compared across different datasets and environments, performance was highly inconsistent, raising the possibility that some gains come from memorized patterns rather than genuine understanding.

Despite the sobering results, the researchers see value in their approach. By moving away from fixed benchmarks and into controllable environments, they were able to draw more nuanced conclusions about reasoning performance. Their work also points to new opportunities for designing future models: emphasizing exact computation, creating more diverse training data that fosters generalization, and developing better tools for evaluating internal reasoning traces.

Still, the findings represent a cautionary note for developers and users of advanced AI. While reasoning models may appear to “think,” this illusion breaks down under scrutiny. The ability to generate verbose explanations and simulate reflection does not necessarily mean the model understands what it is doing.

Limitations

The researchers note that their study is constrained to narrow planning problems and may not capture the full scope of real-world reasoning tasks. They also acknowledge that black-box access to closed-source models limited their ability to inspect architectural contributions to failure.

While deterministic simulators made validation possible in these environments, they may not generalize to more ambiguous or open-ended domains.
