Apple Study Fuels Debate on AI Reasoning Skills

In early June 2025, Apple’s AI research division published a landmark paper that rigorously examines whether state-of-the-art large language models (LLMs) actually “think” through complex puzzles or merely retrieve patterns from vast training corpora. Titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” the study provides new empirical evidence on what Apple labels simulated reasoning (SR) models and their performance cliffs under increasing problem difficulty.
Study overview
Led by Parshin Shojaee and Iman Mirzadeh, with contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar, the Apple team builds on an earlier study of problems from the United States of America Mathematical Olympiad (USAMO) that found sub-5% success on novel proof problems. The research targets Large Reasoning Models (LRMs): LLMs fine-tuned with reinforcement learning to emit chain-of-thought traces that ostensibly mirror logical, step-by-step reasoning.
- Definition: LRMs produce a deliberative text “thought process” before generating final answers.
- Hypothesis: If models truly reason, performance should scale gracefully with problem complexity.
- Methodology: Benchmark classic puzzles across a wide complexity spectrum.
Puzzle-based experimental framework
Apple’s researchers assembled four canonical reasoning tasks and scaled them from trivial to extreme:
- Tower of Hanoi: Moving N disks between three pegs under the rule that no larger disk may sit atop a smaller one. Apple tested N=1 up to N=20 (over 1 million moves; a recursive solver is sketched after this list).
- Checker Jumping: Swapping two groups of colored checkers past each other along a one-dimensional row using slides and single jumps, with the required move count growing quadratically in the number of checkers.
- River Crossing: Transport constraints (e.g., wolf, goat, cabbage) with exponential state-space explosion.
- Blocks World: Stacking colored cubes to match target configurations under limited action sets.
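The Tower of Hanoi case is worth pausing on because an exact algorithm has been textbook material for decades: the standard recursion emits the minimal 2^N - 1 moves for any N, which is where the "over 1 million moves" figure for N=20 comes from. A minimal sketch in Python (our illustration; the peg names and function signature are not from the paper):

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Generate the minimal move sequence for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park n-1 smaller disks on the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top
    return moves

for n in (3, 10, 20):
    assert len(hanoi(n)) == 2**n - 1            # 7, 1,023, and 1,048,575 moves respectively
```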
Scaling complexity and token dynamics
Across these puzzles, Apple measured three axes:
- Move count: From a handful to over one million required steps (a back-of-envelope sketch follows this list).
- Chain-of-thought length: Token counts up to 100k in some GPT-4 Turbo tests.
- Inference compute: GPU clusters with NVLink-connected A100/H100 nodes, measuring latency vs. reasoning depth.
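These axes interact: merely writing out a full solution consumes tokens, so a fixed chain-of-thought budget runs out long before the hardest instances. A back-of-envelope sketch (the tokens-per-move figure below is our assumption for illustration, not a measurement from the paper):

```python
# At what disk count does simply *listing* a Tower of Hanoi solution
# exhaust a fixed output-token budget?
TOKENS_PER_MOVE = 7            # assumed cost of one "move disk X from A to C" line
for budget in (8_000, 32_000, 100_000):
    for n in range(1, 25):
        if (2**n - 1) * TOKENS_PER_MOVE > budget:   # minimal solution length is 2^n - 1
            print(f"{budget:>7,}-token budget overflows at N = {n}")
            break
# Under this assumption an 8k budget overflows at N = 11 and a 100k budget at N = 14,
# well short of the N = 20 instances Apple tested.
```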
Key findings
- On novel mathematical proofs, most LRMs scored <5%, matching the USAMO-based results; only one model reached 25%, and none produced a perfect solution across ~200 attempts.
- In Tower of Hanoi, SR models "overthink" simple instances (N≤3), continuing to explore alternatives after reaching a correct answer, and end up underperforming "standard" LLMs like GPT-4o, which solve these small cases cleanly.
- Moderately complex puzzles (N=5–9) saw SR models edge out standard models by 10–15% in correct move sequences.
- For N≥10, both model classes collapse to 0% accuracy: chains of thought terminate prematurely even though GPU memory and token budgets remain available.
- Apple identified a “counterintuitive scaling limit”: models increase deliberation tokens up to a complexity threshold, then sharply reduce them even with unused context window.
- Task-specific failure modes: Claude 3.7 Sonnet Thinking produced 100 correct Tower of Hanoi moves yet failed within the first five moves of a river-crossing puzzle requiring far fewer, implying that reasoning breakdowns are problem-dependent rather than purely computational (a move-by-move validator is sketched below).
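Findings like these rest on checking model output move by move rather than grading only a final answer. A minimal Tower of Hanoi validator in that spirit (our illustration of the general approach, not Apple's evaluation code):

```python
def validate_hanoi(n, moves):
    """Replay a proposed (source, target) move list and report the first illegal move."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}   # bottom-to-top, largest disk first
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, f"move {i}: peg {src} is empty"
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False, f"move {i}: larger disk placed on a smaller one"
        pegs[dst].append(pegs[src].pop())
    solved = pegs["C"] == list(range(n, 0, -1))
    return solved, "solved" if solved else "legal moves, but the goal was not reached"

print(validate_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")]))   # (True, 'solved')
```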
Expert reactions and competing interpretations
“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Gary Marcus wrote, noting that algorithmic solutions have been known since 1957 and that supplying the algorithm to SR models as explicit instructions did not improve their performance.
Marcus, a longtime neural network skeptic, sees Apple’s study as robust empirical support for his critique that LLMs lack out-of-distribution generalization. However, others argue Apple merely measured engineered constraints rather than fundamental reasoning deficits.
“If you tell me to solve a problem requiring an hour of pen and paper but only give me five minutes, I’ll resort to heuristics,” said Kevin A. Bryan, an economist at the University of Toronto, suggesting that RLHF penalizes long computations at inference time.
Software engineer Sean Goedecke pointed out on his blog that DeepSeek-R1 declines puzzles requiring >1,000 moves by design—choosing shortcuts over brute-force—thus conflating refusal with inability. Independent researcher Simon Willison added that context-window truncation in standard LLMs often caps chain-of-thought length, muddying conclusions about “reasoning.”
Additional analysis: Hardware and inference constraints
Modern LLM deployments run on GPU clusters with hundreds of gigabytes of HBM2e memory and high-bandwidth NVLink interconnects. Despite ample raw compute, each extra token in a chain-of-thought adds CPU-GPU synchronization overhead and inference latency. Benchmarks from OpenAI’s internal 128K context pilot show linear throughput degradation beyond 50k tokens, leading production systems to enforce token budgets of 8–32k to maintain <10ms per token latency. Apple’s paper exposes how such budgets translate to truncated reasoning chains under real-world QoS constraints.
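In practice those budgets show up as a hard cap on generated tokens, and a truncated chain of thought is visible in the API response. A minimal sketch using the OpenAI Python client (the model name and budget here are placeholders; other providers expose equivalent controls):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",          # placeholder model name
    max_tokens=8_000,        # production-style output budget
    messages=[{"role": "user",
               "content": "Solve Tower of Hanoi for 12 disks, listing every move."}],
)

choice = response.choices[0]
if choice.finish_reason == "length":     # the budget ran out before the answer finished
    print("Output truncated by the token budget.")
else:
    print(choice.message.content)
```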
Emerging paradigms: Neuro-symbolic hybrids
To overcome pure pattern-matching pitfalls, researchers are integrating symbolic engines via API calls or function-calling interfaces (a dispatch sketch follows the list below). For example:
- OpenAI Function Calling: Orchestrates external Python solvers for discrete search and planning.
- IBM NeSyCo: Combines LLM latent features with Prolog-style rule engines.
- Meta SymbolicGPT: Embeds a tiny SAT solver in the transformer’s attention heads to approximate backtracking.
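A common mechanism behind such hybrids is tool calling: the application advertises an exact solver to the model, executes whatever call the model emits, and feeds the result back as grounded context. A minimal, provider-agnostic sketch (the tool schema follows OpenAI's function-calling format; the solver and dispatcher are our illustration):

```python
import json

# Tool schema advertised to the model (OpenAI-style function-calling format).
HANOI_TOOL = {
    "type": "function",
    "function": {
        "name": "solve_hanoi",
        "description": "Return the minimal move list for an n-disk Tower of Hanoi.",
        "parameters": {
            "type": "object",
            "properties": {"n": {"type": "integer", "minimum": 1}},
            "required": ["n"],
        },
    },
}

def solve_hanoi(n: int) -> list:
    """Exact symbolic solver the LLM can delegate to instead of reasoning in free text."""
    moves = []
    def rec(k, src, dst, via):
        if k == 0:
            return
        rec(k - 1, src, via, dst)
        moves.append([src, dst])
        rec(k - 1, via, dst, src)
    rec(n, "A", "C", "B")
    return moves

def dispatch(tool_name: str, raw_arguments: str) -> str:
    """Route a model-emitted tool call to local code and serialize the result."""
    if tool_name == "solve_hanoi":
        args = json.loads(raw_arguments)
        moves = solve_hanoi(args["n"])
        return json.dumps({"moves": moves, "count": len(moves)})
    raise ValueError(f"unknown tool: {tool_name}")

# e.g., if the model emits solve_hanoi with arguments {"n": 4}:
print(dispatch("solve_hanoi", '{"n": 4}'))   # 15 moves, computed exactly
```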
Future directions and implications
Apple’s study and USAMO findings underscore that current chain-of-thought hacks are not a turnkey path to general intelligence. Potential research avenues include:
- Memory-Augmented Neural Architectures: External differentiable memory structures for storing intermediate states.
- Dynamic Algorithm Synthesis: Meta-learning frameworks that compose subroutines on the fly, analogous to program induction.
- Adaptive Token Budgeting: Reinforcement signals that trade off reasoning depth against latency in real time (a toy control loop is sketched below).
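The last idea can already be prototyped at the application layer: attempt a cheap, shallow reasoning pass, validate the result, and escalate the budget only on failure. A toy control-loop sketch (generate_answer and validate are placeholders for an LLM call and a task-specific checker such as the move validator above):

```python
def solve_with_adaptive_budget(prompt, generate_answer, validate,
                               budgets=(2_000, 8_000, 32_000)):
    """Escalate the reasoning-token budget only when a cheaper attempt fails validation.

    generate_answer(prompt, max_tokens) -> str   # placeholder wrapper around an LLM call
    validate(answer) -> bool                     # task-specific checker (e.g., a move validator)
    """
    for budget in budgets:
        answer = generate_answer(prompt, max_tokens=budget)
        if validate(answer):
            return answer, budget    # cheapest budget that produced a valid answer
    return None, budgets[-1]         # every attempt failed validation
```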
Conclusion
While Apple’s “Illusion of Thinking” study delivers a sobering assessment of LLM reasoning boundaries, it also highlights that these systems retain immense practical value in coding, technical writing, and ideation, provided users understand their token constraints, context-window limits, and propensity for confabulation. With continued advances in context length (128K-token windows like GPT-4 Turbo’s are already common) and neuro-symbolic integration, the next generation of AI may finally bridge the gap between pattern matching and genuine systematic reasoning.