Apple's new research on large reasoning models is attracting attention. Photo: MacRumors.
Just three years after its public debut, artificial intelligence has worked its way into everyday activities such as studying and working. Many people fear it will not be long before it can replace humans.
But new AI models aren't as smart as we think they are. A finding from a major tech company has helped to reinforce that belief.
Too difficult? Just give up
In a newly published study titled "The Illusion of Thinking," Apple's research team argues that reasoning models such as Claude, DeepSeek-R1, and o3-mini do not actually "think" the way their names suggest.
In the team's view, "reasoning" would be better described as "imitation": these models are simply efficient at memorizing and repeating patterns. But when the question is rephrased or the complexity increases, they all but collapse.
More simply, chatbots work well when they can recognize and match patterns, but once the problem becomes too complex, they can’t handle it. “State-of-the-art Large Reasoning Models (LRMs) suffer from a complete collapse in accuracy when complexity exceeds a certain threshold,” the study notes.
This runs counter to developers' expectation that performance on complex problems will keep improving with more resources. "Reasoning effort increases with complexity, but only up to a point, and then decreases even when there is still enough token budget (computational capacity) to handle it," the study adds.
In the study, the researchers turned the usual question-and-answer benchmarks on their head. Instead of the familiar math tests, they used carefully designed puzzles: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.
Each puzzle has simple, clear rules, and its complexity can be scaled up by adding more disks, blocks, or agents. The reasoning models performed better at medium difficulty but were beaten by standard models on easy problems. Notably, at high difficulty everything fell apart completely, as if the AI had simply given up.
In the Tower of Hanoi problem, performance barely improved even when the team "fed" the models the exact solving algorithm. Some models could produce on the order of 100 correct moves in that game, yet failed to get past five moves in the River Crossing puzzle.
In the Tower of Hanoi, the player must move a stack of disks to another peg while keeping them in order of size. Photo: Wikipedia.
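To see why such puzzles make convenient difficulty dials, consider the Tower of Hanoi: the optimal solution is a short recursive procedure, yet the number of required moves roughly doubles with every added disk. The sketch below is a minimal illustration in Python (not code from the Apple paper; the function name and setup are assumptions) of that classic algorithm and how quickly the move count grows.

```python
# Minimal sketch (illustrative, not from the Apple study): the classic
# recursive Tower of Hanoi solution, the kind of algorithm the researchers
# reportedly handed to the models.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the list of moves that transfers n disks from source to target."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )

if __name__ == "__main__":
    for n in (3, 7, 10):
        moves = hanoi_moves(n)
        # The optimal solution needs 2**n - 1 moves, so difficulty grows
        # exponentially as disks are added -- which is how the study
        # scales puzzle complexity.
        print(f"{n} disks -> {len(moves)} moves (expected {2**n - 1})")
```

A system that genuinely follows the algorithm should scale to large disk counts, whereas one that mostly pattern-matches tends to break down once the move sequence becomes long, which is the kind of collapse the study reports.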
This points to weak reasoning performance, as well as poor consistency, in LRMs. Amid much debate about AI's ability to match humans, the new Apple study argues otherwise.
Apple's discovery is not new
Gary Marcus, an American psychologist and author, said Apple's findings were impressive, but not really new and merely reinforced previous research. The professor emeritus of psychology and neuroscience at New York University cited his 1998 study as an example.
In it, he argues that neural networks, the precursors to large language models, can generalize well within the distribution of the data they were trained on, but often collapse when faced with data outside the distribution.
He also cites arguments made in recent years by Arizona State University computer scientist Subbarao Kambhampati. Professor Rao believes that "chains of thought" and "reasoning models" are inherently less reliable than many people think.
"People tend to over-anthropomorphize the reasoning traces of large language models, calling them 'thoughts' when they may not deserve that name," says the professor, who has written a series of papers on how the chains of thought generated by LLMs do not always accurately reflect what the models actually do.
The new Apple research shows that even the latest generation of reasoning models is unreliable outside its training data. Marcus says LLMs and LRMs both have their uses and are helpful in some cases, but users should not completely trust any of their results.
Source: https://znews.vn/apple-doi-gao-nuoc-lanh-vao-ai-suy-luan-post1559526.html