
A remarkable meeting where mathematicians seek to beat artificial intelligence.

The world's leading mathematicians secretly met to find a way to beat artificial intelligence (AI), but were astonished by AI's capabilities.

VietnamPlus | 20/05/2025

One weekend in mid-May, a closed-door meeting took place: thirty of the world's leading mathematicians quietly traveled to Berkeley, California, USA, to face off against a chatbot capable of "reasoning." The chatbot was tasked with solving problems the mathematicians themselves had devised, as a test of its problem-solving ability.

After bombarding it with professor-level questions for two straight days, the mathematicians were astonished to discover that the chatbot could solve some of the most difficult problems ever solved.

"I've seen colleagues say outright that this large-scale language model is approaching the level of mathematical genius," Ken Ono, a professor at the University of Virginia and the chair and judge of the meeting, told Scientific American.

The chatbot in question is built on o4-mini, a large language model (LLM) designed for complex reasoning. This OpenAI model is trained to carry out sophisticated chains of reasoning. Google's Gemini 2.5 Flash has comparable capabilities.

Like the LLMs behind earlier versions of ChatGPT, o4-mini learns to predict the next word in a string of text. The difference is that o4-mini is a lighter, more flexible model, trained on more specialized data with closer human tuning, which lets it dig into mathematical problems that earlier models could not reach.

To challenge and assess o4-mini's capabilities, OpenAI commissioned Epoch AI, a non-profit organization that specializes in benchmarking LLMs, to create 300 previously unpublished mathematics questions. Traditional LLMs can solve many complex problems, yet when confronted with entirely new questions, most answered fewer than 2% correctly, suggesting they lack genuine reasoning ability.

For its latest evaluation project, Epoch AI recruited Elliot Glazer, a recent mathematics PhD, as its lead. The project, called FrontierMath, was launched in September 2024.

The project collected new questions across four difficulty levels, ranging from undergraduate and graduate level to deep research. By April 2025, Glazer found that o4-mini could solve about 20% of the problems, so he moved straight to level 4: problems that even highly accomplished mathematicians would struggle with.

Participants were required to sign a confidentiality agreement and to communicate only through the encrypted messaging app Signal, since email could be scanned and its content "learned" by an LLM, contaminating the evaluation data.

For every problem that o4-mini could not solve, the problem's author would receive a $7,500 prize.

The initial working group produced questions slowly but steadily, so Glazer decided to speed things up by organizing a two-day in-person meeting on May 17–18. Thirty mathematicians attended, split into groups of six, competing with one another not to solve problems but to devise problems the AI could not solve.

By the evening of May 17, Ken Ono was growing frustrated with the chatbot, whose mathematical ability far exceeded expectations and which was proving difficult to "trap." "I came up with a problem that experts in my field would recognize as an open question in number theory, a good PhD-level problem," he recounted.

When he posed it to o4-mini, he was stunned to watch the chatbot analyze, reason, and reach the correct solution in just ten minutes. In the first two minutes it tracked down and absorbed the relevant literature; it then suggested trying a simpler version of the problem first to learn the approach.

Five minutes later, the chatbot delivered the correct answer in a confident, even somewhat arrogant, tone. “It started getting cheeky,” Ono recounted. “It even added: ‘No citation needed, I computed the mystery number myself!’”

Having been bested by the AI, Ono sent an alert message to the team via Signal on the morning of May 18. “I was completely unprepared to deal with a model like this,” he said. “I had never seen this kind of reasoning in a computer model. It thinks the way a real scientist thinks. And that was terrifying.”

Although the mathematicians eventually managed to find ten questions that stumped o4-mini, they could not hide their astonishment at how far AI had advanced in just one year.

Ono compared working with o4-mini to collaborating with an extremely talented colleague. Yang-Hui He, a mathematician at the London Institute for Mathematical Sciences and a pioneer in applying AI to mathematics, commented: “This is what a very, very good graduate student can do — and more than that.”

It is also worth noting that the AI works far faster than humans: problems that would take human researchers weeks or months to solve, o4-mini dispatched in a few minutes.

The excitement of the battle of wits with o4-mini came with considerable concern. Both Ono and He warned that o4-mini's capabilities could breed overconfidence. “We have proof by induction, proof by contradiction, and now proof by intimidation,” He said. “If you state something with enough confidence, people get intimidated. I think o4-mini has mastered that kind of proof: everything it says, it says with great certainty.”

As the meeting concluded, the mathematicians began to ponder the future of their field. They discussed the possibility of a “fifth level”: questions that even the world’s best mathematicians cannot solve. If AI reaches that level, the mathematician’s role would change dramatically; mathematicians might become question-posers, interacting with and guiding the AI’s reasoning to uncover new mathematical truths, much as a professor works with graduate students.

“I’ve been telling my colleagues for a while now that it would be a grave mistake to assume that artificial general intelligence will never arrive, that it’s just a computer,” Ono said. “I don’t want to add to the panic, but in some respects these large language models have already begun to outperform most of the world’s best graduate students.”

(Vietnam+)

Source: https://www.vietnamplus.vn/cuoc-gap-go-dac-biet-noi-cac-nha-toan-hoc-tim-cach-danh-bai-tri-tue-nhan-tao-post1043183.vnp

