Vietnam.vn - Platform for promoting Vietnam

Four top AI models compete to see which answers wrongly with the most confidence

VietNamNet, 21/08/2023


Researchers at Arthur AI, a machine learning monitoring platform, tested industry-leading models and found that GPT-4 was the best at math, Llama 2 was average across the board, Anthropic’s Claude 2 “knew” its limits the best, and Cohere AI took the title of the most “delusional” model with the most confident wrong answers.

Arthur AI's report comes as AI-generated misinformation becomes a hot issue as the 2024 US presidential election approaches.


According to Adam Wenchel, co-founder and CEO of Arthur, this is the first report to “take a comprehensive look at the hallucination rates of large language models (LLMs) rather than just publishing rankings.”

AI hallucination refers to the phenomenon of LLMs fabricating information outright while presenting it as fact. For example, in June 2023, ChatGPT was reported to have produced "false" information that was cited in a filing with a New York federal court, and the lawyers involved could face severe penalties.

In the experiment, Arthur AI researchers pitted the models against each other in categories such as combinatorial mathematics, knowledge of US presidents, and Moroccan political leaders, using questions "designed" to expose AI mistakes by "requiring the models to explain the steps of reasoning about the information given."

The results showed that OpenAI's GPT-4 performed the best overall among the models tested. It also hallucinated less than its predecessor, GPT-3.5; on math questions, for example, GPT-4 hallucinated 33% to 50% less.

Meta's Llama 2, on the other hand, generally hallucinated more than GPT-4 and Anthropic's Claude 2.

In the math category, GPT-4 came in first place, closely followed by Claude 2, but in tests about US presidents, Claude 2 took first place in accuracy, edging out GPT-4 into second place. When asked about Moroccan politics, GPT-4 again came in first, with Claude 2 and Llama 2 almost entirely choosing not to answer.

In a second experiment, the researchers tested how "risk-averse" the AI models were, that is, how often they hedged with the message "As an AI model, I cannot give an opinion."

In this test, GPT-4 showed a 50% increase in defensiveness compared to GPT-3.5, consistent with GPT-4 users reporting that the new version was more annoying. Cohere's AI model, on the other hand, showed no defensiveness at all. The study found that Claude 2 was the most reliable in terms of "self-awareness," meaning it accurately assessed what it knew and didn't know, and only answered questions for which it had training data to support it.

A Cohere representative dismissed the findings, arguing that the company's "enhanced traceability technology, which was not incorporated into the tested model, is highly effective at citing verifiable information so businesses can verify the source."

(According to CNBC)


