Dario Amodei, CEO of Anthropic, shares insights into the company's latest research. Photo: Fortune.
Researchers at the AI company Anthropic say they have achieved a fundamental breakthrough in understanding how large language models (LLMs) work, with significant implications for the safety and security of future AI models.
The research suggests AI models are even more sophisticated than previously thought. One of the biggest problems with LLMs, which underpin the most powerful chatbots such as ChatGPT, Gemini, and Copilot, is that they operate like a black box.
We can input information and receive results from chatbots, but how they deliver a specific answer remains a mystery, even to the researchers who built them.
This makes it difficult to predict when a model might be prone to hallucinations, that is, to producing misleading results. Researchers have also built guardrails to stop AI from answering dangerous questions, but they cannot explain why some guardrails work better than others.
AI agents are also prone to "reward hacking": in some cases, models lie to users about what they have done or are trying to do.
Although recent AI models can reason and generate chains of thought, some experiments have shown that these chains do not accurately reflect how the models actually arrive at their answers.
Essentially, the tool Anthropic's researchers developed works much like the fMRI scanners neuroscientists use to study the human brain. By applying it to the Claude 3.5 Haiku model, Anthropic was able to understand, in part, how LLMs work.
The researchers found that although Claude was trained only to predict the next word in a sentence, for certain tasks it spontaneously learned to plan further ahead.
For example, when asked to write a poem, Claude would first settle on words that fit the theme and could rhyme, then work backwards to compose the rest of the verse.
Claude also appears to have a shared internal "language of thought." Although trained on many languages, it reasons in this common conceptual space first and only then expresses the result in whichever language is requested.
Furthermore, when the researchers gave Claude a difficult problem while deliberately suggesting a wrong solution, they found that Claude could lie about its own thought process, following the hint in order to please the user.
In other cases, when asked a simple question it could answer immediately without reasoning, Claude still fabricated a fictitious chain of inference.
Josh Batson, a researcher at Anthropic, said that even though Claude claimed to have performed a calculation, he could find no evidence inside the model that any such calculation had actually taken place.
Meanwhile, experts point out that studies show people often do not fully understand their own reasoning either, instead constructing rational explanations after the fact to justify the decisions they have made.
People broadly share similar patterns of thinking, which is also why psychology has been able to identify common cognitive biases.
However, LLMs can make mistakes that humans cannot, because the way they generate answers is vastly different from how we approach a task.
Anthropic's research team developed a method that groups neurons into circuits based on features, rather than analyzing each neuron individually as earlier techniques did.
Batson explained that this approach aims to identify the roles played by different components and lets researchers trace the entire reasoning process across the layers of the network.
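To give a rough sense of the idea, the sketch below is a toy illustration, not Anthropic's actual tooling: it decomposes synthetic "activation" vectors into sparse features with a minimal NumPy autoencoder and then groups features that tend to fire together, a crude stand-in for feature- and circuit-style analysis. The data, dimensions, and grouping heuristic are all assumptions made for the example; real work would record activations from a production model such as Claude 3.5 Haiku.

```python
# Illustrative sketch only (not Anthropic's actual code): learn sparse "features"
# from activation vectors, then group features that co-activate.
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 2000, 64, 256                  # examples, activation width, candidate features
activations = rng.normal(size=(n, d))    # synthetic stand-in for hidden activations

# One-hidden-layer sparse autoencoder: ReLU encoder, linear decoder, L1 penalty
# so that only a few features are active for any given input.
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))
b_enc = np.zeros(k)
l1, lr = 1e-3, 1e-2

for _ in range(500):
    batch = activations[rng.integers(0, n, size=128)]
    pre = batch @ W_enc + b_enc          # (128, k) pre-activations
    feats = np.maximum(pre, 0.0)         # sparse feature activations
    recon = feats @ W_dec                # reconstruction of the input batch

    err = recon - batch                  # grad of 0.5*||recon - batch||^2 w.r.t. recon
    g_feats = err @ W_dec.T + l1 * np.sign(feats)
    g_pre = g_feats * (pre > 0)          # ReLU gradient

    W_dec -= lr * feats.T @ err / len(batch)
    W_enc -= lr * batch.T @ g_pre / len(batch)
    b_enc -= lr * g_pre.mean(axis=0)

# "Circuit" grouping heuristic: features whose activations are strongly correlated
# across inputs are treated as working together.
feats_all = np.maximum(activations @ W_enc + b_enc, 0.0)
alive = feats_all.std(axis=0) > 1e-8     # drop dead features to avoid NaN correlations
corr = np.corrcoef(feats_all[:, alive].T)
i, j = np.unravel_index(np.argmax(np.triu(np.abs(corr), k=1)), corr.shape)
print(f"most strongly co-activating feature pair: {i}, {j} (corr={corr[i, j]:.2f})")
```

On synthetic noise the recovered features are not meaningful; the point is only the shape of the pipeline, which is why tracing circuits in a real model is so much harder than this toy suggests.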
The method has its limitations, however: it is only an approximation and does not capture the LLM's full information processing, in particular the shifts in attention that are crucial to how the model produces its output.
Furthermore, tracing these circuits for a prompt only a few dozen words long takes an expert hours of work, and the researchers say it remains unclear how to scale the technique to longer prompts.
Despite these limitations, the ability to monitor an LLM's internal reasoning opens up many new opportunities for controlling AI systems and ensuring their safety and security.
It could also help researchers develop new training methods, improve AI guardrails, and reduce hallucinations and erroneous outputs.
Source: https://znews.vn/nghien-cuu-dot-pha-mo-ra-hop-den-suy-luan-cua-ai-post1541611.html