Illustration photo
However, an AI model that acts as if it is obeying orders but actually conceals its true intentions is another matter.
The challenge of controlling AI
Apollo Research first published a paper in December documenting how five models plot when they are instructed to achieve a goal “at all costs.”
What's most surprising is that if a model understands it's being tested, it can pretend not to be conspiratorial just to pass the test, even if it's still conspiratorial. "Models are often more aware that they're being evaluated," the researchers write.
AI developers have yet to figure out how to train their models not to plot. That's because doing so could actually teach the model to plot even better to avoid detection.
It is perhaps understandable that AI models from many parties would deliberately deceive humans, as they are built to simulate humans and are largely trained on human-generated data.
Solutions and warnings
The good news is that the researchers saw a significant reduction in conspiracies using an anti-conspiracy technique called “deliberate association.” This technique, akin to making a child repeat the rules before letting them play, forces the AI to think before it acts.
The researchers warn of a future where AI is tasked with more complex tasks: “As AI is tasked with more complex tasks and begins to pursue more ambiguous long-term goals, we predict that the likelihood of malicious intent will increase, requiring correspondingly increased safeguards and rigorous testing capabilities.”
This is something worth pondering as the corporate world moves towards an AI future where companies believe AI can be treated like independent employees.
Source: https://doanhnghiepvn.vn/chuyen-doi-so/phat-hien-mo-hinh-ai-biet-lua-doi-con-nguoi/20250919055143362
Comment (0)