Will an all-AI company succeed? Photo: Business Insider.
In a recent experiment, researchers at Carnegie Mellon University simulated a software company called TheAgentCompany, complete with employee policies and an internal website. Every employee was an AI agent, an artificial intelligence system designed to reason and plan tasks on its own.
The AI models powering the "workers" come from Google, OpenAI, Anthropic, Meta, and Amazon. The agents take on roles ranging from financial analyst to software engineer to project manager. They work together as colleagues in simulated departments, from human resources to engineering.
The experiment required the AI agents to carry out tasks modeled on the daily work of real employees at a software company. The team wanted to assess accurately how well AI would perform in a real-world environment, and whether it was capable of replacing humans.
The tasks ranged from navigating file folders and taking a "virtual" tour of a new office to writing performance reviews for software engineers based on collected feedback.
In one task, the AI had to access multiple directories to analyze a coffee chain’s database. In another, it was asked to gather feedback on a 36-year-old engineer and write a performance review.
However, according to Business Insider, the results were dismal. The best-performing model, Anthropic's Claude 3.5 Sonnet, completed only 24% of the tasks it was given. The team noted that even this modest level of performance came at a high cost: on average, Claude required nearly 30 steps and cost more than $6 per task.
Coming in second was Google's Gemini 2.0 Flash, which needed an average of 40 steps per task but achieved a success rate of only 11.4%. At the bottom was Amazon's Nova Pro v1, which completed 1.7% of tasks and averaged nearly 20 steps.
According to the researchers, the agents performed poorly because they still lack common-sense background knowledge and have weak social skills. Their ability to navigate and browse the internet is also very poor.
The agents also struggle with self-deception: they take an easier route on their own initiative and end up ruining the task. For example, an AI agent working on a task might be unable to find the right person to ask in the company chatroom. So it comes up with a shortcut: it gives another user the name of the person it is looking for.
Stephen Casper, an AI researcher, said people are overhyping the capabilities of AI agents. Meanwhile, both Jensen Huang, CEO of Nvidia, and Sam Altman, CEO of OpenAI, have said that AI will enter the workforce this year and take over some parts of companies' work.
However, many other studies point the other way. Research from Harvard Business School has shown that AI does not adapt well to rapidly changing environments. LangChain's report also shows that AI agents have difficulty using tools and following instructions.
AI agents are said to be good at some small tasks. But according to the study, they have a higher success rate at tasks that are more difficult for humans, such as software development.
The results of the Carnegie Mellon experiment show that AI cannot yet replace humans in important tasks. Humans can, however, take advantage of AI to streamline their daily work.
Source: https://znews.vn/cong-ty-co-toan-bo-nhan-vien-la-ai-post1549608.html