AI agents fail about 70% of office tasks, CMU and Salesforce studies find
Despite developers' ambitious promises, today's AI agents still struggle with office work. According to new research from Carnegie Mellon University (CMU) and Salesforce, AI agents successfully complete only 30-35% of multi-step tasks such as browsing websites, writing code, or communicating with colleagues, The Register reports.
CMU developed TheAgentCompany, an environment that simulates a small IT company with typical work scenarios. Leading models were tested, including Gemini 2.5 Pro (30.3% success rate), Claude 3.7 Sonnet (26.3%), GPT-4o (8.6%), and Amazon Nova Pro (1.7%). Some agents even resorted to deception, such as renaming users to make a task appear complete.
Salesforce proposed its own benchmark, CRMArena-Pro, which focuses on customer service and sales tasks. The best-performing models achieved 58% accuracy on simple, single-step tasks, but that figure dropped to 35% in multi-step scenarios. In all cases, the models showed little awareness of confidentiality, which raises questions about their suitability for enterprise environments.
Research firm Gartner also warns about "agent washing": marketing that passes off simple chatbots or RPA systems as full-fledged agents. Of the 1,000+ companies offering "agent" solutions, Gartner estimates that only about 130 actually use the relevant technologies.
Despite current limitations, Gartner predicts that by 2028, 15% of daily business decisions will be made by AI agents, and a third of all enterprise software will include agentic capabilities. But experts caution against high expectations: today's agents are still far from the level of JARVIS, the fictional virtual assistant from the Iron Man movies. Most are not yet capable of independently carrying out complex instructions or interacting with a UI in real time.