
Salesforce AI Launches CRMArena-Pro: The First Enterprise-Grade Multi-Turn Benchmark for LLM Agents

Salesforce AI has introduced CRMArena-Pro, the first enterprise-grade benchmark that tests LLM agents on complex multi-turn business tasks spanning sales, customer service, and Configure-Price-Quote (CPQ), while also assessing how they handle confidential information.

The Challenge of Evaluating AI Agents in Business

AI agents powered by large language models (LLMs) hold significant promise for managing complex business tasks, particularly in Customer Relationship Management (CRM). However, it has been difficult to evaluate their real-world effectiveness due to the absence of publicly available and realistic business datasets. Existing benchmarks typically focus on simple, single-turn interactions or narrow applications like customer service, overlooking broader business domains such as sales, Configure-Price-Quote (CPQ) processes, and B2B operations. Moreover, these benchmarks often neglect to assess how well AI agents handle sensitive information, a crucial factor in enterprise environments.

Limitations of Previous Benchmarks

Most prior benchmarks have concentrated on customer service tasks within B2C scenarios, leaving out critical business functions such as sales and CPQ. They also fail to capture the complexity of B2B interactions, which often involve longer sales cycles. Many benchmarks lack realism by ignoring multi-turn dialogues and skipping expert validation of tasks and environments. A significant gap is the lack of evaluation for confidentiality awareness, which is vital due to the sensitive nature of business and customer data handled by AI agents. Without this, privacy, legal risks, and trust issues remain unaddressed.

Introducing CRMArena-Pro

Salesforce AI Research has developed CRMArena-Pro, a new benchmark designed to realistically evaluate LLM agents built on models such as Gemini 2.5 Pro in professional business settings. The benchmark includes expert-validated tasks covering customer service, sales, and CPQ across both B2B and B2C contexts. It tests multi-turn conversations and evaluates agents' awareness and handling of confidential information.

Benchmark Features and Findings

CRMArena-Pro is constructed using synthetic yet structurally accurate enterprise data generated with GPT-4, based on Salesforce schemas. It simulates business environments via sandboxed Salesforce Organizations and includes 19 tasks grouped into four key skill areas: database querying, textual reasoning, workflow execution, and policy compliance. Multi-turn dialogues with simulated users and confidentiality awareness testing are integral parts of the benchmark. Expert evaluations have confirmed the realism and reliability of the data and environments used.

In evaluations, top LLM agents were tested across the 19 business tasks with a focus on task completion and confidentiality handling. Metrics included exact match for structured outputs and F1 scores for generative responses. A GPT-4o-based LLM judge was used to determine whether models appropriately refused to share sensitive information. Advanced reasoning models such as Gemini 2.5 Pro and o1 outperformed lighter or non-reasoning models, especially on complex tasks. Performance was broadly similar across B2B and B2C settings, with stronger models showing the clearest gains. Confidentiality-aware prompting improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance.
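The article does not show the benchmark's exact scoring code, but the two metrics it names are standard. A minimal sketch of how exact match and token-level F1 are commonly computed (the normalization choices here are illustrative assumptions, not CRMArena-Pro's actual implementation):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    # Exact match after simple whitespace/case normalization,
    # typically used for structured outputs (IDs, field values).
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over
    # overlapping tokens, a common metric for generative answers.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only fully correct structured answers, while token F1 gives partial credit when a generated response overlaps the reference without matching it verbatim.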

Performance Insights

Top-performing models achieved around 58% accuracy on single-turn tasks, but performance dropped to approximately 35% in multi-turn conversations. Workflow execution was the easiest skill area, with Gemini 2.5 Pro exceeding 83% accuracy. However, confidentiality handling remains a significant challenge for all models tested. These results reveal a substantial gap between the current capabilities of LLM agents and the complex demands of enterprise business scenarios.

Further Resources

For more detailed information, check the [Paper], [GitHub Page], [Hugging Face Page], and the [Technical Blog]. This research was conducted by the Salesforce AI Research team.
