Microsoft and Salesforce Reveal Major Performance Drop of LLMs in Real Multi-Turn Conversations
New research from Microsoft and Salesforce shows that large language models suffer an average 39% performance drop when instructions arrive incomplete and spread across the turns of a realistic multi-turn conversation, highlighting a key challenge for conversational AI.
Challenges of Conversational AI with Multi-Turn Instructions
Conversational AI aims to enable large language models (LLMs) to understand and respond to user needs that are revealed progressively over multiple turns. Unlike single-turn prompts, where all information is given upfront, real conversations unfold incrementally, requiring models to maintain context and adapt as new details arrive. LLMs often struggle with this setup: they make early assumptions about the missing parts of an instruction and then cling to them, producing persistent errors and misguided responses.
Limitations of Current Evaluation Methods
Most existing evaluations focus on single-turn, fully specified prompts, or treat multi-turn tasks as sequences of isolated, episodic subtasks. These approaches overlook the complexity of real dialogues, in which information is fragmented and must be integrated over time. As a result, current benchmarks fail to capture how models struggle with underspecified inputs distributed across multiple turns.
The Sharded Simulation Method
Researchers from Microsoft and Salesforce introduced sharded simulation, an approach that mimics how users disclose information gradually in real conversations. A complete instruction is split into smaller, logically connected parts, or "shards," that are delivered sequentially. An LLM-powered simulated user reveals these shards in natural phrasing, while each of the assistant's responses is classified to determine whether it attempts a solution or asks for clarification.
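A minimal sketch of what such a simulation loop might look like is below. The callables `simulated_user`, `assistant`, and `classify` are hypothetical stand-ins for the simulated user, the model under test, and the response classifier; this is an illustration of the general protocol, not the authors' actual harness.

```python
def run_sharded_simulation(shards, simulated_user, assistant, classify):
    """Drive one sharded conversation; `shards` must be non-empty."""
    history = []
    for i, shard in enumerate(shards):
        # The simulated user phrases the next shard as a natural turn.
        user_turn = simulated_user(shard, history)
        history.append({"role": "user", "content": user_turn})

        reply = assistant(history)
        history.append({"role": "assistant", "content": reply})

        # Label the reply: does the model attempt a full solution,
        # or does it ask a clarifying question?
        label = classify(reply)

        # A full attempt before all shards are revealed is a common
        # failure mode: the model commits to assumptions that later
        # shards contradict.
        if label == "answer_attempt" and i < len(shards) - 1:
            return reply, history, "premature_attempt"

    return reply, history, "completed"
```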
Extensive Testing Across Tasks and Models
The study tested 15 LLMs on six generation tasks: coding, SQL queries, API actions, math problems, data-to-text generation, and document summarization. Drawing on well-known datasets such as GSM8K, Spider, and ToTTo, the researchers ran over 200,000 simulations comparing single-turn, fully specified instructions against multi-turn sharded inputs.
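To make the setup concrete, here is an invented example (not drawn from the paper's datasets) of how one fully specified SQL request could be decomposed into sequentially revealed shards:

```python
# An invented illustration of sharding: one fully specified request
# vs. the fragments a simulated user would reveal turn by turn.

full_instruction = (
    "Write a SQL query that returns each EU customer's name and total "
    "2023 revenue, sorted by revenue in descending order."
)

shards = [
    "I need a SQL query over our customers data.",          # high-level intent
    "It should return each customer's name and revenue.",   # output columns
    "Only include customers in the EU region.",             # filter
    "Count only revenue from 2023.",                        # time window
    "Sort the results by revenue, highest first.",          # ordering
]
```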
Significant Performance Decline in Multi-Turn Scenarios
Across models and tasks, average performance fell from 90% on single-turn instructions to 65% on multi-turn sharded inputs, a 25-point decline. The degradation reflected a sharp rise in unreliability rather than a loss of underlying capability: unreliability increased by 112%, meaning models produced wildly inconsistent outputs across repeated runs of the same fragmented task. Even leading models such as GPT-4.1 and Gemini 2.5 Pro suffered 30-40% performance degradation. Mitigation attempts, such as allocating more test-time computation or lowering sampling randomness, had minimal effect.
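As a rough illustration of how a capability/reliability split can be measured, the sketch below computes an aptitude score and an unreliability gap from repeated runs of one task, assuming percentile-based definitions (best-case score vs. the spread between best- and worst-case runs). The paper's exact formulas may differ, and the scores here are invented.

```python
import statistics

# Invented scores from 10 repeated runs of one sharded task (1.0 = solved).
scores = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

deciles = statistics.quantiles(scores, n=10)   # 10th..90th percentiles
aptitude = deciles[8]                    # best-case capability (~90th pct)
unreliability = deciles[8] - deciles[0]  # gap between best and worst runs

print(f"aptitude={aptitude:.2f}, unreliability={unreliability:.2f}")
# -> aptitude=1.00, unreliability=1.00: the model *can* solve the task,
#    but cannot do so dependably, matching the reported failure pattern.
```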
Implications for Future AI Development
This research highlights the urgent need to improve LLMs’ reliability in real-world conversations where task details emerge gradually. The sharded simulation exposes fundamental weaknesses in current models’ ability to handle underspecified, evolving instructions. Enhancing multi-turn understanding is essential for advancing conversational AI applications that interact naturally and effectively with users over extended dialogues.
For further details, refer to the original paper and GitHub repository.