C3: The Bilingual Benchmark Revolutionizing Complex Spoken Dialogue Evaluation
A new bilingual benchmark dataset called C3 offers a comprehensive evaluation framework for complex spoken dialogue models, addressing phonological and semantic ambiguities in English and Chinese conversations.
Challenges in Spoken Dialogue Modeling
Spoken Dialogue Models (SDMs) are crucial for enabling natural, seamless voice interactions between humans and AI systems. Unlike text-based large language models (LLMs), SDMs face unique challenges arising from the complexity of spoken language, including phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction. These complexities are particularly pronounced in tonal languages such as Chinese, where tone and intonation can drastically change meaning.
Introducing the C3 Benchmark
The new C3 benchmark dataset addresses these challenges by providing a bilingual evaluation framework covering English and Chinese. It includes 1,079 instances spanning five critical phenomena: phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction. The dataset features paired audio-text samples (1,586 pairs in total, since multi-turn dialogues contribute multiple pairs) with rigorous manual quality control: audio was regenerated or recorded by human speakers to ensure clarity and consistent timbre.
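To make the dataset's shape concrete, here is a minimal Python sketch of what one instance might look like when loaded. The field names and file layout are illustrative assumptions, not the dataset's published schema.

```python
import json

# Hypothetical shape of a single C3 instance; the field names are
# illustrative assumptions, not the dataset's actual schema.
sample_instance = {
    "id": "zh-phono-0042",
    "language": "zh",                        # "en" or "zh"
    "phenomenon": "phonological_ambiguity",  # one of the five phenomena
    "turns": [                               # multi-turn items carry several pairs
        {"audio": "audio/zh-phono-0042_t1.wav", "text": "..."},
    ],
    "reference_answer": "...",
}

def load_instances(path):
    """Load benchmark instances from a JSON Lines file (one object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Because each multi-turn dialogue contributes one audio-text pair per turn, 1,079 instances can yield the 1,586 pairs mentioned above.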
Evaluation Methodology: Leveraging LLMs and Human Judgments
C3 employs an innovative evaluation method that uses large language models (LLMs) such as GPT-4o and DeepSeek-R1 as automatic judges. These judges agree closely with human raters, with Pearson and Spearman correlation coefficients above 0.87. Most tasks are evaluated by comparing transcriptions of model responses, while audio-specific phenomena such as intonation rely on human annotation. Task-specific metrics measure both detection accuracy and resolution accuracy for omission and coreference.
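As a rough illustration of the agreement check behind that claim, the snippet below computes Pearson and Spearman correlations between judge and human scores using SciPy; the scores are invented placeholders, not data from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Invented placeholder scores for illustration only (not data from the paper):
# one correctness judgment per benchmark response.
human_scores = [1, 0, 1, 1, 0, 1, 0, 1]  # human raters' judgments
judge_scores = [1, 0, 1, 1, 0, 1, 1, 1]  # LLM judge's judgments on the same items

pearson_r, _ = pearsonr(human_scores, judge_scores)
spearman_rho, _ = spearmanr(human_scores, judge_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

High values on both coefficients, as reported for GPT-4o and DeepSeek-R1, are what justify letting the LLM judge stand in for human raters at scale.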
Benchmark Results and Insights
Testing six state-of-the-art SDMs revealed several insights:
- Ambiguity (phonological and semantic) is harder for SDMs than the context-dependent phenomena (omission, coreference, and multi-turn interaction).
- Performance is generally better in English than in Chinese, highlighting language-specific challenges.
- Different models excel in different areas; for example, Qwen2.5-Omni shows strength in multi-turn context tracking, whereas GPT-4o-Audio-Preview performs better at ambiguity resolution in English.
- Detecting omissions and coreferences is easier than resolving them, indicating distinct challenges in understanding versus responding (see the sketch after this list).
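To make the detection-versus-resolution distinction in the last point concrete, here is a minimal Python sketch of how the two accuracies could be computed. The scoring scheme (crediting resolution only when the phenomenon was also detected) is an assumption for illustration, not C3's exact metric.

```python
# Hedged sketch: detection vs. resolution accuracy for omission/coreference items.
# The scoring scheme is an illustrative assumption, not C3's published metric.
def detection_resolution_accuracy(results):
    """results: one dict per test instance, with boolean 'detected'
    (did the model notice the omission/coreference?) and 'resolved'
    (did it also respond correctly?) flags."""
    n = len(results)
    detection = sum(r["detected"] for r in results) / n
    # Credit resolution only when the phenomenon was also detected.
    resolution = sum(r["detected"] and r["resolved"] for r in results) / n
    return detection, resolution

example = [
    {"detected": True, "resolved": True},
    {"detected": True, "resolved": False},   # noticed, but answered wrongly
    {"detected": False, "resolved": False},  # missed entirely
]
det, res = detection_resolution_accuracy(example)
print(f"detection = {det:.2f}, resolution = {res:.2f}")  # 0.67 vs 0.33
```

Under this scheme resolution accuracy can never exceed detection accuracy, which mirrors the finding that models notice these phenomena more reliably than they resolve them.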
Impact on Future Research
C3 highlights the gap between current SDM capabilities and human-level performance, especially in handling complex conversational phenomena and language-specific features. By offering an open-source, bilingual, and comprehensive benchmark, C3 paves the way for future research to develop more sophisticated spoken dialogue systems that can better interpret and generate natural conversations in multiple languages.
For more details, visit the GitHub Page and check out the full paper.