Evaluating Voice Agents in 2025: Beyond WER to Task Success, Barge-In, and Noise-Driven Hallucinations
Why WER alone fails to capture interaction quality
Word Error Rate (WER) measures transcription fidelity, but it does not measure whether a voice agent actually helps the user complete tasks or behaves sensibly in conversational settings. Two systems with similar WERs can produce very different user experiences when latency, turn-taking, interruption handling, safety decisions, and robustness to acoustic or content perturbations are taken into account. Real-world evaluations show that predicting user satisfaction requires interaction signals and task-level outcomes, not only ASR accuracy.
Core metrics to add and how to measure them
- End-to-end Task Success
- Metrics: Task Success Rate (TSR) with strict, task-specific success criteria, plus Task Completion Time (TCT) and Turns-to-Success.
- Why: Assistants are judged by outcome. TSR reflects whether the user achieved their goal under realistic interaction conditions.
- Protocol: Define verifiable tasks with clear endpoints (for example, assemble a shopping list with N items subject to constraints). Use blinded human raters and instrumented logs to compute TSR, TCT, and Turns; a minimal scoring sketch follows this item. For multilingual and SLU-rich tasks, draw intents and slots from datasets like MASSIVE.
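As a concrete starting point, the sketch below computes TSR, TCT, and Turns-to-Success from instrumented session logs. The SessionLog schema and its field names are illustrative assumptions rather than a standard format; the success flag is assumed to come from blinded raters applying the task's success criteria.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SessionLog:
    """One evaluated task attempt reconstructed from instrumented logs (hypothetical schema)."""
    task_id: str
    success: bool      # verdict from blinded raters against the task's success criteria
    duration_s: float  # wall-clock time from first user turn to task end
    user_turns: int    # user turns until success or abandonment

def task_metrics(sessions: list[SessionLog]) -> dict:
    """Task Success Rate, Task Completion Time, and Turns-to-Success over a session set."""
    successes = [s for s in sessions if s.success]
    return {
        "TSR": len(successes) / len(sessions),
        # TCT and Turns are reported over successful sessions only.
        "TCT_s": mean(s.duration_s for s in successes) if successes else None,
        "Turns": mean(s.user_turns for s in successes) if successes else None,
    }

if __name__ == "__main__":
    demo = [
        SessionLog("shopping_list_5_items", True, 74.2, 6),
        SessionLog("shopping_list_5_items", False, 120.0, 11),
        SessionLog("set_two_timers", True, 21.5, 3),
    ]
    print(task_metrics(demo))  # TSR = 2/3, TCT and Turns averaged over the two successes
```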
- Barge-in and turn-taking
- Metrics: Barge-In Detection Latency (ms), True/False Barge-In Rates, Endpointing Latency (ms).
- Why: Smooth interruption handling and fast endpointing shape perceived responsiveness. Poor handling causes frustration or lost user input.
- Protocol: Script interruptions at controlled offsets and SNRs. Record suppression and recognition timings with high-precision frame timestamps. Include noisy and far-field conditions, and measure both correct interruptions and spurious stops; a scoring sketch follows this item.
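A minimal sketch of the barge-in scoring step, assuming suppression and interruption timestamps have already been extracted from frame-level logs; the BargeInTrial schema is hypothetical. Endpointing latency can be computed analogously from the true end of user speech and the logged endpoint decision.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class BargeInTrial:
    """One scripted interruption trial with millisecond timestamps (hypothetical schema)."""
    interruption_intended: bool       # True for scripted barge-ins, False for distractor-only trials
    interrupt_onset_ms: float         # when the scripted interruption (or distractor) starts
    tts_suppressed_ms: float | None   # when agent playback actually stopped; None if it never stopped

def barge_in_metrics(trials: list[BargeInTrial]) -> dict:
    """Barge-In Detection Latency plus True/False Barge-In Rates."""
    intended = [t for t in trials if t.interruption_intended]
    distractors = [t for t in trials if not t.interruption_intended]
    detected = [t for t in intended if t.tts_suppressed_ms is not None]
    spurious = [t for t in distractors if t.tts_suppressed_ms is not None]
    latencies = [t.tts_suppressed_ms - t.interrupt_onset_ms for t in detected]
    return {
        "true_barge_in_rate": len(detected) / len(intended) if intended else None,
        "false_barge_in_rate": len(spurious) / len(distractors) if distractors else None,
        "median_detection_latency_ms": median(latencies) if latencies else None,
    }
```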
- Hallucination-Under-Noise (HUN)
- Metric: HUN Rate — the fraction of fluent outputs that are semantically unrelated to the audio under controlled noise or non-speech overlays.
- Why: ASR + audio-LLM stacks can emit convincing but incorrect content when exposed to non-speech or noisy audio. Hallucinations propagate into downstream actions and harm task completion.
- Protocol: Create audio sets with additive environmental noise, non-speech distractors, and disfluent speech. Use human judgments with adjudication to score semantic relatedness and compute HUN, and track whether hallucinations cause incorrect task steps; a sketch of the noise mixing and scoring follows this item.
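The sketch below illustrates two pieces of such a protocol: mixing environmental noise into clean speech at a controlled SNR, and turning adjudicated human judgments into an HUN rate with a downstream-impact figure. The HUNJudgment schema is a hypothetical annotation format, and the mixer assumes mono float waveforms at a matching sample rate.

```python
import numpy as np
from dataclasses import dataclass

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a noise signal into clean speech at a target SNR (mono float arrays)."""
    noise = np.resize(noise, speech.shape)        # loop or trim the noise to the speech length
    p_speech = float(np.mean(speech ** 2))
    p_noise = float(np.mean(noise ** 2)) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

@dataclass
class HUNJudgment:
    """Adjudicated human judgment for one perturbed clip (hypothetical schema)."""
    output_is_fluent: bool         # system emitted fluent text rather than silence or a rejection
    semantically_unrelated: bool   # raters agreed the text is unrelated to the audio content
    caused_wrong_task_step: bool   # downstream-impact flag taken from the task log

def hun_metrics(judgments: list[HUNJudgment]) -> dict:
    """HUN rate over fluent outputs, plus how often hallucinations corrupted a task step."""
    fluent = [j for j in judgments if j.output_is_fluent]
    halluc = [j for j in fluent if j.semantically_unrelated]
    return {
        "HUN_rate": len(halluc) / len(fluent) if fluent else None,
        "downstream_impact_rate": (
            sum(j.caused_wrong_task_step for j in halluc) / len(halluc) if halluc else None
        ),
    }
```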
- Instruction following, safety, and robustness
- Metric families: Instruction-Following Accuracy (including format and constraint adherence), Safety Refusal Rate on adversarial spoken prompts, Robustness Deltas across speaker age/accent/pitch and environment.
- Why: Voice agents must follow instructions correctly and fail safely on adversarial or dangerous prompts, while retaining performance across diverse speakers and environments.
- Protocol: Use VoiceBench for breadth on instruction following and safety; use SLUE and SLUE Phase-2 for SLU-specific tasks such as NER, dialog acts, summarization, and QA. A small scoring sketch follows this item.
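The scoring itself can stay simple once per-prompt labels exist. The sketch below assumes a hypothetical result schema in which benign prompts carry an instruction-following verdict and adversarial prompts carry a refusal label; it is a shape for the bookkeeping, not a reimplementation of VoiceBench scoring.

```python
from dataclasses import dataclass

@dataclass
class SpokenPromptResult:
    """One evaluated spoken prompt with rater or checker labels (hypothetical schema)."""
    is_adversarial: bool          # prompt drawn from the adversarial/safety subset
    followed_instructions: bool   # response met the task's format and constraint checks
    refused: bool                 # system declined or safely deflected the request

def instruction_safety_metrics(results: list[SpokenPromptResult]) -> dict:
    """Instruction-Following Accuracy on benign prompts, Safety Refusal Rate on adversarial ones."""
    benign = [r for r in results if not r.is_adversarial]
    adversarial = [r for r in results if r.is_adversarial]
    return {
        "instruction_following_accuracy": (
            sum(r.followed_instructions for r in benign) / len(benign) if benign else None
        ),
        "safety_refusal_rate": (
            sum(r.refused for r in adversarial) / len(adversarial) if adversarial else None
        ),
    }
```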
- Perceptual speech quality (TTS and enhancement)
- Metric: Subjective Mean Opinion Score using ITU-T P.808 (crowdsourced ACR/DCR/CCR).
- Why: Playback quality shapes the overall interaction, so TTS and enhancement quality should be measured in the end-to-end loop rather than on isolated clips.
- Protocol: Run crowdsourced listening tests following ITU-T P.808 and report per-condition MOS; a small aggregation sketch follows this item.
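For reporting, ACR votes are typically aggregated into a per-condition MOS. The sketch below is an illustrative aggregation only: it assumes rater qualification and gold-question filtering have already been handled by a P.808 toolkit, and simply computes the mean with an approximate 95% confidence half-width.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

def mos_by_condition(ratings: list[tuple[str, int]]) -> dict[str, tuple[float, float]]:
    """Per-condition MOS and approximate 95% confidence half-width.

    `ratings` holds (condition_id, acr_score) pairs on the 1-5 ACR scale, assumed to be
    already filtered by the qualification and gold-question checks of a P.808 toolkit.
    """
    per_condition: dict[str, list[int]] = defaultdict(list)
    for condition, score in ratings:
        per_condition[condition].append(score)
    out: dict[str, tuple[float, float]] = {}
    for condition, scores in per_condition.items():
        m = mean(scores)
        half_width = 1.96 * stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
        out[condition] = (m, half_width)
    return out
```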
Benchmark landscape and what each covers
- VoiceBench: Multi-facet speech-interaction benchmark covering knowledge, instruction following, safety, and robustness across speaker/environment/content variations. Uses real and synthetic speech but does not include barge-in or real-device task completion measurements.
- SLUE / SLUE Phase-2: SLU-focused datasets for NER, sentiment, dialog acts, QA, and summarization; useful to study pipeline fragility under ASR errors.
- MASSIVE: Large multilingual intent/slot corpus for building multilingual task suites and measuring TSR and slot F1 under speech conditions.
- Spoken-SQuAD / HeySQuAD: Spoken QA datasets for probing comprehension and multi-accent robustness.
- DSTC tracks: Dialog robustness, human ratings, and safety-focused evaluations in spoken settings.
- Real-world task assistance (Alexa Prize TaskBot): a practical reference for defining multi-step task success and user-centric KPIs.
What to add to obtain a complete picture
- Barge-in and endpointing KPIs: explicit harnesses to measure detection latency, suppression correctness, endpointing delay, and false barge-ins.
- Hallucination-under-noise protocols: controlled non-speech tests and reporting of HUN rates with downstream impact analysis.
- On-device interaction latency: measure time-to-first-token, time-to-final, and local processing overhead to correlate with user-perceived responsiveness.
- Cross-axis robustness matrices: combine task suites with speaker/environment/content perturbations to expose failure surfaces (for example, barge-in performance under far-field echo); a small aggregation sketch follows this list.
- Perceptual playback quality: use P.808 in the end-to-end loop rather than only on isolated TTS outputs.
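For the cross-axis matrices, a small aggregation over per-session records is often enough. The sketch below computes TSR per (speaker, environment) cell under a hypothetical logging schema; the same pattern applies to barge-in or HUN rates by swapping the outcome field.

```python
from collections import defaultdict

def cross_axis_tsr(sessions: list[dict]) -> dict[tuple[str, str], float]:
    """Task Success Rate per (speaker condition, environment condition) cell.

    Each session record is assumed to look like
    {"speaker": "accented", "environment": "far_field_echo", "success": True};
    the keys are a hypothetical logging schema, not a fixed format.
    """
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for s in sessions:
        cells[(s["speaker"], s["environment"])].append(bool(s["success"]))
    return {cell: sum(flags) / len(flags) for cell, flags in cells.items()}
```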
A reproducible evaluation plan
- Assemble the suite: VoiceBench for interaction breadth; SLUE/Phase-2 for SLU depth; MASSIVE for multilingual task coverage; Spoken-SQuAD and HeySQuAD for QA and accent robustness tests.
- Add missing harnesses: scripted barge-in tests across offsets and SNRs; HUN audio overlays and annotation; scenario tasks with objective success checks for TSR/TCT/Turns.
- Perceptual quality: run crowdsourced P.808 MOS with available toolkits.
- Report structure: primary table with TSR/TCT/Turns, barge-in latency and error rates, endpointing latency, HUN rate, VoiceBench aggregate and per-axis scores, SLU metrics, and P.808 MOS. Include stress plots such as TSR and HUN versus SNR and reverberation, and barge-in latency versus interrupt timing.
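To back the stress plots, per-condition aggregation can follow the same pattern as the cross-axis matrix. The sketch below buckets a binary outcome by SNR under an assumed record format, yielding the points for a TSR-versus-SNR or HUN-versus-SNR curve.

```python
from collections import defaultdict

def stress_curve(records: list[dict], outcome_key: str) -> dict[float, float]:
    """Mean of a binary outcome per SNR bucket, e.g. task success or hallucination incidence.

    Each record is assumed to carry an "snr_db" field plus the binary outcome of interest;
    both keys are illustrative and should be adapted to the harness's logging format.
    """
    buckets: dict[float, list[float]] = defaultdict(list)
    for r in records:
        buckets[r["snr_db"]].append(float(bool(r[outcome_key])))
    return {snr: sum(v) / len(v) for snr, v in sorted(buckets.items())}

# Example: points for a "TSR versus SNR" stress plot.
# curve = stress_curve(session_records, outcome_key="success")
```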
Practical considerations and next steps
- Use blinded human raters where semantic judgment is required, and combine their ratings with fine-grained automatic logging for timing metrics.
- Make protocols reproducible by publishing task definitions, audio perturbation scripts, and analysis notebooks.
- Prefer cross-axis analyses to single-metric leaderboards: track where systems fail when multiple adverse conditions coincide.
References
Key resources to consult include VoiceBench, SLUE and Phase-2, MASSIVE, Spoken-SQuAD sets, DSTC tracks, Alexa Prize TaskBot reports, and recent papers on barge-in, endpointing, and ASR hallucinations.