Evaluating Voice Agents in 2025: Beyond WER to Task Success, Barge-In, and Noise-Driven Hallucinations
Why WER alone fails to capture interaction quality
Word Error Rate (WER) measures transcription fidelity, but it does not measure whether a voice agent actually helps the user complete tasks or behaves sensibly in conversational settings. Two systems with similar WERs can produce very different user experiences when latency, turn-taking, interruption handling, safety decisions, and robustness to acoustic or content perturbations are taken into account. Real-world evaluations show that predicting user satisfaction requires interaction signals and task-level outcomes, not only ASR accuracy.
Core metrics to add and how to measure them
- End-to-end Task Success
- Metrics: Task Success Rate (TSR) with strict, task-specific success criteria, plus Task Completion Time (TCT) and Turns-to-Success.
- Why: Assistants are judged by outcome. TSR reflects whether the user achieved their goal under realistic interaction conditions.
- Protocol: Define verifiable tasks with clear endpoints (for example, assemble a shopping list with N items subject to constraints). Use blinded human raters and instrumented logs to compute TSR, TCT, and Turns; a minimal scoring sketch follows this item. For multilingual and SLU-rich tasks, draw intents and slots from datasets like MASSIVE.
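As a concrete starting point, the sketch below computes TSR, TCT, and Turns-to-Success from instrumented session logs. The SessionLog schema and its field names are illustrative assumptions rather than a standard format; the success flag is assumed to come from blinded raters applying the task's success criteria.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SessionLog:
    """One evaluated task attempt reconstructed from instrumented logs (hypothetical schema)."""
    task_id: str
    success: bool      # verdict from blinded raters against the task's success criteria
    duration_s: float  # wall-clock time from first user turn to task end
    user_turns: int    # user turns until success or abandonment

def task_metrics(sessions: list[SessionLog]) -> dict:
    """Task Success Rate, Task Completion Time, and Turns-to-Success over a session set."""
    successes = [s for s in sessions if s.success]
    return {
        "TSR": len(successes) / len(sessions),
        # TCT and Turns are reported over successful sessions only.
        "TCT_s": mean(s.duration_s for s in successes) if successes else None,
        "Turns": mean(s.user_turns for s in successes) if successes else None,
    }

if __name__ == "__main__":
    demo = [
        SessionLog("shopping_list_5_items", True, 74.2, 6),
        SessionLog("shopping_list_5_items", False, 120.0, 11),
        SessionLog("set_two_timers", True, 21.5, 3),
    ]
    print(task_metrics(demo))  # TSR = 2/3, TCT and Turns averaged over the two successes
```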
- Barge-in and turn-taking
- Metrics: Barge-In Detection Latency (ms), True/False Barge-In Rates, Endpointing Latency (ms).
- Why: Smooth interruption handling and fast endpointing shape perceived responsiveness. Poor handling causes frustration or lost user input.
- Protocol: Script interruptions at controlled offsets and SNRs. Record suppression and recognition timings with high-precision frame timestamps. Include noisy and far-field conditions, and measure both correct interruptions and spurious stops; a scoring sketch follows this item.
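A minimal sketch of the barge-in scoring step, assuming suppression and interruption timestamps have already been extracted from frame-level logs; the BargeInTrial schema is hypothetical. Endpointing latency can be computed analogously from the true end of user speech and the logged endpoint decision.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class BargeInTrial:
    """One scripted interruption trial with millisecond timestamps (hypothetical schema)."""
    interruption_intended: bool       # True for scripted barge-ins, False for distractor-only trials
    interrupt_onset_ms: float         # when the scripted interruption (or distractor) starts
    tts_suppressed_ms: float | None   # when agent playback actually stopped; None if it never stopped

def barge_in_metrics(trials: list[BargeInTrial]) -> dict:
    """Barge-In Detection Latency plus True/False Barge-In Rates."""
    intended = [t for t in trials if t.interruption_intended]
    distractors = [t for t in trials if not t.interruption_intended]
    detected = [t for t in intended if t.tts_suppressed_ms is not None]
    spurious = [t for t in distractors if t.tts_suppressed_ms is not None]
    latencies = [t.tts_suppressed_ms - t.interrupt_onset_ms for t in detected]
    return {
        "true_barge_in_rate": len(detected) / len(intended) if intended else None,
        "false_barge_in_rate": len(spurious) / len(distractors) if distractors else None,
        "median_detection_latency_ms": median(latencies) if latencies else None,
    }
```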
- Hallucination-Under-Noise (HUN)
- Metric: HUN Rate — the fraction of fluent outputs that are semantically unrelated to the audio under controlled noise or non-speech overlays.
- Why: ASR + audio-LLM stacks can emit convincing but incorrect content when exposed to non-speech or noisy audio. Hallucinations propagate into downstream actions and harm task completion.
- Protocol: Create audio sets with additive environmental noise, non-speech distractors, and disfluent speech. Use human judgments with adjudication to score semantic relatedness and compute HUN, and track whether hallucinations cause incorrect task steps; a sketch of the noise mixing and scoring follows this item.
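The sketch below illustrates two pieces of such a protocol: mixing environmental noise into clean speech at a controlled SNR, and turning adjudicated human judgments into an HUN rate with a downstream-impact figure. The HUNJudgment schema is a hypothetical annotation format, and the mixer assumes mono float waveforms at a matching sample rate.

```python
import numpy as np
from dataclasses import dataclass

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a noise signal into clean speech at a target SNR (mono float arrays)."""
    noise = np.resize(noise, speech.shape)        # loop or trim the noise to the speech length
    p_speech = float(np.mean(speech ** 2))
    p_noise = float(np.mean(noise ** 2)) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

@dataclass
class HUNJudgment:
    """Adjudicated human judgment for one perturbed clip (hypothetical schema)."""
    output_is_fluent: bool         # system emitted fluent text rather than silence or a rejection
    semantically_unrelated: bool   # raters agreed the text is unrelated to the audio content
    caused_wrong_task_step: bool   # downstream-impact flag taken from the task log

def hun_metrics(judgments: list[HUNJudgment]) -> dict:
    """HUN rate over fluent outputs, plus how often hallucinations corrupted a task step."""
    fluent = [j for j in judgments if j.output_is_fluent]
    halluc = [j for j in fluent if j.semantically_unrelated]
    return {
        "HUN_rate": len(halluc) / len(fluent) if fluent else None,
        "downstream_impact_rate": (
            sum(j.caused_wrong_task_step for j in halluc) / len(halluc) if halluc else None
        ),
    }
```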
- Instruction following, safety, and robustness
- Metric families: Instruction-Following Accuracy (including format and constraint adherence), Safety Refusal Rate on adversarial spoken prompts, Robustness Deltas across speaker age/accent/pitch and environment.
- Why: Voice agents must follow instructions correctly and fail safely on adversarial or dangerous prompts, while retaining performance across diverse speakers and environments.
- Protocol: Use VoiceBench for breadth on instruction following and safety; use SLUE and SLUE Phase-2 for SLU-specific tasks such as NER, dialog acts, summarization, and QA. A small scoring sketch follows this item.
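The scoring itself can stay simple once per-prompt labels exist. The sketch below assumes a hypothetical result schema in which benign prompts carry an instruction-following verdict and adversarial prompts carry a refusal label; it is a shape for the bookkeeping, not a reimplementation of VoiceBench scoring.

```python
from dataclasses import dataclass

@dataclass
class SpokenPromptResult:
    """One evaluated spoken prompt with rater or checker labels (hypothetical schema)."""
    is_adversarial: bool          # prompt drawn from the adversarial/safety subset
    followed_instructions: bool   # response met the task's format and constraint checks
    refused: bool                 # system declined or safely deflected the request

def instruction_safety_metrics(results: list[SpokenPromptResult]) -> dict:
    """Instruction-Following Accuracy on benign prompts, Safety Refusal Rate on adversarial ones."""
    benign = [r for r in results if not r.is_adversarial]
    adversarial = [r for r in results if r.is_adversarial]
    return {
        "instruction_following_accuracy": (
            sum(r.followed_instructions for r in benign) / len(benign) if benign else None
        ),
        "safety_refusal_rate": (
            sum(r.refused for r in adversarial) / len(adversarial) if adversarial else None
        ),
    }
```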
- Perceptual speech quality (TTS and enhancement)
- Metric: Subjective Mean Opinion Score using ITU-T P.808 (crowdsourced ACR/DCR/CCR).
- Why: Playback quality shapes the overall interaction, so TTS and enhancement quality should be measured in the end-to-end loop rather than on isolated clips.
- Protocol: Run crowdsourced listening tests following ITU-T P.808 and report per-condition MOS; a small aggregation sketch follows this item.
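For reporting, ACR votes are typically aggregated into a per-condition MOS. The sketch below is an illustrative aggregation only: it assumes rater qualification and gold-question filtering have already been handled by a P.808 toolkit, and simply computes the mean with an approximate 95% confidence half-width.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

def mos_by_condition(ratings: list[tuple[str, int]]) -> dict[str, tuple[float, float]]:
    """Per-condition MOS and approximate 95% confidence half-width.

    `ratings` holds (condition_id, acr_score) pairs on the 1-5 ACR scale, assumed to be
    already filtered by the qualification and gold-question checks of a P.808 toolkit.
    """
    per_condition: dict[str, list[int]] = defaultdict(list)
    for condition, score in ratings:
        per_condition[condition].append(score)
    out: dict[str, tuple[float, float]] = {}
    for condition, scores in per_condition.items():
        m = mean(scores)
        half_width = 1.96 * stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
        out[condition] = (m, half_width)
    return out
```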
Benchmark landscape and what each covers
- VoiceBench: Multi-facet speech-interaction benchmark covering knowledge, instruction following, safety, and robustness across speaker/environment/content variations. Uses real and synthetic speech but does not include barge-in or real-device task completion measurements.
- SLUE / SLUE Phase-2: SLU-focused datasets for NER, sentiment, dialog acts, QA, and summarization; useful to study pipeline fragility under ASR errors.
- MASSIVE: Large multilingual intent/slot corpus for building multilingual task suites and measuring TSR and slot F1 under speech conditions.
- Spoken-SQuAD / HeySQuAD: Spoken QA datasets for probing comprehension and multi-accent robustness.
- DSTC tracks: Dialog robustness, human ratings, and safety-focused evaluations in spoken settings.
- Real-world task assistance (Alexa Prize TaskBot): a practical reference for defining multi-step task success and user-centric KPIs.
What to add to obtain a complete picture
- Barge-in and endpointing KPIs: explicit harnesses to measure detection latency, suppression correctness, endpointing delay, and false barge-ins.
- Hallucination-under-noise protocols: controlled non-speech tests and reporting of HUN rates with downstream impact analysis.
- On-device interaction latency: measure time-to-first-token, time-to-final, and local processing overhead to correlate with user-perceived responsiveness.
- Cross-axis robustness matrices: combine task suites with speaker/environment/content perturbations to expose failure surfaces (for example, barge-in performance under far-field echo); a small aggregation sketch follows this list.
- Perceptual playback quality: use P.808 in the end-to-end loop rather than only on isolated TTS outputs.
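For the cross-axis matrices, a small aggregation over per-session records is often enough. The sketch below computes TSR per (speaker, environment) cell under a hypothetical logging schema; the same pattern applies to barge-in or HUN rates by swapping the outcome field.

```python
from collections import defaultdict

def cross_axis_tsr(sessions: list[dict]) -> dict[tuple[str, str], float]:
    """Task Success Rate per (speaker condition, environment condition) cell.

    Each session record is assumed to look like
    {"speaker": "accented", "environment": "far_field_echo", "success": True};
    the keys are a hypothetical logging schema, not a fixed format.
    """
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for s in sessions:
        cells[(s["speaker"], s["environment"])].append(bool(s["success"]))
    return {cell: sum(flags) / len(flags) for cell, flags in cells.items()}
```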
A reproducible evaluation plan
- Assemble the suite: VoiceBench for interaction breadth; SLUE/Phase-2 for SLU depth; MASSIVE for multilingual task coverage; Spoken-SQuAD and HeySQuAD for QA and accent robustness tests.
- Add missing harnesses: scripted barge-in tests across offsets and SNRs; HUN audio overlays and annotation; scenario tasks with objective success checks for TSR/TCT/Turns.
- Perceptual quality: run crowdsourced P.808 MOS with available toolkits.
- Report structure: primary table with TSR/TCT/Turns, barge-in latency and error rates, endpointing latency, HUN rate, VoiceBench aggregate and per-axis scores, SLU metrics, and P.808 MOS. Include stress plots such as TSR and HUN versus SNR and reverberation, and barge-in latency versus interrupt timing.
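To back the stress plots, per-condition aggregation can follow the same pattern as the cross-axis matrix. The sketch below buckets a binary outcome by SNR under an assumed record format, yielding the points for a TSR-versus-SNR or HUN-versus-SNR curve.

```python
from collections import defaultdict

def stress_curve(records: list[dict], outcome_key: str) -> dict[float, float]:
    """Mean of a binary outcome per SNR bucket, e.g. task success or hallucination incidence.

    Each record is assumed to carry an "snr_db" field plus the binary outcome of interest;
    both keys are illustrative and should be adapted to the harness's logging format.
    """
    buckets: dict[float, list[float]] = defaultdict(list)
    for r in records:
        buckets[r["snr_db"]].append(float(bool(r[outcome_key])))
    return {snr: sum(v) / len(v) for snr, v in sorted(buckets.items())}

# Example: points for a "TSR versus SNR" stress plot.
# curve = stress_curve(session_records, outcome_key="success")
```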
Practical considerations and next steps
- Use blinded human raters where semantic judgment is required, and combine their ratings with fine-grained automatic logging for timing metrics.
- Make protocols reproducible by publishing task definitions, audio perturbation scripts, and analysis notebooks.
- Prefer cross-axis analyses to single-metric leaderboards: track where systems fail when multiple adverse conditions coincide.
References
Key resources to consult include VoiceBench, SLUE and Phase-2, MASSIVE, Spoken-SQuAD sets, DSTC tracks, Alexa Prize TaskBot reports, and recent papers on barge-in, endpointing, and ASR hallucinations.