Evaluating Voice Agents in 2025: Beyond WER to Task Success, Barge-In, and Noise-Driven Hallucinations

Why WER alone fails to capture interaction quality

Word Error Rate (WER) measures transcription fidelity, but it says nothing about whether a voice agent actually helps the user complete tasks or behaves sensibly in conversation. Two systems with similar WERs can deliver very different user experiences once latency, turn-taking, interruption handling, safety decisions, and robustness to acoustic or content perturbations are taken into account. In practice, predicting user satisfaction requires interaction signals and task-level outcomes, not ASR accuracy alone.
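
As a worked example of that gap, the minimal sketch below (plain word-level edit distance, no text normalization, made-up restaurant-booking utterances) shows two hypotheses that score identical WER even though only one preserves the slot value the task depends on:

```python
# Minimal WER sketch: word-level edit distance over whitespace tokens.
# Illustrative only; a production evaluation would normalize text
# (casing, punctuation, numerals) before scoring.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Wagner-Fischer dynamic programming over substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

reference = "book a table for two at seven pm"
hyp_a = "book a table for two at eleven pm"     # wrong time slot -> task fails
hyp_b = "book uh a table for two at seven pm"   # harmless filler -> task succeeds
print(wer(reference, hyp_a), wer(reference, hyp_b))  # both 0.125
```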

Core metrics to add and how to measure them

  1. End-to-end task success (a measurement sketch for items 1-3 follows this list)
  2. Barge-in and turn-taking
  3. Hallucination-under-noise (HUN)
  4. Instruction following, safety, and robustness
  5. Perceptual speech quality (TTS and enhancement)

Benchmark landscape and what each covers

What to add to obtain a complete picture

A reproducible evaluation plan

Practical considerations and next steps

References

Key resources to consult include VoiceBench, SLUE and SLUE Phase-2, MASSIVE, the Spoken-SQuAD datasets, the DSTC tracks, Alexa Prize TaskBot reports, and recent papers on barge-in, endpointing, and ASR hallucinations.