AU-Harness Unveiled: Fast, Open Toolkit to Benchmark Audio LLMs at Scale

Why a new audio evaluation framework matters

Voice AI is rapidly becoming central to multimodal systems, powering assistants and interactive agents that must understand and reason over audio. Despite rapid model progress, evaluation tools lag behind: many benchmarks are fragmented, slow, and narrowly focused, which makes fair comparison and realistic multi-turn testing difficult.

AU-Harness: a unified, scalable toolkit

UT Austin and the ServiceNow Research Team released AU-Harness, an open-source toolkit designed for large audio language model (LALM) evaluation. AU-Harness aims to be fast, standardized, and extensible, enabling researchers to evaluate models across a broad spectrum of audio tasks within one unified framework.

Speed and scalable design

AU-Harness is engineered for throughput and high hardware utilization. It integrates with the vLLM inference engine, uses a token-based request scheduler to keep many evaluation requests in flight across multiple nodes, and shards datasets so workloads are distributed proportionally across inference endpoints. This architecture yields near-linear scaling and keeps compute resources busy. In experiments, AU-Harness achieves around 127% higher throughput and reduces the real-time factor by nearly 60% compared with prior toolkits, turning multi-day evaluation runs into jobs that finish in hours.
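The two ideas behind that scaling can be sketched in a few lines of Python. The class and function below are illustrative only, not AU-Harness's actual implementation: a token-budget limiter that admits a request while the estimated number of in-flight tokens stays under a cap, and a weighted sharder that splits a dataset across inference endpoints in proportion to their capacity.

```python
import asyncio
from itertools import cycle


class TokenBudgetScheduler:
    """Admit a request only while the estimated number of in-flight tokens
    stays under a fixed budget, keeping the engine busy without overcommitting."""

    def __init__(self, max_inflight_tokens: int):
        self.budget = max_inflight_tokens
        self.in_flight = 0
        self._cond = asyncio.Condition()

    async def acquire(self, est_tokens: int) -> None:
        async with self._cond:
            await self._cond.wait_for(
                lambda: self.in_flight + est_tokens <= self.budget
            )
            self.in_flight += est_tokens

    async def release(self, est_tokens: int) -> None:
        async with self._cond:
            self.in_flight -= est_tokens
            self._cond.notify_all()


def shard_dataset(samples, endpoint_weights):
    """Split samples across endpoints proportionally to their weight,
    e.g. the number of GPUs behind each vLLM server."""
    shards = {name: [] for name in endpoint_weights}
    # A weighted round-robin keeps shard sizes proportional to the weights.
    order = [name for name, w in endpoint_weights.items() for _ in range(w)]
    for sample, name in zip(samples, cycle(order)):
        shards[name].append(sample)
    return shards
```

In a real harness the token budget would be derived from the serving engine's capacity, and each shard would be dispatched to the endpoint that owns it.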

Flexible configuration and multi-turn evaluation

Flexibility is built in: each model can run with its own hyperparameters, such as temperature or maximum output tokens, while the evaluation protocol stays standardized. Dataset filtering by accent, audio length, or noise profile enables targeted diagnostics. Crucially, AU-Harness supports multi-turn dialogue evaluation, so dialogue continuity, contextual reasoning, and adaptability can be benchmarked across extended conversational exchanges, something many earlier toolkits could not handle.
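A configuration for such a run might look like the snippet below. The field names and values are assumptions made for illustration, not AU-Harness's actual schema; the point is that generation settings live per model while the task, dataset filters, and multi-turn options are shared.

```python
# Hypothetical run configuration; field names are illustrative, not AU-Harness's schema.
run_config = {
    "models": [
        {"name": "gpt-4o-audio", "temperature": 0.0, "max_tokens": 256},
        {"name": "qwen2.5-omni", "temperature": 0.2, "max_tokens": 512},
    ],
    "task": "spoken_qa",
    "dataset_filters": {
        "accents": ["en-IN", "en-GB"],     # evaluate only these accents
        "max_audio_seconds": 30,           # drop very long clips
        "noise_profile": "clean",          # or "noisy" for robustness checks
    },
    "multi_turn": {
        "enabled": True,                   # carry context across turns
        "max_turns": 5,
    },
}
```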

Broad task coverage and novel approaches

AU-Harness expands task coverage to 50+ datasets, 380+ subsets, and 21 tasks spanning six categories.

Two notable innovations stand out: LLM-Adaptive Diarization, which evaluates speaker diarization by prompting the model directly instead of relying on specialized neural pipelines, and the Spoken Language Reasoning suite, which tests the ability to follow and reason over spoken instructions rather than merely transcribe them.
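As a rough illustration of the prompt-based diarization idea, the sketch below asks an audio LLM for speaker-attributed segments and parses its answer. The prompt wording and the query_audio_model helper are hypothetical stand-ins; the toolkit's actual prompts and scoring live in its repository.

```python
import json

# Hypothetical instruction; AU-Harness's real diarization prompts may differ.
DIARIZATION_PROMPT = (
    "Listen to the recording and return every speaker turn as a JSON list of "
    'objects with fields "speaker", "start", and "end" (in seconds). '
    "Label speakers SPEAKER_1, SPEAKER_2, ... in order of first appearance."
)


def diarize_with_lalm(audio_path: str, query_audio_model) -> list:
    """Ask the model for speaker turns directly, instead of running a dedicated
    neural diarization pipeline, then parse the JSON it returns."""
    raw = query_audio_model(audio=audio_path, prompt=DIARIZATION_PROMPT)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output simply scores poorly downstream
```

The predicted segments would then typically be compared against reference turns with a standard diarization error metric.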

What the benchmarks reveal

When AU-Harness was used to evaluate leading models such as GPT-4o, Qwen2.5-Omni, and Voxtral-Mini-3B, results showed clear strengths and weaknesses. Models performed strongly on ASR and spoken QA, but struggled with temporal reasoning tasks like diarization and with complex instruction-following when instructions were delivered in audio form. The analysis highlights an instruction modality gap: presenting the same task as spoken instructions instead of text causes performance drops of up to 9.5 points, pointing to remaining challenges for audio-native reasoning.

Open source and community-driven progress

AU-Harness is open source and comes with a public leaderboard, inviting the research community to collaborate, reproduce results, and push voice-first AI forward. The project links to a paper, project page, and GitHub repository with tutorials, code, and notebooks for adoption and extension.

For full details see the paper and project resources on the AU-Harness GitHub and arXiv.