The growing capabilities of large language models and multimodal systems have spurred interest in
voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these
systems' capabilities.
We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation
framework:
- Weakness 1 (W1): Lack of voice personalization evaluation. Existing benchmarks emphasize
intelligibility or naturalness but rarely examine a model's ability to mimic a specific voice. In
practice, this form of personalization is crucial for user trust and sustained engagement; healthcare
and elderly-care assistants, for example, rely on a familiar voice to provide comfort. Without
systematic assessment of this ability, models risk failing in personalized applications.
- Weakness 2 (W2): Limited focus on hands-free interaction.
Current audio understanding benchmarks often rely on text-based instructions, creating a modality
mismatch with actual voice-first usage. This discrepancy is especially consequential in
safety-critical and accessibility-oriented contexts, such as driving, operating machinery, or
supporting visually impaired users, where hands-free, speech-only interaction is not a matter of
convenience but a fundamental requirement. Ignoring this dimension raises uncertainty about model
reliability in these scenarios.
- Weakness 3 (W3): Neglect of the diverse audio contexts of daily life.
While some datasets include speech with background noise or environmental disturbances, they rarely
evaluate models under realistic conditions that span varied audio contexts. In practice, assistants
are expected to handle conversations that go beyond human speech, covering natural sounds, music, and
other complex audio. Without evaluation across such contexts, benchmarks offer little assurance that
models remain reliable and helpful in everyday environments.
- Weakness 4 (W4): Insufficient multimodal (vision + audio) integration assessment.
Despite rapid advances in multimodal learning, benchmarks rarely evaluate scenarios in which speech
must be interpreted alongside visual input. Yet many applications, such as smart tutoring systems,
require assistants to process language and visual context jointly. This omission means that current
benchmarks fall short of reflecting the multimodal demands of real-world human–AI interaction.
We introduce
VoiceAssistant-Eval, a comprehensive benchmark designed to
assess AI assistants across listening, speaking, and viewing.
VoiceAssistant-Eval comprises 10,497
curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken
dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and
highly heterogeneous images for viewing.
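To make the task taxonomy concrete, the sketch below shows one way a single benchmark example could be
represented in Python. The field names (skill, task_category, audio_path, image_path,
reference_voice_path) are illustrative assumptions for exposition, not the released data format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one benchmark example; field names are
# illustrative assumptions, not the actual VoiceAssistant-Eval schema.
@dataclass
class EvalExample:
    example_id: str
    skill: str                                 # "listening", "speaking", or "viewing"
    task_category: str                         # one of the 13 task categories
    audio_path: str                            # spoken instruction and/or audio context
    image_path: Optional[str] = None           # present only for viewing tasks
    reference_text: Optional[str] = None       # reference answer, if any
    reference_voice_path: Optional[str] = None # target voice for role-play imitation

# Example: a listening item about natural sounds (paths and text are placeholders).
item = EvalExample(
    example_id="listening-0001",
    skill="listening",
    task_category="natural_sounds",
    audio_path="audio/listening/0001.wav",
    reference_text="The clip contains birdsong followed by rainfall.",
)
```

Grouping items by skill in this way would let a harness report listening, speaking, and viewing
results separately.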
To demonstrate its utility, we
evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of response content and speech,
as well as the consistency between them. The results reveal
three key
findings: (1) proprietary models do not universally outperform open-source models;
(2)
most models excel at speaking tasks but lag in audio understanding; and
(3) well-designed smaller
models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double
the listening accuracy of LLaMA-Omni2-32B-Bilingual.
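As a rough illustration of how the three per-response dimensions might be combined, the sketch below
averages content quality, speech quality, and content-speech consistency, assuming each is normalized
to [0, 1]. The equal weighting and the overall_score helper are assumptions for illustration, not the
benchmark's official aggregation.

```python
from statistics import mean

def overall_score(content: float, speech: float, consistency: float) -> float:
    """Combine the three per-response dimensions into a single number.

    Assumes each dimension is already normalized to [0, 1]; the equal
    weighting here is an illustrative choice, not the benchmark's
    official aggregation rule.
    """
    for value in (content, speech, consistency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are expected to lie in [0, 1]")
    return mean((content, speech, consistency))

# A response with strong content but weaker speech quality.
print(overall_score(content=0.82, speech=0.61, consistency=0.74))  # ~0.723
```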
However,
challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks
are difficult for current models, and significant gaps persist in robustness and safety alignment.
VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding
the development of next-generation multimodal voice assistants.