The growing capabilities of large language models and multimodal systems have spurred interest in
voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these
systems' capabilities.
We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation
framework:
- Weakness 1 (W1): Lack of voice personalization evaluation. Existing benchmarks emphasize
intelligibility or naturalness but rarely examine a model's ability to mimic a specific voice. In
practice, this form of personalization is crucial for user trust and sustained engagement; healthcare
and elderly-care assistants, for example, rely on a familiar voice to provide comfort. Without
systematic assessment of this ability, models risk failing in personalized applications.
- Weakness 2 (W2): Limited focus on hands-free interaction.
Current audio understanding benchmarks often rely on text-based instructions, creating a modality
mismatch with actual voice-first usage. This discrepancy is especially consequential in
safety-critical and accessibility-oriented contexts, such as driving, operating machinery, or
supporting visually impaired users, where hands-free, speech-only interaction is not a matter of
convenience but a fundamental requirement. Ignoring this dimension raises uncertainty about model
reliability in these scenarios.
- Weakness 3 (W3): Neglect of the diverse audio contexts of daily life.
While some datasets include speech with background noise or environmental disturbances, they rarely
evaluate models under realistic conditions that span varied audio contexts. In practice, assistants
are expected to handle conversations that go beyond human speech, covering natural sounds, music, and
other complex audio. Without evaluation across such contexts, benchmarks offer little assurance that
models remain reliable and helpful in everyday environments.
- Weakness 4 (W4): Insufficient multimodal (vision + audio) integration assessment.
Despite rapid advances in multimodal learning, benchmarks rarely evaluate scenarios in which speech
must be interpreted alongside visual input. Yet many applications, such as smart tutoring systems,
require assistants to process language and visual context jointly. This omission means that current
benchmarks fall short of reflecting the multimodal demands of real-world human–AI interaction.
We introduce
VoiceAssistant-Eval, a comprehensive benchmark designed to
assess AI assistants across listening, speaking, and viewing.
VoiceAssistant-Eval comprises 10,497
curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken
dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and
highly heterogeneous images for viewing.
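To make the task taxonomy concrete, the sketch below shows one way a single benchmark example could be
represented in Python. The field names (skill, task_category, audio_path, image_path,
reference_voice_path) are illustrative assumptions for exposition, not the released data format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one benchmark example; field names are
# illustrative assumptions, not the actual VoiceAssistant-Eval schema.
@dataclass
class EvalExample:
    example_id: str
    skill: str                                 # "listening", "speaking", or "viewing"
    task_category: str                         # one of the 13 task categories
    audio_path: str                            # spoken instruction and/or audio context
    image_path: Optional[str] = None           # present only for viewing tasks
    reference_text: Optional[str] = None       # reference answer, if any
    reference_voice_path: Optional[str] = None # target voice for role-play imitation

# Example: a listening item about natural sounds (paths and text are placeholders).
item = EvalExample(
    example_id="listening-0001",
    skill="listening",
    task_category="natural_sounds",
    audio_path="audio/listening/0001.wav",
    reference_text="The clip contains birdsong followed by rainfall.",
)
```

Grouping items by skill in this way would let a harness report listening, speaking, and viewing
results separately.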
To demonstrate its utility, we
evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of response content and speech,
as well as the consistency between them. The results reveal
three key
findings: (1) proprietary models do not universally outperform open-source models;
(2)
most models excel at speaking tasks but lag in audio understanding; and
(3) well-designed smaller
models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double
the listening accuracy of LLaMA-Omni2-32B-Bilingual.
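As a rough illustration of how the three per-response dimensions might be combined, the sketch below
averages content quality, speech quality, and content-speech consistency, assuming each is normalized
to [0, 1]. The equal weighting and the overall_score helper are assumptions for illustration, not the
benchmark's official aggregation.

```python
from statistics import mean

def overall_score(content: float, speech: float, consistency: float) -> float:
    """Combine the three per-response dimensions into a single number.

    Assumes each dimension is already normalized to [0, 1]; the equal
    weighting here is an illustrative choice, not the benchmark's
    official aggregation rule.
    """
    for value in (content, speech, consistency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are expected to lie in [0, 1]")
    return mean((content, speech, consistency))

# A response with strong content but weaker speech quality.
print(overall_score(content=0.82, speech=0.61, consistency=0.74))  # ~0.723
```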
However,
challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks
are difficult for current models, and significant gaps persist in robustness and safety alignment.
VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding
the development of next-generation multimodal voice assistants.