
Figure 1: (a) Scores of six prominent omni-models across 13 tasks. (b) Examples from four newly designed tasks for voice assistants: I. An example from the role-play task with reference audio. II. A truly voice-based multi-turn conversation, rather than multi-turn context supplied as text. III. Multi-modal (vision + audio) integration understanding. IV. An audio question with music context.

Introduction

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation framework:

  • Weakness 1 (W1): Lack of voice personalization evaluation. The ability to mimic a specific voice is crucial for creating personalized and engaging AI assistants, yet existing benchmarks emphasize intelligibility or naturalness and rarely examine this capability. In practice, personalization is central to user trust and sustained engagement: healthcare and elderly-care assistants, for example, require a familiar voice to provide comfort. Without a systematic assessment of this ability, models risk failing in personalized applications.
  • Weakness 2 (W2): Limited focus on hands-free interaction. Current audio understanding benchmarks often rely on text-based instructions, creating a modality mismatch with actual voice-first usage. This discrepancy is especially consequential in safety-critical and accessibility-oriented contexts, such as driving, operating machinery, or supporting visually impaired users, where hands-free, speech-only interaction is not a matter of convenience but a fundamental requirement. Ignoring this dimension raises uncertainty about model reliability in these scenarios.
  • Weakness 3 (W3): Neglect of various audio contexts in daily life. While some datasets include speech samples with background noise or environmental disturbances, they rarely evaluate models under realistic conditions with varied audio contexts. In practice, assistants are expected to engage in conversations beyond human speech, including topics related to natural sounds, music, and other complex contexts. Without evaluation across diverse contexts, benchmarks offer little assurance that models can remain reliable and helpful in everyday environments.
  • Weakness 4 (W4): Insufficient multi-modal (vision + audio) integration assessment. Despite rapid advances in multi-modal learning, benchmarks rarely evaluate scenarios in which speech must be interpreted alongside visual input. Yet many applications, such as smart tutoring, require assistants to process language and visual context jointly. This absence means that current benchmarks fall short of reflecting the multimodal demands of real-world human–AI interaction.
We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation multimodal voice assistants.
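For concreteness, the sketch below shows how such a scoring loop might be structured: the model under test answers each example, three judge functions score the response content, the speech quality, and the content-speech consistency, and the results are averaged per task category. All names here (the respond callable, the judge callables, and the example fields) are illustrative placeholders, not the benchmark's actual code or data schema.

```python
# Minimal sketch of a VoiceAssistant-Eval-style scoring loop.
# The `respond` callable, judge callables, and example fields are
# hypothetical placeholders; the benchmark's real pipeline may differ.
from statistics import mean
from typing import Callable, Dict, List, Tuple

Judge = Callable[[dict, str, bytes], float]               # returns a 0-100 score
Responder = Callable[[bytes, object], Tuple[str, bytes]]  # (text, audio) reply

def evaluate(respond: Responder, judges: Dict[str, Judge],
             examples: List[dict]) -> Dict[str, Dict[str, float]]:
    """Average content / speech / consistency scores per task category."""
    per_task: Dict[str, List[Dict[str, float]]] = {}
    for ex in examples:
        reply_text, reply_audio = respond(ex["audio"], ex.get("image"))
        scores = {name: judge(ex, reply_text, reply_audio)
                  for name, judge in judges.items()}
        per_task.setdefault(ex["task"], []).append(scores)
    return {task: {name: mean(run[name] for run in runs) for name in judges}
            for task, runs in per_task.items()}
```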

Official Leaderboard

Scores on VoiceAssistant-Eval across 3 main tasks.

🚨 To submit your results to the leaderboard, please send them to this email.

# Model Source Date Overall Average Score on Listening Average Score on Speaking Average Score on Viewing
1 Qwen2.5-Omni-7B 🥇 Link 2025-09-26 36.37 33.56 41.27 34.27
2 Qwen2.5-Omni-3B 🥈 Link 2025-09-26 30.57 31.02 35.60 25.08
3 Baichuan-Omni-1d5 🥉 Link 2025-09-26 29.66 30.48 32.73 25.77
4 MiniCPM-o-2_6 Link 2025-09-26 28.29 31.63 35.81 17.42
5 mini-omni2 Link 2025-09-26 6.80 4.45 12.97 2.99
6 moshika-vis-pytorch-bf16 Link 2025-09-26 3.64 2.68 5.24 2.99
7 GPT-4o-Audio Link 2025-09-26 - 39.78 51.26 -
8 Step-Audio-2-mini Link 2025-09-26 - 40.06 31.30 -
9 LLaMA-Omni2-32B-Bilingual Link 2025-09-26 - 16.00 39.44 -
10 Kimi-Audio-7B-Instruct Link 2025-09-26 - 26.38 21.66 -
11 GLM-4-Voice-9B Link 2025-09-26 - 15.83 29.99 -
12 LLaMA-Omni2-3B-Bilingual Link 2025-09-26 - 13.56 32.12 -
13 LLaMA-Omni2-14B-Bilingual Link 2025-09-26 - 13.11 31.10 -
14 Step-Audio Link 2025-09-26 - 15.57 28.43 -
15 LLaMA-Omni2-7B-Bilingual Link 2025-09-26 - 12.63 27.11 -
16 Freeze-Omni Link 2025-09-26 - 10.58 24.34 -
17 Llama-3.1-8B-Omni Link 2025-09-26 - 10.47 17.09 -
18 LLaMA-Omni2-1.5B-Bilingual Link 2025-09-26 - 9.03 12.08 -
19 LLaMA-Omni2-0.5B-Bilingual Link 2025-09-26 - 7.91 8.32 -
20 mini-omni Link 2025-09-26 - 2.49 7.94 -
21 moshiko-pytorch-bf16 Link 2025-09-26 - 2.03 4.66 -
22 moshika-pytorch-bf16 Link 2025-09-26 - 2.02 3.65 -
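The aggregation behind the Overall column is not stated explicitly on this page, but the displayed numbers are consistent with an unweighted mean of the three main-task averages (and no Overall whenever the Viewing score is unavailable). The snippet below reproduces the top rows under that reading; treat the formula as inferred from the table rather than documented.

```python
# Overall appears to be the unweighted mean of the three main-task averages
# (inferred from the displayed values, not an officially stated formula).
def overall(listening: float, speaking: float, viewing: float) -> float:
    return round((listening + speaking + viewing) / 3, 2)

assert overall(33.56, 41.27, 34.27) == 36.37  # Qwen2.5-Omni-7B
assert overall(31.02, 35.60, 25.08) == 30.57  # Qwen2.5-Omni-3B
assert overall(30.48, 32.73, 25.77) == 29.66  # Baichuan-Omni-1d5
```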

Detailed Leaderboard

Scores on VoiceAssistant-Eval across 13 tasks.

🚨 To submit your results to the leaderboard, please send them to this email.

# Model SRC Date AVG | Listening: General · Music · Sound · Speech | Speaking: Assistant · Emotion · Instruction Following · Multi-Round · Reasoning · Robustness · Roleplay · Safety | Viewing: Multi-Discipline
1 Qwen2.5-Omni-7B 🥇 Link 2025-09-26 36.37 29.8 23.1 45.5 35.9 51.1 31.3 27.6 55.7 48.9 38.6 5.2 71.9 34.3
2 Qwen2.5-Omni-3B 🥈 Link 2025-09-26 30.57 24.2 25.7 44.1 30.1 44.9 27.4 24.0 47.5 42.4 32.3 3.61 62.8 25.1
3 Baichuan-Omni-1d5 🥉 Link 2025-09-26 29.66 31.5 21.6 33.6 35.2 43.1 19.2 27.7 37.3 41.0 22.5 5.3 65.9 25.8
4 MiniCPM-o-2_6 Link 2025-09-26 28.29 28.8 24.5 32.6 40.6 40.3 33.6 23.2 45.6 35.5 27.7 6.5 74.3 17.4
5 mini-omni2 Link 2025-09-26 6.80 3.8 2.1 4.6 7.3 13.0 17.1 3.2 5.6 7.2 12.9 0.2 44.6 3.0
6 moshika-vis-pytorch-bf16 Link 2025-09-26 3.64 1.4 2.4 3.4 3.4 2.1 4.2 1.7 1.0 5.0 0.4 0.1 27.5 3.0
7 GPT-4o-Audio Link 2025-09-26 - 38.6 35.4 47.7 37.4 62.7 32.5 44.3 64.0 63.8 54.7 13.7 74.5 -
8 Step-Audio-2-mini Link 2025-09-26 - 30.2 31.5 52.0 46.5 34.7 21.7 24.2 31.8 44.8 12.5 6.8 73.9 -
9 LLaMA-Omni2-32B-Bilingual Link 2025-09-26 - 17.2 4.4 12.9 29.4 51.5 24.7 33.5 49.4 50.5 32.1 0.3 73.6 -
10 Kimi-Audio-7B-Instruct Link 2025-09-26 - 21.0 23.3 30.7 30.5 23.9 19.8 18.0 24.0 27.4 10.3 5.5 44.4 -
11 GLM-4-Voice-9B Link 2025-09-26 - 19.2 11.2 13.1 19.9 33.8 28.1 18.1 43.2 25.6 24.4 4.5 62.3 -
12 LLaMA-Omni2-3B-Bilingual Link 2025-09-26 - 14.1 4.8 11.8 23.5 42.9 21.3 23.6 40.6 37.3 31.0 0.3 59.8 -
13 LLaMA-Omni2-14B-Bilingual Link 2025-09-26 - 10.7 6.3 14.5 21.0 47.5 23.2 23.1 41.0 29.5 27.7 0.3 56.6 -
14 Step-Audio Link 2025-09-26 - 14.3 9.0 15.6 23.3 33.2 17.9 20.0 43.2 29.8 20.0 12.9 50.4 -
15 LLaMA-Omni2-7B-Bilingual Link 2025-09-26 - 9.2 5.2 14.4 21.9 42.0 23.7 18.8 36.6 25.1 26.8 0.4 43.7 -
16 Freeze-Omni Link 2025-09-26 - 11.4 7.6 9.0 14.4 12.1 23.8 11.0 18.6 25.2 24.2 0.2 79.8 -
17 Llama-3.1-8B-Omni Link 2025-09-26 - 9.7 4.2 12.3 15.6 34.6 15.0 12.5 19.5 19.3 19.6 0.3 16.0 -
18 LLaMA-Omni2-1.5B-Bilingual Link 2025-09-26 - 6.9 5.0 7.6 16.7 28.3 13.3 8.2 13.9 14.0 14.3 0.0 10.5 -
19 LLaMA-Omni2-0.5B-Bilingual Link 2025-09-26 - 5.2 1.9 8.3 16.3 18.4 10.0 4.2 7.9 7.8 7.6 0.3 10.5 -
20 mini-omni Link 2025-09-26 - 1.9 1.8 2.4 3.9 6.6 10.8 1.5 2.8 4.1 7.1 0.0 30.7 -
21 moshiko-pytorch-bf16 Link 2025-09-26 - 1.6 2.3 1.3 2.9 1.6 3.4 1.3 2.1 4.7 0.4 0.1 23.7 -
22 moshika-pytorch-bf16 Link 2025-09-26 - 1.4 2.4 1.6 2.6 1.6 3.1 1.6 0.8 4.0 0.3 0.0 17.8 -

Roleplay Leaderboard

Scores on the Speaking Roleplay task in VoiceAssistant-Eval.

🚨 To submit your results to the leaderboard, please send them to this email.

# Model SRC Date Speaking Roleplay (Overall) | Content | Speech | Consistency | Speaker Similarity
1 GPT-4o-Audio 🥇 Link 2025-09-26 13.72 36.5 76.0 95.0 51.8
2 Step-Audio 🥈 Link 2025-09-26 12.92 33.2 56.0 90.5 75.1
3 Step-Audio-2-mini 🥉 Link 2025-09-26 6.81 12.7 76.0 93.4 72.6
4 MiniCPM-o-2_6 Link 2025-09-26 6.46 21.8 64.0 74.8 59.7
5 Kimi-Audio-7B-Instruct Link 2025-09-26 5.54 23.0 54.0 83.4 51.2
6 Baichuan-Omni-1d5 Link 2025-09-26 5.52 14.3 82.0 84.3 51.8
7 Qwen2.5-Omni-7B Link 2025-09-26 5.15 12.7 82.0 96.6 51.6
8 GLM-4-Voice-9B Link 2025-09-26 4.45 12.2 78.0 89.1 51.5
9 Qwen2.5-Omni-3B Link 2025-09-26 3.61 8.7 82.0 95.6 51.7
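The page does not spell out how the overall Speaking Roleplay score is combined from the four sub-scores. One reading that roughly reproduces the rounded values above is to treat Content as a 0-100 quality score and scale it by the Speech, Consistency, and Speaker Similarity rates. The sketch below shows that reading purely as an illustration; it is an assumption inferred from the table, and the small mismatches (e.g. 13.65 vs. the reported 13.72 for GPT-4o-Audio) are consistent with the sub-scores being rounded for display.

```python
# Hypothetical reading of the roleplay aggregation, inferred from the rounded
# leaderboard values above; not a documented formula.
def roleplay_overall(content: float, speech: float,
                     consistency: float, similarity: float) -> float:
    """Scale the 0-100 content score by the three rates (given in percent)."""
    return content * (speech / 100) * (consistency / 100) * (similarity / 100)

# GPT-4o-Audio row: 36.5, 76.0, 95.0, 51.8 -> ~13.65 (reported: 13.72)
print(round(roleplay_overall(36.5, 76.0, 95.0, 51.8), 2))
```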

VoiceAssistant-Eval Dataset


Overview of principal statistics for VoiceAssistant-Eval.


Proportional distribution of tasks and the corresponding weaknesses addressed in VoiceAssistant-Eval.

Visualization

🚨🚨🚨 Note! The data here is heavily compressed for easier visualization.


Error Analysis Examples

BibTeX

@misc{wang2025voiceassistantevalbenchmarkingaiassistants,
      title={VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing}, 
      author={Ke Wang and Houxing Ren and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2025},
      eprint={2509.22651},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22651}, 
}

Acknowledgement

We would like to thank MathVista and MathVision for the template of this website, which is adapted from Nerfies and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.