2025
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models
Zhen Wan | Chao-Han Huck Yang | Yahan Yu | Jinchuan Tian | Sheng Li | Ke Hu | Zhehuai Chen | Shinji Watanabe | Fei Cheng | Chenhui Chu | Sadao Kurohashi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce the Speech-based Intelligence Quotient (SIQ), a new human-cognition-inspired evaluation pipeline for voice-understanding large language models (LLM_Voice), designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM_Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of the LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR-LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM_Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training. Our code and data will be open-sourced to encourage future studies.
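To make the three-level scoring concrete, here is a minimal sketch, not the authors' released pipeline, of how a transcript and its downstream uses could be scored at the Remembering, Understanding, and Application levels. The library choices (jiwer for WER, sentence-transformers for interpretation similarity) and the exact-match QA scoring are illustrative assumptions.

```python
# A minimal sketch of SIQ-style scoring at three cognitive levels.
# Library choices and scoring details are assumptions, not the released pipeline.
from jiwer import wer
from sentence_transformers import SentenceTransformer, util

def speech_iq_scores(reference: str, hypothesis: str,
                     ref_interpretation: str, hyp_interpretation: str,
                     qa_gold: list[str], qa_pred: list[str]) -> dict:
    # Level 1 - Remembering: verbatim accuracy via word error rate.
    remembering = 1.0 - wer(reference, hypothesis)

    # Level 2 - Understanding: cosine similarity between the LLM's interpretation
    # of the reference speech content and of the model's own hypothesis.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode([ref_interpretation, hyp_interpretation])
    understanding = float(util.cos_sim(emb[0], emb[1]))

    # Level 3 - Application: accuracy on downstream QA about the audio content.
    application = sum(g == p for g, p in zip(qa_gold, qa_pred)) / max(len(qa_gold), 1)

    return {"remembering": remembering,
            "understanding": understanding,
            "application": application}
```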
CoVoGER: A Multilingual Multitask Benchmark for Speech-to-text Generative Error Correction with Large Language Models
Zhengdong Yang | Zhen Wan | Sheng Li | Chao-Han Huck Yang | Chenhui Chu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) can rewrite the N-best hypotheses from a speech-to-text model, often fixing recognition or translation errors that traditional rescoring cannot. Yet research on generative error correction (GER) has focused on monolingual automatic speech recognition (ASR), leaving its multilingual and multitask potential underexplored. We introduce CoVoGER, a benchmark for GER that covers both ASR and speech-to-text translation (ST) across 15 languages and 28 language pairs. CoVoGER is constructed by decoding Common Voice 20.0 and CoVoST-2 with Whisper in three model sizes and SeamlessM4T in two model sizes, providing 5-best lists obtained via a mixture of beam search and temperature sampling. We evaluate various instruction-tuned LLMs, including commercial models in zero-shot mode and open-source models with LoRA fine-tuning, and find that the mixture decoding strategy yields the best GER performance in most settings. CoVoGER will be released to promote research on reliable language-universal speech-to-text GER. The code and data for the benchmark are available at https://github.com/N-Orien/CoVoGER.
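To illustrate the GER paradigm that CoVoGER evaluates, the sketch below prompts an instruction-tuned LLM with a 5-best list and asks for a single corrected output. The prompt wording and the model identifier are assumptions for illustration, not the benchmark's released recipe.

```python
# A minimal sketch of N-best generative error correction (GER):
# the LLM reads the 5-best ASR/ST hypotheses and emits a corrected output.
from transformers import pipeline

# Assumed model choice; any instruction-tuned causal LLM would do.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def correct_with_llm(hypotheses: list[str], task: str = "transcription") -> str:
    prompt = (
        f"Below are the 5-best {task} hypotheses from a speech-to-text model.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
        + f"\nBased on these candidates, output the single most accurate {task}:"
    )
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```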
Generative Error Correction for Emotion-aware Speech-to-text Translation
Zhengdong Yang | Sheng Li | Chenhui Chu
Findings of the Association for Computational Linguistics: ACL 2025
This paper explores emotion-aware speech-to-text translation (ST) using generative error correction (GER) by large language models (LLMs). Despite recent advancements in ST, the impact of emotional content has been overlooked. First, we enhance the translation of emotional speech by adopting the GER paradigm: fine-tuning an LLM to generate the translation from the decoded N-best hypotheses. Moreover, we incorporate emotion and sentiment labels into the LLM fine-tuning process so that the model can take the emotional content into account. In addition, we project the ST model’s latent representation into the LLM embedding space to further improve emotion recognition and translation. Experiments on an English-Chinese dataset show the effectiveness of the combination of GER, emotion and sentiment labels, and the projector for emotion-aware ST. Our code is available at https://github.com/N-Orien/EmoST.
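As a rough illustration of how emotion and sentiment labels can enter the fine-tuning data, the sketch below builds one supervised example from an N-best list and its labels. The prompt format and field names are assumptions, not the released EmoST recipe.

```python
# A minimal sketch of an emotion-aware GER fine-tuning example: the prompt carries
# the N-best hypotheses plus emotion and sentiment labels; the target is the
# reference Chinese translation. Formatting is an illustrative assumption.
def build_emotion_ger_example(nbest: list[str], emotion: str,
                              sentiment: str, reference_zh: str) -> dict:
    prompt = (
        "Translate the English speech into Chinese, given these ASR hypotheses:\n"
        + "\n".join(f"- {h}" for h in nbest)
        + f"\nSpeaker emotion: {emotion}. Sentiment: {sentiment}.\nTranslation:"
    )
    # One instruction-tuning pair, consumable by a standard supervised fine-tuning loop.
    return {"prompt": prompt, "completion": reference_zh}
```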
2023
Towards Speech Dialogue Translation Mediating Speakers of Different Languages
Shuichiro Shimizu | Chenhui Chu | Sheng Li | Sadao Kurohashi
Findings of the Association for Computational Linguistics: ACL 2023
We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech translation experiments using Whisper and mBART, and show that bilingual context performs better in our settings.
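A minimal sketch of the cascaded setup, assuming Whisper for ASR and mBART-50 for translation, with previous dialogue turns prepended as context. How those turns are rendered (all in one language vs. each speaker's original language, i.e., the monolingual vs. bilingual context settings studied in the paper) is left to the caller; the model identifiers, language codes, and context formatting here are illustrative assumptions.

```python
# A minimal sketch of cascaded speech dialogue translation with dialogue context.
import whisper
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

asr = whisper.load_model("medium")
mt_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(mt_name)
mt = MBartForConditionalGeneration.from_pretrained(mt_name)

def translate_turn(audio_path: str, context_turns: list[str],
                   src_lang: str = "en_XX", tgt_lang: str = "ja_XX") -> str:
    # Step 1 of the cascade: transcribe the current speaker's turn.
    transcript = asr.transcribe(audio_path)["text"].strip()

    # Step 2: translate the turn with earlier turns prepended as context.
    # Supplying context in the source language corresponds to monolingual context;
    # keeping each turn in its original language corresponds to bilingual context.
    tokenizer.src_lang = src_lang
    source = " ".join(context_turns + [transcript])
    inputs = tokenizer(source, return_tensors="pt")
    generated = mt.generate(**inputs,
                            forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang])
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```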
Multi-Domain Dialogue State Tracking with Disentangled Domain-Slot Attention
Longfei Yang | Jiyi Li | Sheng Li | Takahiro Shinozaki
Findings of the Association for Computational Linguistics: ACL 2023
As the core of task-oriented dialogue systems, dialogue state tracking (DST) is designed to track the dialogue state through the conversation between users and systems. Multi-domain DST has been an important challenge in which dialogue states across multiple domains need to be considered. In recent mainstream approaches, each domain and slot are aggregated and regarded as a single query that attends over the dialogue history to obtain domain-slot-specific representations. In this work, we propose disentangled domain-slot attention for multi-domain dialogue state tracking. The proposed approach disentangles domain-slot-specific information extraction in a flexible and context-dependent manner by separating the domain and slot queries in the attention component. Through a series of experiments on the MultiWOZ 2.0 and MultiWOZ 2.4 datasets, we demonstrate that our proposed approach outperforms standard multi-head attention with an aggregated domain-slot query.
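A minimal PyTorch sketch of the idea, not the paper's exact architecture: the domain embedding and the slot embedding attend over the dialogue history as separate queries, and the two context vectors are fused, instead of feeding a single aggregated domain-slot query into attention. Dimensions and the fusion layer are illustrative assumptions.

```python
# A minimal sketch of disentangled domain-slot attention for multi-domain DST.
import torch
import torch.nn as nn

class DisentangledDomainSlotAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.domain_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.slot_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, domain_emb, slot_emb, history):
        # domain_emb, slot_emb: (batch, 1, d_model); history: (batch, seq, d_model)
        domain_ctx, _ = self.domain_attn(domain_emb, history, history)
        slot_ctx, _ = self.slot_attn(slot_emb, history, history)
        # Fuse the two context-dependent views into one domain-slot representation.
        return self.fuse(torch.cat([domain_ctx, slot_ctx], dim=-1))

# Example: 8 dialogues, 50 history tokens, hidden size 256.
attn = DisentangledDomainSlotAttention()
out = attn(torch.randn(8, 1, 256), torch.randn(8, 1, 256), torch.randn(8, 50, 256))
print(out.shape)  # torch.Size([8, 1, 256])
```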