Yong Qin
2026
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
Hui Wang | Jinghua Zhao | Yifan Yang | Shujie Liu | Junyang Chen | Yanzhe Zhang | Shiwan Zhao | Jinyu Li | Jiaming Zhou | Haoqin Sun | Yan Lu | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hui Wang | Jinghua Zhao | Yifan Yang | Shujie Liu | Junyang Chen | Yanzhe Zhang | Shiwan Zhao | Jinyu Li | Jiaming Zhou | Haoqin Sun | Yan Lu | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. The relevant code, models, and data are publicly available at https://github.com/NKU-HLT/SpeechLLM-as-Judges.
RealTalk-CN: A Realistic Chinese Speech Task-Oriented Dialogue Benchmark with Cross-Modal Analysis
Enzhi Wang | Jiaming Zhou | Yuhang Jia | Aobo Kong | Qicheng Li | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Enzhi Wang | Jiaming Zhou | Yuhang Jia | Aobo Kong | Qicheng Li | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in speech large language models (e.g., GPT-4o) have enabled end-to-end spoken interactions, yet their robustness in real-world applications remains unclear, where systems must assist users in completing specific tasks under complex conditions such as multi-turn, ambiguous, and often spontaneous speech, as well as natural alternation between speech and text. Task-oriented dialogue (TOD) offers a realistic scenario to evaluate whether models can effectively help users accomplish such task-oriented goals, but existing benchmarks are mainly text-based, and the few speech datasets are limited to English and often neglect spontaneous disfluencies and speaker diversity. To address this gap, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech–text TOD dataset, containing 5.4k dialogues (60K turns, ~150 hours) of real human-to-human recordings with detailed annotations for dialogue states, disfluency types, and speaker characteristics. Based on this dataset, we propose a cross-modal interaction task supporting dynamic speech-text switching and a comprehensive evaluation protocol assessing robustness to disfluencies, sensitivity to speaker variation, and cross-domain generalization. Experiments on state-of-the-art models demonstrate the challenges posed by RealTalk-CN and establish its value as a benchmark for developing reliable and fair Speech LLMs in real-world deployments. The dataset and evaluation framework are available to encourage further research.
2025
ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Jiaming Zhou | Shiyao Wang | Shiwan Zhao | Jiabei He | Haoqin Sun | Hui Wang | Cheng Liu | Aobo Kong | Yujie Guo | Xi Yang | Yequan Wang | Yonghua Lin | Yong Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jiaming Zhou | Shiyao Wang | Shiwan Zhao | Jiabei He | Haoqin Sun | Hui Wang | Cheng Liu | Aobo Kong | Yujie Guo | Xi Yang | Yequan Wang | Yonghua Lin | Yong Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning demonstrates significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research and holds potential for applications in educational technology and child-computer interaction. It will be open-source and freely available for all academic purposes.
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Aobo Kong | Wentao Ma | Shiwan Zhao | Yongbin Li | Yuchuan Wu | Ke Wang | Xiaoqian Liu | Qicheng Li | Yong Qin | Fei Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aobo Kong | Wentao Ma | Shiwan Zhao | Yongbin Li | Yuchuan Wu | Ke Wang | Xiaoqian Liu | Qicheng Li | Yong Qin | Fei Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically select key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at https://anonymous.4open.science/r/SDPO-CE8F.
2024
Better Zero-Shot Reasoning with Role-Play Prompting
Aobo Kong | Shiwan Zhao | Hao Chen | Qicheng Li | Yong Qin | Ruiqi Sun | Xin Zhou | Enzhi Wang | Xiaohang Dong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Aobo Kong | Shiwan Zhao | Hao Chen | Qicheng Li | Yong Qin | Ruiqi Sun | Xin Zhou | Enzhi Wang | Xiaohang Dong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Modern large language models (LLMs) exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities. This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs’ reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks. Our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, in experiments conducted using ChatGPT, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Upon further comparison with the Zero-Shot-CoT technique, which prompts the model to “think step by step”, our study demonstrates that role-play prompting acts as a more effective trigger for the CoT process.This highlights its potential to augment the reasoning capabilities of LLMs. We release our code at https://github.com/NKU-HLT/Role-Play-Prompting.
2023
PromptRank: Unsupervised Keyphrase Extraction Using Prompt
Aobo Kong | Shiwan Zhao | Hao Chen | Qicheng Li | Yong Qin | Ruiqi Sun | Xiaoyan Bai
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aobo Kong | Shiwan Zhao | Hao Chen | Qicheng Li | Yong Qin | Ruiqi Sun | Xiaoyan Bai
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The keyphrase extraction task refers to the automatic selection of phrases from a given document to summarize its core content. State-of-the-art (SOTA) performance has recently been achieved by embedding-based algorithms, which rank candidates according to how similar their embeddings are to document embeddings. However, such solutions either struggle with the document and candidate length discrepancies or fail to fully utilize the pre-trained language model (PLM) without further fine-tuning. To this end, in this paper, we propose a simple yet effective unsupervised approach, PromptRank, based on the PLM with an encoder-decoder architecture. Specifically, PromptRank feeds the document into the encoder and calculates the probability of generating the candidate with a designed prompt by the decoder. We extensively evaluate the proposed PromptRank on six widely used benchmarks. PromptRank outperforms the SOTA approach MDERank, improving the F1 score relatively by 34.18%, 24.87%, and 17.57% for 5, 10, and 15 returned results, respectively. This demonstrates the great potential of using prompt for unsupervised keyphrase extraction. We release our code at https://github.com/HLT-NLP/PromptRank.
2014
Search
Fix author
Co-authors
- Aobo Kong 5
- Shiwan Zhao 5
- Qicheng Li 4
- Jiaming Zhou 3
- Hao Chen 2
- Haoqin Sun 2
- Ruiqi Sun 2
- Enzhi Wang 2
- Hui Wang 2
- Xiaoyan Bai 1
- Junyang Chen 1
- Liwei Chen 1
- Xiaohang Dong 1
- Yansong Feng 1
- Yujie Guo 1
- Jiabei He 1
- Fei Huang 1
- Songfang Huang 1
- Yuhang Jia 1
- Jinyu Li 1
- Yongbin Li 1
- Yonghua Lin 1
- Cheng Liu 1
- Shujie Liu 1
- Xiaoqian Liu 1
- Yan Lu 1
- Wentao Ma 1
- Ke Wang 1
- Shiyao Wang 1
- Yequan Wang 1
- Yuchuan Wu 1
- Xi Yang 1
- Yifan Yang 1
- Yanzhe Zhang 1
- Dongyan Zhao 1
- Jinghua Zhao 1
- Xin Zhou 1