Jiaming Zhou

2026

RealTalk-CN: A Realistic Chinese Speech Task-Oriented Dialogue Benchmark with Cross-Modal Analysis
Enzhi Wang | Jiaming Zhou | Yuhang Jia | Aobo Kong | Qicheng Li | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in speech large language models (e.g., GPT-4o) have enabled end-to-end spoken interactions, yet their robustness remains unclear in real-world applications, where systems must assist users in completing specific tasks under complex conditions such as multi-turn, ambiguous, and often spontaneous speech, as well as natural alternation between speech and text. Task-oriented dialogue (TOD) offers a realistic scenario to evaluate whether models can effectively help users accomplish such goals, but existing benchmarks are mainly text-based, and the few speech datasets are limited to English and often neglect spontaneous disfluencies and speaker diversity. To address this gap, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text TOD dataset, containing 5.4K dialogues (60K turns, ~150 hours) of real human-to-human recordings with detailed annotations for dialogue states, disfluency types, and speaker characteristics. Based on this dataset, we propose a cross-modal interaction task supporting dynamic speech-text switching and a comprehensive evaluation protocol assessing robustness to disfluencies, sensitivity to speaker variation, and cross-domain generalization. Experiments on state-of-the-art models demonstrate the challenges posed by RealTalk-CN and establish its value as a benchmark for developing reliable and fair Speech LLMs in real-world deployments. The dataset and evaluation framework are available to encourage further research.

Speech signals convey abundant speaker-related metadata, yet current privacy research predominantly focuses on identity-centric voiceprint protection, leaving sensitive Speaker Attribute Privacy (SAP) largely underexplored. This paper introduces AudioPrivacy, a large-scale Chinese dataset designed to systematically evaluate SAP leakage in realistic, everyday scenarios. Comprising 227.3 hours of audio from 1,000 speakers, it uniquely encompasses four parallel modalities: speech, singing, paralinguistic expressions, and non-vocal acoustic signals (e.g., footsteps). Annotated with 11 diverse attributes, including fine-grained physiological traits often overlooked in traditional corpora, AudioPrivacy enables a granular analysis of acoustic privacy risks. Our evaluations reveal significant leakage across multiple attributes, even when inferred from non-vocal signals. Furthermore, we demonstrate that state-of-the-art Multimodal Large Language Models (MM LLMs) can precisely profile speakers and exacerbate these risks, underscoring the urgent need to rethink privacy-preserving mechanisms in the era of powerful audio foundation models.

The advancement of Multimodal Emotion Recognition (MER) in Chinese is significantly hindered by the scarcity of high-quality, spontaneous dialogue datasets compared to their English counterparts. In this work, we introduce EmotionTalk, the first interactive Chinese multimodal dataset designed to capture the nuance of authentic emotional interplay. Collected from 19 professional actors, the dataset spans 23.6 hours of dyadic conversations across diverse scenarios. A key contribution of EmotionTalk is its multi-grained annotation system, which integrates standard categorical and dimensional labels with fine-grained emotional speaking style captions, enabling research into interpretable emotion analysis. We establish comprehensive benchmarks for emotion recognition and captioning tasks, verifying the dataset’s effectiveness and the necessity of multimodal fusion. EmotionTalk serves as a critical resource for bridging the gap in non-English affective computing and is publicly released for the research community.

Automatic speech recognition (ASR) for children remains challenging due to developmental variability and the scarcity of high-quality corpora, especially for Mandarin and its dialects. In this paper, we present ChildTalk, a large-scale Chinese child speech corpus designed to address this gap. It contains 112.5 hours of speech from 498 children (aged 2–8) and 500 caregivers, recorded as natural child–caregiver conversations. Unlike prior Mandarin child ASR corpora that mainly release isolated utterances, ChildTalk provides full-length dialogues with complete transcriptions, preserving turn-taking and discourse context. To our knowledge, it is the first publicly available Mandarin child speech corpus with full-length dialogues and systematic coverage of standard Mandarin, eight Mandarin dialect subgroups, and two additional dialects (Southern Min and Jin). We benchmark end-to-end models trained from scratch, large pre-trained ASR models fine-tuned on ChildTalk, omni-modal LLMs in a zero-shot setting, and commercial speech transcription APIs. Fine-tuning on ChildTalk consistently improves both in-domain and cross-domain performance. These results indicate that ChildTalk provides a challenging, broad-coverage testbed for Chinese child ASR, dialect robustness, and dialogue-level modeling. The dataset will be made freely available for all academic purposes.

Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. The relevant code, models, and data are publicly available at https://github.com/NKU-HLT/SpeechLLM-as-Judges.

Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding.

2025

Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning demonstrates significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research and holds potential for applications in educational technology and child-computer interaction. It will be open-source and freely available for all academic purposes.