Jianwu Dang
2026
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
Chunyu Qiang | Xiaopeng Wang | Kang Yin | Yuzhe Liang | Yuxin Guo | Teng Ma | Ziyu Zhang | Tianrui Wang | Cheng Gong | Yushen Chen | Ruibo Fu | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines.
2018
Interaction-Aware Topic Model for Microblog Conversations through Network Embedding and User Attention
Ruifang He | Xuefei Zhang | Di Jin | Longbiao Wang | Jianwu Dang | Xiangang Li
Proceedings of the 27th International Conference on Computational Linguistics
Traditional topic models are insufficient for topic extraction in social media. Existing methods either consider only text information or jointly model the posts and the static characteristics of social media. They ignore that a user discusses diverse topics when dynamically interacting with different people. Moreover, people who talk about the same topic have different effects on the topic. In this paper, we propose an Interaction-Aware Topic Model (IATM) for microblog conversations by integrating network embedding and user attention. A conversation network linking users based on reposting and replying relationships is constructed to mine dynamic user behaviours. We model dynamic interactions and user attention so as to learn interaction-aware edge embeddings with social context. These embeddings are then incorporated into neural variational inference to generate more consistent topics. Experiments on three real-world datasets show that our proposed model is effective.
Implicit Discourse Relation Recognition using Neural Tensor Network with Interactive Attention and Sparse Learning
Fengyu Guo | Ruifang He | Di Jin | Jianwu Dang | Longbiao Wang | Xiangang Li
Proceedings of the 27th International Conference on Computational Linguistics
Implicit discourse relation recognition aims to understand and annotate the latent relations between two discourse arguments, such as temporal and comparison relations. Most previous methods encode the two discourse arguments separately, and those that consider pair-specific clues ignore the bidirectional interactions between the two arguments and the sparsity of pair patterns. In this paper, we propose a novel neural Tensor network framework with Interactive Attention and Sparse Learning (TIASL) for implicit discourse relation recognition. (1) We mine the most correlated word pairs from the two discourse arguments to model pair-specific clues, and integrate them as interactive attention into the argument representations produced by a bidirectional long short-term memory network. Meanwhile, (2) we propose a neural tensor network with a sparse constraint to explore deeper and more important pair patterns so as to fully recognize discourse relations. The experimental results on PDTB show that our proposed TIASL framework is effective.
Co-authors
- Longbiao Wang 4
- Cheng Gong 2
- Ruifang He 2
- Di Jin 2
- Xiangang Li 2
- Chunyu Qiang 2
- Tianrui Wang 2
- Yi-Wen Chao 1
- Xie Chen 1
- Yushen Chen 1
- Eng Siong Chng 1
- Ruibo Fu 1
- Meng Ge 1
- Fengyu Guo 1
- Yuxin Guo 1
- Hu Haifeng 1
- Nana Hou 1
- Zikang Huang 1
- Yu Jiang 1
- Xuanchen Li 1
- Yuzhe Liang 1
- Hexin Liu 1
- Tianchi Liu 1
- Yuheng Lu 1
- Teng Ma 1
- Ziyang Ma 1
- Zhikang Niu 1
- Yizhou Peng 1
- Zhongqian Sun 1
- Haoyu Wang 1
- Junyu Wang 1
- Xiaobao Wang 1
- Xiaopeng Wang 1
- Yang Wei 1
- Yihao Wu 1
- Guanrou Yang 1
- Yifan Yang 1
- Kang Yin 1
- Fuming You 1
- Xuefei Zhang 1
- Ziyu Zhang 1