Zhaoqing Li
2026
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Han Zhu | Wei Kang | Liyong Guo | Zengwei Yao | Fangjun Kuang | Weiji Zhuang | Zhaoqing Li | Zhifeng Han | Dong Zhang | Xin Zhang | Xingchen Song | Lingxuan Ye | Long Lin | Daniel Povey
Findings of the Association for Computational Linguistics: ACL 2026
Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made progress, they often suffer from high inference latency and stability issues. To overcome these limitations, we propose ZipVoice-Dialog, a non-autoregressive (NAR) zero-shot spoken dialogue generation model based on flow-matching. Observing that applying vanilla flow-matching to dialogue generation leads to poor speech intelligibility and turn-taking precision, we introduce two simple yet effective methods to adapt flow-matching architectures for dialogue generation: (1) a curriculum learning strategy to ensure robust speech-text alignment, and (2) speaker-turn embeddings to govern precise speaker turn-taking. Additionally, we introduce dedicated strategies to support stereo dialogue generation. Recognizing the lack of training datasets in this field, we curate and release OpenDialog, the first large-scale (6.8k hours) open-source spoken dialogue dataset derived from in-the-wild speech data. Moreover, for fair and rigorous evaluations, we establish a benchmark to comprehensively evaluate dialogue generation models. Experiments demonstrate the effectiveness of the proposed methods and dataset, showing that ZipVoice-Dialog achieves superior performance in inference speed, intelligibility, speaker turn-taking accuracy, and speaker similarity. Our code, model checkpoints, and the OpenDialog dataset are publicly available.
2024
DUTIR938 at SemEval-2024 Task 4: Semi-Supervised Learning and Model Ensemble for Persuasion Techniques Detection in Memes
Erchen Yu | Junlong Wang | Xuening Qiao | Jiewei Qi | Zhaoqing Li | Hongfei Lin | Linlin Zong | Bo Xu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
The development of social platforms has facilitated the proliferation of disinformation, with memes becoming one of the most popular types of propaganda for disseminating disinformation on the internet. Effectively detecting the persuasion techniques hidden within memes is helpful in understanding user-generated content and further promoting the detection of disinformation on the internet. This paper describes the approach proposed by Team DUTIR938 in Subtask 2b of SemEval-2024 Task 4. We propose a dual-channel model based on semi-supervised learning and model ensemble. We utilize CLIP to extract image features, and employ various pretrained language models under task-adaptive pretraining for text feature extraction. To enhance the detection and generalization capabilities of the model, we implement sample data augmentation using semi-supervised pseudo-labeling methods, introduce adversarial training strategies, and design a two-stage global model ensemble strategy. Our proposed method surpasses the provided baseline method, with Macro/Micro F1 values of 0.80910/0.83667 on the English leaderboard. Our submission ranks 3rd/19 in terms of Macro F1 and 1st/19 in terms of Micro F1.