Han Zhu
2026
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Han Zhu | Wei Kang | Liyong Guo | Zengwei Yao | Fangjun Kuang | Weiji Zhuang | Zhaoqing Li | Zhifeng Han | Dong Zhang | Xin Zhang | Xingchen Song | Lingxuan Ye | Long Lin | Daniel Povey
Findings of the Association for Computational Linguistics: ACL 2026
Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made progress, they often suffer from high inference latency and stability issues. To overcome these limitations, we propose ZipVoice-Dialog, a non-autoregressive (NAR) zero-shot spoken dialogue generation model based on flow-matching. Observing that applying vanilla flow-matching to dialogue generation leads to poor speech intelligibility and turn-taking precision, we introduce two simple yet effective methods to adapt flow-matching architectures for dialogue generation: (1) a curriculum learning strategy to ensure robust speech-text alignment, and (2) speaker-turn embeddings to govern precise speaker turn-taking. Additionally, we introduce dedicated strategies to support stereo dialogue generation. Recognizing the lack of training datasets in this field, we curate and release OpenDialog, the first large-scale (6.8k hours) open-source spoken dialogue dataset derived from in-the-wild speech data. Moreover, for fair and rigorous evaluations, we establish a benchmark to comprehensively evaluate dialogue generation models. Experiments demonstrate the effectiveness of the proposed methods and dataset, showing that ZipVoice-Dialog achieves superior performance in inference speed, intelligibility, speaker turn-taking accuracy, and speaker similarity. Our code, model checkpoints, and the OpenDialog dataset are publicly available.
Benchmarking Fine-Grained Error Detection in Multimodal Reasoning
Chi-Min Chan | Han Zhu | Chunyang Jiang | Jiaming Ji | Juntao Dai | Wei Xue | Sirui Han | Yike Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Process Reward Models (MPRMs) have emerged as a pivotal framework for enhancing the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, the research community currently lacks a dedicated benchmark to rigorously assess the error discernment capabilities of these models. To address this gap, we introduce PRMBench-V, a novel benchmark specifically designed to evaluate MPRMs’ proficiency in detecting erroneous reasoning steps across diverse error categories. Leveraging a semi-automated annotation pipeline augmented with human verification, we construct a comprehensive dataset comprising 907 unique queries, each annotated with nine distinct error types, resulting in 8,163 test cases with fine-grained step-level error labels. Through extensive experiments involving over 15 open- and closed-source models, we uncover several key findings: (1) even the strongest existing MPRMs achieve only ~30% accuracy in error identification; (2) while partial error detection achieves moderate precision and recall (~60%), overall accuracy remains low (~20%); and (3) benchmark scores exhibit a strong correlation with downstream task performance gains (r=0.86). Furthermore, we demonstrate that PRMBench-V can inform the development of more robust MPRMs: by introducing the Bayesian Rater Reliability Process Reward Model (BR2-PRM), we achieve up to a 4.8% performance improvement through test-time scaling. We believe that PRMBench-V will serve as a valuable resource for advancing MPRM research, enabling more rigorous evaluation and fostering the development of models with fine-grained multimodal reasoning capabilities.
SafeMT: Multi-turn Safety for Multimodal Language Models
Han Zhu | Juntao Dai | Jiaming Ji | Haoran Li | Chengkun Cai | Pengcheng Wen | Chi-Min Chan | Boyuan Chen | Yaodong Yang | Sirui Han | Yike Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the widespread use of multimodal large language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose the Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn Attack Success Rate (ASR) than existing guard models.
BCL: Bayesian In-Context Learning Framework for Information Extraction
Haoliang Liu | Chengkun Cai | Xu Zhao | Han Zhu | Shizhou Huang | Xinglin Zhang | Tao Chen | Jenq-Neng Hwang | Zhang Huaping | Lei Li
Findings of the Association for Computational Linguistics: ACL 2026
Existing information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. Building on this, we propose BCL-IE (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps (initialization, observation, weight update, and resampling), BCL-IE generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial improvements over existing approaches (up to 30%), achieving prior performance while other methods either fail to generalize or show limited effectiveness.
2025
SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao | Han Zhu | Jiaming Ji | Qichao Sun | Zhenghao Zhu | Wu Yinyu | Josef Dai | Yaodong Yang | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning
Weijie Shi | Han Zhu | Jiaming Ji | Mengze Li | Jipeng Zhang | Ruiyuan Zhang | Jia Zhu | Jiajie Xu | Sirui Han | Yike Guo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step’s logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at https://huggingface.co/datasets/weijiezz/LegalHK.
Co-authors
- Yike Guo 4
- Sirui Han 4
- Jiaming Ji 4
- Chengkun Cai 2
- Chi-Min Chan 2
- Juntao Dai 2
- Yaodong Yang (杨耀东) 2
- Chuxue Cao 1
- Boyuan Chen (陈博远) 1
- Tao Chen 1
- Josef Dai 1
- Liyong Guo 1
- Zhifeng Han 1
- Shizhou Huang 1
- Zhang Huaping 1
- Jenq-Neng Hwang 1
- Chunyang Jiang 1
- Wei Kang 1
- Fangjun Kuang 1
- Mengze Li 1
- Zhaoqing Li 1
- Haoran Li 1
- Lei Li 1
- Long Lin 1
- Haoliang Liu 1
- Daniel Povey 1
- Weijie Shi 1
- Xingchen Song 1
- Qichao Sun 1
- Pengcheng Wen 1
- Jiajie Xu 1
- Wei Xue 1
- Zengwei Yao 1
- Lingxuan Ye 1
- Wu Yinyu 1
- Jipeng Zhang 1
- Ruiyuan Zhang 1
- Dong Zhang 1
- Xin Zhang 1
- Xinglin Zhang 1
- Xu Zhao 1
- Zhenghao Zhu 1
- Jia Zhu 1
- Weiji Zhuang 1