Guoshuai Zhao

2026

Large language models exhibit significant potential for psychological support, yet they often generate fragmented and emotionally inconsistent dialogues that lack the therapeutic structure necessary for reliable assessment. To address these issues, we introduce **VeilEval**, a clinically grounded and privacy-preserving benchmark equipped with interpretable metrics for evaluating multi-turn psychological dialogues. Furthermore, we propose Emotion-Resonance (**EmoRes**), a multi-agent framework that boosts psychological reasoning via a Topic-Mining Emotional Agent and a multi-perspective Self-Reflection Agent, thereby jointly improving topic continuity, emotional coherence, and clinical interpretability. Experiments demonstrate that EmoRes achieves up to ∼3× improvement over strong baselines on VeilEval, with its effectiveness further validated by ablation studies and human evaluations.

Duplicate-Aware Controlled Code Generation: Enhancing Copyright Protection with Targeted Reordering Beam Search in LLMs
Junbo Fu | Guoshuai Zhao | Linkang Yang | Yunqi Mi | Xueming Qian
Findings of the Association for Computational Linguistics: ACL 2026

The increasing integration of large language models (LLMs) in code generation has raised critical copyright concerns, particularly regarding the verbatim repetition of copyrighted code. To address this challenge, we propose a novel task: Duplicate-Aware Controlled Code Generation (DACCG), which aims to mitigate verbatim repetition while preserving the quality of generated code. To this end, we introduce Targeted Reordering Beam Search (TRBS), a plug-and-play decoding method that dynamically reorders beam candidates to reduce direct copying. TRBS leverages the FM-index for efficient substring detection and employs a spike-entropy-based protection mechanism to safeguard structural anchors critical to code coherence. Experimental results on a multi-language code generation benchmark demonstrate that TRBS effectively reduces verbatim repetition while maintaining functional adequacy. Our research represents a pioneering effort in code copyright protection from the model user’s perspective, offering novel insights into responsible code generation practices.
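
The reordering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' TRBS implementation: the names `longest_verbatim_overlap`, `reorder_beam`, and `max_copy` are hypothetical, the substring lookup is a naive scan standing in for the FM-index mentioned above, and the spike-entropy protection of structural anchors is omitted entirely.

```python
from typing import List, Tuple


def longest_verbatim_overlap(candidate: str, protected: List[str]) -> int:
    """Length of the longest substring of `candidate` that occurs verbatim
    in any protected snippet.

    Naive scan for illustration only (assumption); the abstract states that
    TRBS uses an FM-index for efficient substring detection.
    """
    best = 0
    n = len(candidate)
    for start in range(n):
        # Only look for matches strictly longer than the best one so far.
        end = start + best + 1
        while end <= n and any(candidate[start:end] in doc for doc in protected):
            best = end - start
            end += 1
    return best


def reorder_beam(
    beam: List[Tuple[str, float]],
    protected: List[str],
    max_copy: int = 20,
) -> List[Tuple[str, float]]:
    """Move candidates whose verbatim overlap exceeds `max_copy` characters
    behind the others; within each group, keep higher scores first.

    `beam` holds (text, log-probability) pairs. The threshold `max_copy`
    is a hypothetical parameter, not a value taken from the paper.
    """
    def sort_key(item: Tuple[str, float]) -> Tuple[bool, float]:
        text, score = item
        violates = longest_verbatim_overlap(text, protected) > max_copy
        return (violates, -score)

    return sorted(beam, key=sort_key)
```

In a real decoder the check would run at each step on partial hypotheses rather than on finished strings; the sketch only shows the demotion of candidates that copy long spans.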

Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. The complexity of user queries, coupled with potential knowledge deficiencies in Large Language Models (LLMs), gives rise to two pivotal challenges that underpin the performance on this task: the correct identification of the reasoning path and the accurate retrieval of essential knowledge. Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.

Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While large language models have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, which are insufficient to support the knowledge demands of diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning traces by leveraging knowledge from web search, SOAP-formatted cases, and a clinical case database. It then integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments demonstrate the effectiveness of our approach.
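
The matching and voting step described in this abstract can be pictured with a small sketch. The abstract does not specify how MultiDx matches or counts diagnoses, so everything here is an assumption: the function name `vote_diagnoses`, the source labels, and the use of case-insensitive string equality as the matching rule are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def vote_diagnoses(suspected: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank diagnoses by how many knowledge sources proposed them.

    `suspected` maps a source label to the diagnoses it suggested.
    Matching is plain normalized string equality (assumption); a real
    system would need clinical synonym handling.
    """
    votes: Counter = Counter()
    for diagnoses in suspected.values():
        # Each source contributes at most one vote per distinct diagnosis.
        for dx in {d.strip().lower() for d in diagnoses}:
            votes[dx] += 1
    # Most votes first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

The ranked list would then feed the differential diagnosis stage, which the sketch does not attempt to model.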

Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often design time-sensitive reasoning pipelines that rely on external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions often require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for more complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context and task requirements. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments on two temporal QA benchmarks demonstrate the effectiveness of our approach.

2025

Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a **M**ulti-**E**xpert **S**tructural-**S**emantic **H**ybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.

2024

Pseudo-Label Enhanced Prototypical Contrastive Learning for Uniformed Intent Discovery
Yimin Deng | Yuxia Wu | Guoshuai Zhao | Li Zhu | Xueming Qian
Findings of the Association for Computational Linguistics: EMNLP 2024

New intent discovery is a crucial capability for task-oriented dialogue systems. Existing methods focus on transferring in-domain (IND) prior knowledge to out-of-domain (OOD) data through pre-training and clustering stages. They either handle the two processes in a pipeline manner, which leaves a gap between intent representation and the clustering process, or use typical contrastive clustering that overlooks the potential supervised signals from the whole data. Besides, they often deal with either open intent discovery or OOD settings individually. To this end, we propose a Pseudo-Label enhanced Prototypical Contrastive Learning (PLPCL) model for uniformed intent discovery. We iteratively utilize pseudo-labels to explore potential positive/negative samples for contrastive learning and bridge the gap between representation and clustering. To enable better knowledge transfer, we design a prototype learning method integrating the supervised and pseudo signals from IND and OOD samples. In addition, our method has been proven effective in two different settings of discovering new intents. Experiments on three benchmark datasets and two task settings demonstrate the effectiveness of our approach.

Learning to Paraphrase for Alignment with LLM Preference
Junbo Fu | Guoshuai Zhao | Yimin Deng | Yunqi Mi | Xueming Qian
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) exhibit the issue of paraphrase divergence: when a question is phrased in a slightly different but semantically similar way, an LLM may output a wrong response despite being able to answer the original question correctly. Previous research has regarded this issue as a problem of the model’s robustness to question paraphrase and proposed a retraining method to address it. However, retraining faces challenges in meeting the computational costs and privacy security demands of LLMs. In this paper, we regard this issue as a problem of alignment with model preferences and propose PEARL (Preference-drivEn pAraphRase Learning). This is a black-box method that enhances model performance by paraphrasing questions in expressions preferred by the model. We validate PEARL across six datasets spanning three tasks: open-domain QA, commonsense reasoning, and math word problems. Extensive experiments demonstrate not only the outstanding performance but also the composability, transferability, and immense potential of PEARL, shedding new light on the black-box tuning of LLMs.