Yunhe Pang

2026

HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions
Zhiyu Shen | Jiyuan Liu | Yunhe Pang | Yanghui Rao | Fu Lee Wang | Jianxing Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-Hop Question Answering (MHQA) is crucial for evaluating a model’s capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced question answering models, especially in domains with scarce resources.


Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review–rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models’ ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM–human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement with human judgment. Experiments reveal that even the strongest model (GPT-5) achieves only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding.

2025


CARE: A Disagreement Detection Framework with Concept Alignment and Reasoning Enhancement
Jiyuan Liu | Jielin Song | Yunhe Pang | Zhiyu Shen | Yanghui Rao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Disagreement detection is a crucial task in natural language processing (NLP), particularly in analyzing online discussions and social media content. Large language models (LLMs) have demonstrated significant advancements across various NLP tasks. However, the performance of LLMs in disagreement detection is limited by two issues: a *conceptual gap* and a *reasoning gap*. In this paper, we propose a novel two-stage framework, Concept Alignment and Reasoning Enhancement (CARE), to tackle these issues. The first stage, Concept Alignment, addresses the gap between expert and model by performing **sub-concept taxonomy extraction**, aligning the model’s comprehension with that of human experts. The second stage, Reasoning Enhancement, improves the model’s reasoning capabilities by introducing a curriculum learning workflow, which includes **rationale to critique** and **counterfactual to detection** for reducing spurious associations. Extensive experiments on the disagreement detection task demonstrate the effectiveness of our framework, showing superior performance in zero-shot and supervised learning settings, both within and across domains.


CoE: A Clue of Emotion Framework for Emotion Recognition in Conversations
Zhiyu Shen | Yunhe Pang | Yanghui Rao | Jianxing Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Emotion Recognition in Conversations (ERC) is crucial for machines to understand dynamic human emotions. While Large Language Models (LLMs) show promise, their performance is often limited by challenges in interpreting complex conversational streams. We introduce a Clue of Emotion (CoE) framework, which progressively integrates key conversational clues to enhance the ERC task. Building on CoE, we implement a multi-stage auxiliary learning strategy that incorporates role-playing, speaker identification, and emotion reasoning tasks, each targeting different aspects of conversational emotion understanding and enhancing the model’s ability to interpret emotional contexts. Our experiments on EmoryNLP, MELD, and IEMOCAP demonstrate that CoE consistently outperforms state-of-the-art methods, achieving a 2.92% improvement on EmoryNLP. These results underscore the effectiveness of clues and multi-stage auxiliary learning for ERC, offering valuable insights for future research.

2022


Learnable Dependency-based Double Graph Structure for Aspect-based Sentiment Analysis
Yinglong Ma | Yunhe Pang
Proceedings of the 29th International Conference on Computational Linguistics

Dependency tree-based methods can be sensitive to the quality of the dependency tree because they inevitably introduce noisy information and neglect the rich relation information between words. In this paper, we propose a learnable dependency-based double graph (LD2G) model for aspect-based sentiment classification (ASC). We use multi-task learning for domain-adaptive pretraining, which combines Biaffine Attention and a Masked Language Model by incorporating features such as structure, relations, and linguistic properties of the sentiment text. We then use a dependency-enhanced double graph-based MPNN to deeply fuse the structure features and relation features, which affect each other, for ASC. Experiments on four benchmark datasets show that our model is superior to state-of-the-art approaches.

Co-authors

Lei Hou 1
