PIPER: Benchmarking and Prompting Event Reasoning Boundary of LLMs via Debiasing-Distillation Enhanced Tuning
Zhicong Lu, Changyuan Tian, Peiguang Li, Li Jin, Sirui Wang, Wei Jia, Ying Shen, Guangluan Xu
Abstract
While Large Language Models (LLMs) excel in diverse domains, their validity in event reasoning remains underexplored. Most existing work stops at assessing LLMs’ event reasoning on a single event relation type or reasoning format, failing to conduct a complete evaluation or to offer a practical path to capability enhancement. In this paper, we propose PIPER, the first comprehensive benchmark for Probing Into the Performance boundary of LLMs in Event Reasoning. Motivated by our evaluation observations and error-pattern analysis, we meticulously craft 10K diverse instruction-tuning demonstrations to alleviate the scarcity of event reasoning-oriented data. Additionally, we present a novel Debiasing and Distillation-Enhanced Supervised Fine-Tuning (D2E-SFT) strategy, which encourages adherence to the context and attention to significant contextual event information, thereby elevating event reasoning capability. Specifically, D2E-SFT removes the given sample’s context to construct an imagined sample and subtracts its logits, mitigating the bias of neglecting context and improving contextual faithfulness. To guide the model in emphasizing significant contextual event information, D2E-SFT employs a context-refined sample to achieve self-distillation through the alignment of logits. Extensive experimental results demonstrate the effectiveness of our data and strategy in expanding the performance boundary of event reasoning.
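The abstract names two logit-level mechanisms: subtracting the logits of a context-free "imagined" sample to debias, and aligning logits with a context-refined sample for self-distillation. The snippet below is a minimal PyTorch sketch of how such a combined objective could look, not the authors' released implementation: the function name `d2e_sft_loss`, the weights `alpha` and `beta`, the temperature `tau`, and the assumption that all three input views are padded so their logits align position-wise over the answer span are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def d2e_sft_loss(model, full_batch, imagined_batch, refined_batch,
                 labels, alpha=1.0, beta=0.5, tau=2.0):
    """Illustrative sketch of a D2E-SFT-style objective.

    Assumptions (not from the paper): `model` is a Hugging Face-style
    causal LM returning `.logits`; the three views are padded so logits
    compare position-wise over the answer span; `labels` are already
    shifted for next-token prediction, with -100 on non-answer tokens.
    """
    # Logits for the original (context + question) sample.
    logits_full = model(**full_batch).logits

    # Debiasing: subtract logits of the context-free "imagined" sample,
    # down-weighting predictions the model can make without the context.
    with torch.no_grad():
        logits_imagined = model(**imagined_batch).logits
    debiased = logits_full - alpha * logits_imagined

    # Standard cross-entropy on the debiased logits.
    ce = F.cross_entropy(debiased.view(-1, debiased.size(-1)),
                         labels.view(-1), ignore_index=-100)

    # Self-distillation: align full-context logits with those of the
    # context-refined view, emphasizing salient contextual events.
    with torch.no_grad():
        logits_refined = model(**refined_batch).logits
    kd = F.kl_div(F.log_softmax(logits_full / tau, dim=-1),
                  F.softmax(logits_refined / tau, dim=-1),
                  reduction="batchmean") * tau ** 2

    return ce + beta * kd
```

Under these assumptions, the cross-entropy term realizes the contextual-faithfulness debiasing and the KL term realizes the logit-alignment self-distillation described in the abstract; `beta` trades off the two.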
- Anthology ID:
- 2025.acl-long.1389
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 28591–28613
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1389/
- Cite (ACL):
- Zhicong Lu, Changyuan Tian, Peiguang Li, Li Jin, Sirui Wang, Wei Jia, Ying Shen, and Guangluan Xu. 2025. PIPER: Benchmarking and Prompting Event Reasoning Boundary of LLMs via Debiasing-Distillation Enhanced Tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28591–28613, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- PIPER: Benchmarking and Prompting Event Reasoning Boundary of LLMs via Debiasing-Distillation Enhanced Tuning (Lu et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1389.pdf