Keyuan Cheng
2025
COMPKE: Complex Question Answering under Knowledge Editing
Keyuan Cheng | Zijian Kan | Zhuoran Zhang | Muhammad Asif Ali | Lijie Hu | Di Wang
Findings of the Association for Computational Linguistics: ACL 2025
Knowledge editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning involving one-to-many relationships or multi-step logical intersections. To fill this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We perform a comprehensive evaluation of four different knowledge editing methods on COMPKE, and our results show that the performance of these methods varies across models. For example, MeLLo achieves an accuracy of 39.47 on GPT-4o-mini but drops significantly to 3.83 on Qwen2.5-3B. We further analyze the reasons behind these results from both methodological and model perspectives. Our dataset will be publicly available on GitHub.
CODEMENV: Benchmarking Large Language Models on Code Migration
Keyuan Cheng | Xudong Shen | Yihao Yang | Tengyue Wang | Yang Cao | Muhammad Asif Ali | Hanbin Wang | Lijie Hu | Di Wang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated remarkable proficiency in handling a wide range of tasks within the software engineering domain, but their ability to perform code migration (adapting code to different environments) remains underexplored. In this work, we propose a novel benchmark, CODEMENV: Code Migration Across Environment, designed to evaluate LLMs' performance in handling code migration tasks. The benchmark comprises 922 data points across 19 Python and Java packages, offering three tasks to systematically evaluate code migration: identifying version-incompatible functions, determining function changes, and adapting code to target environments. Experimental evaluation of CODEMENV across seven LLMs revealed an average pass@1 rate of 26.50%, with GPT-4o performing best at 43.84%. We highlight our key findings as follows: (i) LLMs are more familiar with newer function versions, making them better at migrating legacy code, and (ii) LLMs exhibit a logical inconsistency, sometimes identifying function changes that are irrelevant to the target migration environment.