Xinye Li
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce ScEdit (Script-based Knowledge Editing Benchmark), a novel script-based benchmark that encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based (“What”-type question) evaluation to action-based (“How”-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating that script-based editing remains a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
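The snippet below is a minimal, illustrative sketch of how a script-based editing item with both a fact-based ("What") probe and an action-based ("How") probe might be represented, together with a toy token-level check. The field names and the metric are assumptions made for exposition, not the actual ScEdit format, which is defined in the repository linked above.

```python
# Hypothetical layout of one script-based KE item; see the ScEdit repo for the real format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScriptEditItem:
    # Counterfactual or temporal fact edit, e.g. "The Eiffel Tower is located in Rome."
    edit: str
    # Fact-based ("What"-type) probe and its expected post-edit answer.
    what_probe: str
    what_answer: str
    # Action-based ("How"-type) probe: a script whose steps should reflect the edit.
    how_probe: str
    script_steps: List[str] = field(default_factory=list)


def token_level_score(prediction: str, answer: str) -> float:
    """Toy token-level metric: fraction of answer tokens present in the prediction."""
    answer_tokens = answer.lower().split()
    predicted = set(prediction.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in predicted for t in answer_tokens) / len(answer_tokens)


if __name__ == "__main__":
    item = ScriptEditItem(
        edit="The Eiffel Tower is located in Rome.",
        what_probe="In which city is the Eiffel Tower located?",
        what_answer="Rome",
        how_probe="How would you plan a day trip to visit the Eiffel Tower?",
        script_steps=["Book a train to Rome", "Walk to the tower", "Buy a ticket"],
    )
    print(token_level_score("The Eiffel Tower is in Rome", item.what_answer))  # 1.0
```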
Deductive and inductive reasoning are fundamental components of human cognition, and in daily life, people often apply these types of reasoning unconsciously. While previous studies have extensively examined the deductive and inductive reasoning abilities of Large Language Models (LLMs) in rule-based and math-related tasks, little attention has been given to their role in procedural planning, an area of considerable relevance for real-world applications. To fill this gap, we present DIRPP (Deductive and Inductive Reasoning in Procedural Planning), a benchmark designed to assess the deductive and inductive reasoning abilities of various LLMs within the context of procedural planning. Based on the benchmark, we first observe that LLMs demonstrate excellent deductive reasoning capabilities in procedural planning but show suboptimal performance in inductive reasoning. To enhance their inductive reasoning abilities, we further propose a novel and effective method called IMSE (Induction through Multiple Similar Examples), which enables LLMs to generate multiple similar procedural plans and then perform inductive reasoning based on these examples. Through various experiments, we find that the proposed method can significantly improve the inductive reasoning capabilities of LLMs.
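As a rough illustration of the IMSE idea described above (not the authors' implementation), the sketch below first prompts a model for several similar procedural plans and then asks it to induce the procedure they share. The `generate` callable is a hypothetical stand-in for any LLM completion API, and the prompt wording is illustrative only.

```python
# Sketch of induction through multiple similar examples, under the assumptions above.
from typing import Callable, List


def imse_induce(task: str, generate: Callable[[str], str], n_examples: int = 3) -> str:
    # Step 1: elicit several similar procedural plans for related task variants.
    plans: List[str] = []
    for i in range(n_examples):
        prompt = (
            f"Write a step-by-step plan for a task similar to: '{task}'. "
            f"Variant {i + 1}. Use numbered steps."
        )
        plans.append(generate(prompt))

    # Step 2: ask the model to induce the general pattern shared by the concrete plans.
    joined = "\n\n".join(f"Plan {i + 1}:\n{p}" for i, p in enumerate(plans))
    induction_prompt = (
        "Here are several similar procedural plans:\n\n"
        f"{joined}\n\n"
        "Induce the general procedure they share, as an abstract numbered plan."
    )
    return generate(induction_prompt)


if __name__ == "__main__":
    # Dummy generator so the sketch runs without an API key.
    dummy = lambda prompt: "1. Prepare materials\n2. Execute the task\n3. Review the result"
    print(imse_induce("bake a loaf of bread", dummy))
```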
We present Team asdfo123’s submission to the LLM-SR shared task at XLLM@ACL 2025, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement–evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro-F1 scores on par with substantially more complex and resource-intensive pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at https://github.com/asdfo123/LLMSR-asdfo123.
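The following sketch shows what a regex-based post-processor of this kind might look like: it pulls statement–evidence–verdict triples out of free-form model output, normalises whitespace in each span, and emits JSON. The tag names and JSON keys are assumptions rather than the official LLM-SR schema, which is specified by the shared task and in the linked repository.

```python
# Illustrative regex post-processor; tags and keys are hypothetical, not the official schema.
import json
import re
from typing import Dict, List

STEP_PATTERN = re.compile(
    r"Statement:\s*(?P<statement>.+?)\s*"
    r"Evidence:\s*(?P<evidence>.+?)\s*"
    r"Verdict:\s*(?P<verdict>True|False)",
    re.IGNORECASE | re.DOTALL,
)


def normalise_output(raw: str) -> List[Dict[str, str]]:
    """Extract statement-evidence-verdict triples and collapse stray whitespace."""
    steps = []
    for match in STEP_PATTERN.finditer(raw):
        steps.append(
            {
                "statement": " ".join(match["statement"].split()),
                "evidence": " ".join(match["evidence"].split()),
                "verdict": match["verdict"].capitalize(),
            }
        )
    return steps


if __name__ == "__main__":
    raw = (
        "Statement: x + 2 = 5 implies x = 3. "
        "Evidence: subtracting 2 from both sides. Verdict: true"
    )
    print(json.dumps(normalise_output(raw), indent=2))
```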