Lili Zhao


2025

Following Occam’s Razor: Dynamic Combination of Structured Knowledge for Multi-Hop Question Answering using LLMs
Wei Chen | Zhi Zheng | Lili Zhao | Huijun Hou | Tong Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Multi-hop question answering is a challenging task that requires capturing information from different positions across multiple documents. Recently, several methods have proposed enhancing Large Language Models (LLMs) with structured knowledge, aiming to grasp the key information needed to solve this task. Despite certain achievements, they still face the following challenges: 1) neglect of text-based reasoning capabilities; 2) information redundancy between text and triples; 3) information loss during structured knowledge extraction. To address these challenges, we propose Dynamic Combination of Structured Knowledge (DCSK), a novel framework that integrates the text-based and triple-based paradigms. Following Occam’s Razor, DCSK dynamically determines the necessity of structured knowledge through a multi-faceted evaluation that systematically assesses the correctness, clarity, and informativeness of the text-based prediction. For questions that require structured knowledge, we develop an iterative fact refiner that screens for question-relevant triples and verifies their factual adequacy, thereby effectively excluding irrelevant and redundant information. Furthermore, based on this verification, we construct an adaptive knowledge reasoner that dynamically adjusts the need for text supplementation, mitigating the information deficiency of the selected triples. Extensive experiments on three MHQA datasets demonstrate the efficiency and effectiveness of DCSK.
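A minimal Python sketch of the control flow described in this abstract: answer from text first, and fall back to the triple-based path only when a multi-faceted check deems the text-based prediction inadequate. The prompts and the generic `llm` callable are illustrative placeholders, not the authors' released implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # any prompt -> completion function

def dcsk_answer(question: str, documents: list[str], llm: LLM, max_refine: int = 3) -> str:
    """Sketch of a DCSK-style pipeline: text-based prediction first,
    structured (triple-based) reasoning only when needed."""
    text = "\n".join(documents)

    # 1) Text-based prediction (Occam's Razor: try the simpler paradigm first).
    answer = llm(f"Answer the question using the passages.\nQuestion: {question}\nPassages:\n{text}")

    # 2) Multi-faceted evaluation of the text-based prediction
    #    (correctness, clarity, informativeness).
    verdict = llm(
        "Judge the answer for correctness, clarity, and informativeness. "
        f"Reply ADEQUATE or INADEQUATE.\nQuestion: {question}\nAnswer: {answer}"
    )
    if verdict.strip().upper().startswith("ADEQUATE"):
        return answer  # structured knowledge not needed

    # 3) Iterative fact refiner: extract triples, keep question-relevant ones,
    #    and stop once they are judged factually adequate.
    triples = llm(f"Extract (subject, relation, object) triples from:\n{text}")
    for _ in range(max_refine):
        triples = llm(f"Keep only triples relevant to: {question}\n{triples}")
        check = llm(f"Are these triples sufficient to answer '{question}'? Reply YES or NO.\n{triples}")
        if check.strip().upper().startswith("YES"):
            supplement = ""
            break
    else:
        # 4) Adaptive knowledge reasoner: supplement with source text
        #    when the selected triples remain insufficient.
        supplement = text

    return llm(
        f"Answer the question using the triples{' and passages' if supplement else ''}.\n"
        f"Question: {question}\nTriples:\n{triples}\n{supplement}"
    )
```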

2024

Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models
Yu Yuan | Lili Zhao | Kai Zhang | Guangting Zheng | Qi Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs’ performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, which significantly impairs their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially on datasets that contain shortcuts. 5) LLMs generally produce lower-quality explanations on shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating reliance on shortcuts.
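As a rough illustration of how shortcut reliance can be probed, the sketch below compares accuracy on examples where a shortcut feature agrees with the label against examples where it conflicts; the field names (`shortcut_aligned`, `label`) and the gap metric are hypothetical and are not taken from Shortcut Suite itself.

```python
from collections import defaultdict
from typing import Callable

def shortcut_reliance(examples: list[dict], predict: Callable[[str], str]) -> float:
    """Accuracy gap between shortcut-aligned and shortcut-conflicting examples.
    Each example is a dict with 'text', 'label', and 'shortcut_aligned' (bool);
    `predict` maps input text to a predicted label. A large positive gap
    suggests the model leans on the shortcut rather than the task semantics."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        bucket = "aligned" if ex["shortcut_aligned"] else "conflicting"
        total[bucket] += 1
        correct[bucket] += int(predict(ex["text"]) == ex["label"])
    acc = {b: correct[b] / total[b] for b in total if total[b]}
    return acc.get("aligned", 0.0) - acc.get("conflicting", 0.0)
```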

Double-Checker: Large Language Model as a Checker for Few-shot Named Entity Recognition
Wei Chen | Lili Zhao | Zhi Zheng | Tong Xu | Yang Wang | Enhong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Recently, few-shot Named Entity Recognition (NER) has attracted significant attention due to the high cost of obtaining high-quality labeled data. Decomposition-based methods have demonstrated remarkable performance on this task: they first train a type-independent span detector and subsequently classify the detected spans by type. However, this framework has an evident drawback: a domain-agnostic detector cannot ensure that only entity spans specific to the target domain are identified. To address this issue, we propose Double-Checker, which leverages collaboration between Large Language Models (LLMs) and small models. Specifically, we employ LLMs to verify candidate spans predicted by the small model and eliminate any spans that fall outside the scope of the target domain. Extensive experiments validate the effectiveness of our method, consistently yielding improvements over two baseline approaches. Our code is available at https://github.com/fanshu6hao/Double-Checker.
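A minimal sketch of the checking step described above, assuming a generic `llm` callable (prompt in, completion out). The prompt wording and function names are illustrative stand-ins, not the released Double-Checker code; see the repository linked above for the actual implementation.

```python
from typing import Callable

def double_check_spans(sentence: str,
                       candidate_spans: list[tuple[str, str]],
                       llm: Callable[[str], str]) -> list[tuple[str, str]]:
    """Filter candidate (span, entity_type) pairs proposed by a small span
    detector: an LLM verifies each span and drops those that are not valid
    entities of the target-domain type."""
    kept = []
    for span, entity_type in candidate_spans:
        prompt = (f"Sentence: {sentence}\n"
                  f"Is '{span}' a {entity_type} entity in this sentence? Answer YES or NO.")
        if llm(prompt).strip().upper().startswith("YES"):
            kept.append((span, entity_type))
    return kept
```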