Qiming Xie
2026
LoReFact: Bridging the Logic Gap in Fact-Checking
Qiming Xie | Wenjie Zheng | Xiangqing Shen | Rui Xia
Findings of the Association for Computational Linguistics: ACL 2026
The rise of social media and generative AI has led to a surge of misinformation online, making reliable fact-checking increasingly critical. Most existing fact-checking research adheres to the decompose-then-verify paradigm, emphasizing verification of individual facts while overlooking the validity of logical dependencies among them. As a result, text containing logical errors may still be misjudged as factual. Moreover, existing datasets and metrics focus on fact completeness and coverage, failing to capture the logical dimension. To help bridge this gap, we propose a content–logic coupled factuality evaluation paradigm, which conceptualizes factuality along two complementary dimensions: content factuality and logic factuality. Under this paradigm, we introduce a holistic solution consisting of LoReFact, the first long-form fact-checking dataset that systematically incorporates the logical dimension; LoRe-Factcheck, a simple yet effective framework for joint content–logic evaluation; and a logic-aware metric named LoReFactScore for exposing and penalizing logical fallacies. Experiments demonstrate the importance of logical factuality and the effectiveness of our proposed paradigm for fact-checking. Our data and code are publicly available at https://github.com/NUSTM/LoReFact.
2024
Ask Again, Then Fail: Large Language Models’ Vacillations in Judgment
Qiming Xie | Zengzhi Wang | Yi Feng | Rui Xia
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We observe that current large language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a Follow-up Questioning Mechanism along with two metrics to quantify this inconsistency, confirming its widespread presence in current large language models. Furthermore, to mitigate this issue, we explore various prompting strategies for closed-source models and develop a training-based framework, Unwavering-FQ, that teaches large language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of large language models.