Nan Yan
2025
TripleFact: Defending Data Contamination in the Evaluation of LLM-driven Fake News Detection
Cheng Xu
|
Nan Yan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The proliferation of large language models (LLMs) has introduced unprecedented challenges in fake news detection due to benchmark data contamination (BDC), where evaluation benchmarks are inadvertently memorized during pre-training, leading to inflated performance metrics. Traditional evaluation paradigms, reliant on static datasets and closed-world assumptions, fail to account for the BDC risk inherent in the large-scale pre-training of current LLMs. This paper introduces TripleFact, a novel evaluation framework for the fake news detection task, designed to mitigate BDC risk while prioritizing real-world applicability. TripleFact integrates three components: (1) Human-Adversarial Preference Testing (HAPT) to assess robustness against human-crafted misinformation, (2) Real-Time Web Agent with Asynchronous Validation (RTW-AV) to evaluate temporal generalization using dynamically sourced claims, and (3) Entity-Controlled Virtual Environment (ECVE) to eliminate entity-specific biases. Through experiments on 17 state-of-the-art LLMs, including GPT, LLaMA, and DeepSeek variants, TripleFact demonstrates superior contamination resistance compared to traditional benchmarks. Results reveal that BDC artificially inflates performance by up to 23% in conventional evaluations, while the TripleFact Score (TFS) remains stable within a 4% absolute error under controlled contamination. The framework's ability to disentangle genuine detection capabilities from memorization artifacts underscores its potential as a fake news detection benchmark for the LLM era.
2024
Advancing Arabic Sentiment Analysis: ArSen Benchmark and the Improved Fuzzy Deep Hybrid Network
Yang Fang
|
Cheng Xu
|
Shuhao Guan
|
Nan Yan
|
Yuke Mei
Proceedings of the 28th Conference on Computational Natural Language Learning
Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags behind, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN on the ArSen dataset and highlight future research directions in ASA.