The proliferation of large language models (LLMs) has introduced unprecedented challenges in fake news detection due to benchmark data contamination (BDC), where evaluation benchmarks are inadvertently memorized during pre-training, leading to inflated performance metrics. Traditional evaluation paradigms, reliant on static datasets and closed-world assumptions, fail to account for the BDC risk inherent in the large-scale pre-training of current LLMs. This paper introduces TripleFact, a novel evaluation framework for the fake news detection task, designed to mitigate BDC risk while prioritizing real-world applicability. TripleFact integrates three components: (1) Human-Adversarial Preference Testing (HAPT) to assess robustness against human-crafted misinformation, (2) Real-Time Web Agent with Asynchronous Validation (RTW-AV) to evaluate temporal generalization using dynamically sourced claims, and (3) Entity-Controlled Virtual Environment (ECVE) to eliminate entity-specific biases. Through experiments on 17 state-of-the-art LLMs, including GPT, LLaMA, and DeepSeek variants, TripleFact demonstrates superior contamination resistance compared to traditional benchmarks. Results reveal that BDC artificially inflates performance by up to 23% in conventional evaluations, while the TripleFact Score (TFS) remains stable within 4% absolute error under controlled contamination. The framework’s ability to disentangle genuine detection capabilities from memorization artifacts underscores its potential as a fake news detection benchmark for the LLM era.
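To make the composite score concrete, the minimal sketch below shows one way a TripleFact Score could be assembled from the three component accuracies. The abstract does not specify the aggregation rule, so the equal-weight average, the class name `TripleFactResult`, and the field names are illustrative assumptions, not the paper's actual formula.

```python
from dataclasses import dataclass


@dataclass
class TripleFactResult:
    hapt: float    # accuracy on human-adversarial claims (HAPT)
    rtw_av: float  # accuracy on dynamically sourced, asynchronously validated claims (RTW-AV)
    ecve: float    # accuracy within the entity-controlled virtual environment (ECVE)

    def tfs(self) -> float:
        # Hypothetical aggregation: unweighted mean of the three component accuracies.
        return (self.hapt + self.rtw_av + self.ecve) / 3.0


# Example: a model that is strong on adversarial claims but weaker on fresh, web-sourced ones.
print(TripleFactResult(hapt=0.71, rtw_av=0.64, ecve=0.68).tfs())
```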
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (the data and label levels), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. In this work, (1) we are the first to formally define semantic contamination for this task, and (2) we introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risk from the semantic level to the label level via an entity-shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B–72B parameters) across four BDC levels, we find that LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lockstep (r ≥ .97 for models ≥ 3B, p < .05; ρ ≥ .9 overall, p < .05). These results show that SSA provides a sensitive and scalable audit of comprehensive BDC risk and paves the way for a higher-integrity evaluation of the LLM-driven fake news detection task.
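As a rough illustration of the entity-shift idea, the sketch below perturbs named entities in each claim and measures the accuracy gap before and after the shift. The `classify` callable, the `ENTITY_SWAPS` map, and the definition of the SSA Factor as a simple accuracy difference are assumptions made for exposition; they are not the paper's exact procedure.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical substitution map; a real audit would use type-consistent entities
# drawn from outside the benchmark.
ENTITY_SWAPS = {"Obama": "Macron", "Texas": "Bavaria", "CNN": "Le Monde"}


def entity_shift(text: str) -> str:
    """Replace named entities with unrelated but type-consistent substitutes."""
    for src, dst in ENTITY_SWAPS.items():
        text = text.replace(src, dst)
    return text


def ssa_factor(classify: Callable[[str], str],
               claims: Iterable[Tuple[str, str]]) -> float:
    """Assumed SSA Factor: accuracy on original claims minus accuracy after the entity shift.

    A large drop suggests the model relied on memorized (claim, label) associations
    rather than on reasoning over the claim's content.
    """
    claims = list(claims)
    acc_orig = sum(classify(c) == y for c, y in claims) / len(claims)
    acc_shift = sum(classify(entity_shift(c)) == y for c, y in claims) / len(claims)
    return acc_orig - acc_shift
```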
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data during training, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC risk across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on nine LLMs (0.5B–72B parameters) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted with the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
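The sketch below illustrates the general shape of such a pipeline: per-level contamination scores are fuzzified, fused into a single factor, and used to discount raw accuracy. The triangular membership functions, the two-rule base, and the multiplicative correction are assumptions chosen for brevity; the abstract does not disclose the actual rule base or adjustment formula.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function supported on (a, c) and peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def dcr_factor(semantic: float, informational: float, data: float, label: float) -> float:
    """Fuse four per-level contamination scores (each in [0, 1]) into a single factor."""
    levels = [semantic, informational, data, label]
    # Rule 1: contamination is LOW only if every level looks low (AND -> min).
    low = min(tri(v, -1.0, 0.0, 1.0) for v in levels)
    # Rule 2: contamination is HIGH if any level looks high (OR -> max).
    high = max(tri(v, 0.0, 1.0, 2.0) for v in levels)
    # Centroid-style defuzzification with output centroids at 0.0 (low) and 1.0 (high).
    return high / (low + high + 1e-9)


def adjusted_accuracy(raw_acc: float, factor: float) -> float:
    """Assumed correction: discount raw accuracy by the estimated contamination factor."""
    return raw_acc * (1.0 - factor)


# Example: mild semantic leakage only.
f = dcr_factor(semantic=0.2, informational=0.1, data=0.0, label=0.0)
print(f, adjusted_accuracy(raw_acc=0.82, factor=f))
```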
Sentiment analysis is pivotal in natural language processing for understanding opinions and emotions in text. While advances in sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags behind, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and IFDHN, a novel model that incorporates fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations on the ArSen dataset demonstrate the efficacy of IFDHN and highlight future research directions in ASA.