SSA: Semantic Contamination of LLM-Driven Fake News Detection

Cheng Xu, Nan Yan, Shuhao Guan, Yuke Mei, Tahar Kechadi


Abstract
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (data/label level), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. In this work, we (1) are the first to formally define semantic contamination for this task and (2) introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risks from the semantic level to the label level via an entity shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B–72B parameters) across four BDC levels, we find that LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lock-step (r≥.97, for models 3B, p<.05; 𝜌≥.9 overall, p<.05). These results show that SSA provides a sensitive and scalable audit of comprehensive BDC risk and paves the way for a higher-integrity evaluation of the LLM-driven fake news detection task.
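The abstract reports the lock-step relationship with two standard statistics, Pearson's r and Spearman's 𝜌, computed between the contamination level and the SSA Factor. A minimal sketch of how such a pair of coefficients can be computed follows. The function names, the integer coding of the four levels, and the score values are assumptions made purely for illustration; they are not the paper's data, and the SSA Factor itself is not defined here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical inputs: four contamination levels coded 0..3 and made-up
# metric values for one model. These are placeholders, not reported results.
levels = [0, 1, 2, 3]
scores = [0.10, 0.35, 0.40, 0.90]

r = pearson(levels, scores)
rho = spearman(levels, scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson's r measures how close the relationship is to a straight line, while Spearman's 𝜌 only asks whether the metric rises whenever the contamination level rises, which is why a monotone but non-linear trend can yield a 𝜌 of 1 with a smaller r.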
Anthology ID:
2025.emnlp-main.744
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14748–14762
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.744/
Cite (ACL):
Cheng Xu, Nan Yan, Shuhao Guan, Yuke Mei, and Tahar Kechadi. 2025. SSA: Semantic Contamination of LLM-Driven Fake News Detection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14748–14762, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SSA: Semantic Contamination of LLM-Driven Fake News Detection (Xu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.744.pdf
Checklist:
 2025.emnlp-main.744.checklist.pdf