@inproceedings{kishi-etal-2026-fake,
    title     = {Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels},
    author    = {Kishi, Yuki and
                 Arima, Yuji and
                 Iyatomi, Hitoshi},
    editor    = {Baez Santamaria, Selene and
                 Somayajula, Sai Ashish and
                 Yamaguchi, Atsuki},
    booktitle = {Proceedings of the 19th Conference of the {European} Chapter of the {Association} for {Computational} {Linguistics} (Volume 4: Student Research Workshop)},
    month     = mar,
    year      = {2026},
    address   = {Rabat, Morocco},
    publisher = {Association for Computational Linguistics},
    url       = {https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.47/},
    pages     = {612--621},
    isbn      = {979-8-89176-383-8},
    abstract  = {The spread of misinformation has prompted extensive research on machine-learning{--}based fake news detection. However, existing datasets differ substantially in content distributions and annotation policies, complicating fair evaluation and generalization assessment. We refer to these structural differences as dataset bias. In this study, we quantitatively analyze dataset bias across multiple public fake news datasets (Kaggle, FNN, ISOT, and NELA-GT-2019/2020) with different annotation granularities, including article-level and publisher-level labels. Using document embedding{--}based similarity analysis and article category distributions, we examine how such biases affect detection performance under in-dataset and cross-dataset evaluation settings. Furthermore, to leverage large-scale but coarse-grained publisher-level data, we compare proxy-label training with a semi-supervised learning approach based on Virtual Adversarial Training (VAT). Our results show that detection performance strongly depends on dataset-specific biases, and that proxy-label training and SSL exhibit complementary, and sometimes opposite, strengths depending on whether the evaluation emphasizes in-dataset performance or cross-dataset generalization. These findings highlight the importance of appropriate training strategies and evaluation protocols when using heterogeneous fake news datasets.},
}

@comment{Stray text pasted from the ACL Anthology page export, quarantined here
  (was fused onto the entry's closing brace; BibTeX ignores it but it should
  not sit loose between entries):
  Markdown (Informal)
  [Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels](https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.47/) (Kishi et al., EACL 2026)
  ACL
  NOTE(review): the url field is a temporary "preview/ingest" address; replace
  with the canonical aclanthology.org URL once the paper is fully published.}