A Data-Centric Approach to Generalizable Speech Deepfake Detection

Wen Huang; Yuchen Mao; Yanmin Qian

Abstract

Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.

Anthology ID: 2026.acl-long.796
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 17520–17539
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.796/
Cite (ACL): Wen Huang, Yuchen Mao, and Yanmin Qian. 2026. A Data-Centric Approach to Generalizable Speech Deepfake Detection. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17520–17539, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): A Data-Centric Approach to Generalizable Speech Deepfake Detection (Huang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.796.pdf
Checklist: 2026.acl-long.796.checklist.pdf
