A Unified Feature Mixture Framework for Joint Speech and Singing Deepfake Detection

Aastha Sharma, Guangjing Wang


Abstract
High-fidelity audio generation techniques, such as voice conversion and singing voice synthesis, have significantly increased the risk of audio deepfakes. Although existing methods perform well on conversational speech deepfake detection, they fail severely under the speech-to-singing domain shift. To address this limitation, we propose GenuVoice, a unified deepfake detector based on a multi-branch mixture-of-experts architecture that integrates three complementary feature views: Wav2Vec 2.0 representations, log-mel spectrograms, and mel-frequency cepstral coefficients (MFCCs). Each expert is trained to remain independently discriminative, while a learned gating network dynamically weights expert contributions. A speech-retentive multi-domain fine-tuning strategy enables adaptation to singing without degrading speech performance. GenuVoice achieves a 1.82% Equal Error Rate (EER) on CtrSVDD, compared to 37–62% for existing speech-trained detectors, while preserving strong speech performance (0.38% EER on ASVspoof 2019) and generalizing to unseen generators (8.89% EER on held-out ASVspoof 2021). Extensive ablations confirm the importance of multi-expert fusion and speech retention, establishing GenuVoice as an effective unified detector for speech and singing deepfakes. The implementation is available at https://github.com/aastha-sharma/genuvoice
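The gated fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each expert emits a single spoof probability and that the gate produces one unnormalized logit per expert, which a softmax turns into mixing weights. The function names, the expert ordering, and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_scores(expert_scores, gate_logits):
    """Combine per-expert spoof probabilities with softmax gate weights.

    expert_scores: one probability per feature view (assumed order:
                   Wav2Vec 2.0, log-mel, MFCC).
    gate_logits:   unnormalized gate outputs for the same utterance; in a
                   trained model these would come from a learned network,
                   here they are supplied by hand (assumption).
    """
    if len(expert_scores) != len(gate_logits):
        raise ValueError("need exactly one gate logit per expert")
    weights = softmax(gate_logits)
    fused = sum(w * s for w, s in zip(weights, expert_scores))
    return fused, weights


# Illustrative values only; these are not results from the paper.
fused, weights = fuse_expert_scores([0.9, 0.6, 0.3], [2.0, 0.0, 0.0])
uniform_fused, uniform_weights = fuse_expert_scores([0.9, 0.6, 0.3], [0.0, 0.0, 0.0])
```

Because the weights form a convex combination, the fused score always lies between the lowest and highest expert score, and equal gate logits reduce the fusion to a plain average of the experts.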
Anthology ID:
2026.findings-acl.1245
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24853–24863
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1245/
Cite (ACL):
Aastha Sharma and Guangjing Wang. 2026. A Unified Feature Mixture Framework for Joint Speech and Singing Deepfake Detection. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24853–24863, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
A Unified Feature Mixture Framework for Joint Speech and Singing Deepfake Detection (Sharma & Wang, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1245.pdf
Checklist:
2026.findings-acl.1245.checklist.pdf