Jiangyan Yi


2024

pdf bib
Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
ChuYuan Zhang | Jiangyan Yi | Jianhua Tao | Chenglong Wang | Xinrui Yan
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“Recent advancements in neural speech synthesis technologies have brought aboutwidespread applications but have also raised concerns about potential misuse and abuse.Addressing these challenges is crucial, particularly in the realms of forensics and intellec-tual property protection. While previous research on source attribution of synthesizedspeech has its limitations, our study aims to fill these gaps by investigating the identifi-cation of sources in synthesized speech. We focus on analyzing speech synthesis modelfingerprints in generated speech waveforms, emphasizing the roles of the acoustic modeland vocoder. Our research, based on the multi-speaker LibriTTS dataset, reveals twokey insights: (1) both vocoders and acoustic models leave distinct, model-specific fin-gerprints on generated waveforms, and (2) vocoder fingerprints, being more dominant,may obscure those from the acoustic model. These findings underscore the presence ofmodel-specific fingerprints in both components, suggesting their potential significance insource identification applications.”

pdf bib
EmoFake: An Initial Dataset for Emotion Fake Audio Detection
Yan Zhao | Jiangyan Yi | Jianhua Tao | Chenglong Wang | Yongfeng Dong
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“To enhance the effectiveness of fake audio detection techniques, researchers have developed mul-tiple datasets such as those for the ASVspoof and ADD challenges. These datasets typically focuson capturing non-emotional characteristics in speech, such as the identity of the speaker and theauthenticity of the content. However, they often overlook changes in the emotional state of theaudio, which is another crucial dimension affecting the authenticity of speech. Therefore, thisstudy reports our progress in developing such an emotion fake audio detection dataset involvingchanging emotion state of the origin audio named EmoFake. The audio samples in EmoFake aregenerated using open-source emotional voice conversion models, intended to simulate potentialemotional tampering scenarios in real-world settings. We conducted a series of benchmark ex-periments on this dataset, and the results show that even advanced fake audio detection modelstrained on the ASVspoof 2019 LA dataset and the ADD 2022 track 3.2 dataset face challengeswith EmoFake. The EmoFake is publicly available1 now.”

pdf bib
NLoPT: N-gram Enhanced Low-Rank Task Adaptive Pre-training for Efficient Language Model Adaption
Hao Gu | Jiangyan Yi | Zheng Lian | Jianhua Tao | Xinrui Yan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pre-trained Language Models (PLMs) like BERT have achieved superior performance on different downstream tasks, even when such a model is trained on a general domain. Moreover, recent studies have shown that continued pre-training on task-specific data, known as task adaptive pre-training (TAPT), can further improve downstream task performance. However, conventional TAPT adjusts all the parameters of the PLMs, which distorts the learned generic knowledge embedded in the original PLMs weights, and it is expensive to store a whole model copy for each downstream task. In this paper, we propose NLoPT, a two-step n-gram enhanced low-rank task adaptive pre-training method, to effectively and efficiently customize a PLM to the downstream task. Specifically, we first apply low-rank adaption (LoRA), a prevalent parameter-efficient technique, for efficient TAPT. We further explicitly incorporate the task-specific multi-granularity n-gram information via the cross-attention mechanism. Experimental results on six datasets from four domains illustrate the effectiveness of NLoPT, demonstrating the superiority of LoRA based TAPT and the necessity of incorporating task-specific n-gram information.