2024
Cross-Domain Audio Deepfake Detection: Dataset and Analysis
Yuang Li | Min Zhang | Mengxin Ren | Xiaosong Qiao | Miaomiao Ma | Daimeng Wei | Hao Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks because they can clone a voice from a single utterance. However, existing ADD datasets are outdated, which leads to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, with novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates (EERs) of 4.1% and 6.5%, respectively. Additionally, we demonstrate our models’ strong few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect detection accuracy, which calls for further research. Our dataset is publicly available (https://github.com/leolya/CD-ADD).
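The equal error rate (EER) quoted above is the operating point at which the false-positive rate (bona fide audio flagged as fake) equals the false-negative rate (fake audio missed). The following is a minimal, hypothetical Python sketch of how such a figure can be computed from detector scores. It assumes that a higher score means "more likely synthetic" and that label 1 marks deepfake audio; both conventions are assumptions for illustration, not details taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumed convention (hypothetical): label 1 = deepfake, label 0 = bona fide,
    and a sample is predicted "deepfake" when its score >= threshold.
    """
    pos = sum(labels)            # number of deepfake samples
    neg = len(labels) - pos      # number of bona fide samples
    if pos == 0 or neg == 0:
        raise ValueError("EER needs both bona fide and deepfake samples")

    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        # False positives: bona fide samples scored at or above the threshold.
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / neg
        # False negatives: deepfake samples scored below the threshold.
        fnr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / pos
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their mean as the EER estimate.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer
```

With only a handful of thresholds, the two error rates rarely cross exactly, so averaging them at the closest point is a common approximation; evaluation toolkits usually interpolate along the ROC curve instead.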
CB-Whisper: Contextual Biasing Whisper Using Open-Vocabulary Keyword-Spotting
Yuang Li | Yinglu Li | Min Zhang | Chang Su | Jiawei Yu | Mengyao Piao | Xiaosong Qiao | Miaomiao Ma | Yanqing Zhao | Hao Yang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare named entities, such as personal names, organizations and terminology that are not frequently encountered in the training data. This paper presents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on OpenAI’s Whisper model that can recognize user-defined named entities by performing open-vocabulary keyword spotting (KWS) before the decoder. The KWS module leverages text-to-speech (TTS) techniques and a convolutional neural network (CNN) classifier to match features between the entities and the utterances. To integrate the recognized entities into the Whisper decoder and avoid hallucinations, we carefully craft multiple prompts with spoken-form hints. Experiments show that the KWS module based on the Whisper encoder’s features can effectively recognize unseen user-defined keywords. More importantly, the proposed CB-Whisper substantially improves the mixed error rate (MER) and entity recall compared to the original Whisper model on three internal datasets and two publicly available datasets, Aishell and ACL, which cover English-only, Chinese-only, and code-switching scenarios.
2023
HW-TSC at SemEval-2023 Task 7: Exploring the Natural Language Inference Capabilities of ChatGPT and Pre-trained Language Model for Clinical Trial
Xiaofeng Zhao | Min Zhang | Miaomiao Ma | Chang Su | Yilun Liu | Minghan Wang | Xiaosong Qiao | Jiaxin Guo | Yinglu Li | Wenbing Ma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
In this paper, we describe our multi-strategy system for SemEval-2023 Task 7. The task aims to determine whether a given statement is supported by one or two clinical trial reports, and to identify the evidence that supports the statement, which requires strong natural language inference capabilities. In Subtask 1, we compare our strategy based on prompt learning and ChatGPT with a BERT baseline in the zero-shot setting, and validate the effectiveness of our strategy. In Subtask 2, we fine-tune DeBERTaV3 for classification without relying on the results of Subtask 1, and observe that early stopping effectively prevents the model from overfitting, which leads to good performance in Subtask 2. We did not use any ensemble strategies. Ultimately, we achieved 10th place in Subtask 1 and 2nd place in Subtask 2.
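Early stopping, credited above with preventing overfitting in Subtask 2, halts fine-tuning once a validation metric stops improving. Below is a minimal sketch of the generic patience-based rule in Python; the `patience` and `min_delta` parameters and the helper `first_stop_epoch` are hypothetical illustrations, not the settings or code used by the authors.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations allowed without improvement (hypothetical default)
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best = float("inf")      # best validation loss seen so far
        self.bad_rounds = 0           # consecutive evaluations without improvement

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


def first_stop_epoch(val_losses, patience=3):
    """Hypothetical helper: return (epoch at which training stops, best loss)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses):
        if stopper.step(loss):
            return epoch, stopper.best
    return None, stopper.best  # never triggered
```

For example, with validation losses `[0.9, 0.7, 0.65, 0.66, 0.68, 0.70, 0.5]` and a patience of 3, training stops at epoch 5 with a best loss of 0.65, so the later improvement to 0.5 is never reached. In practice, the checkpoint saved at the best epoch is usually the one kept.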
2022
HW-TSC’s Submissions to the WMT 2022 General Machine Translation Shared Task
Daimeng Wei | Zhiqiang Rao | Zhanglin Wu | Shaojun Li | Yuanchang Luo | Yuhao Xie | Xiaoyu Chen | Hengchao Shang | Zongyao Li | Zhengzhe Yu | Jinlong Yang | Miaomiao Ma | Lizhi Lei | Hao Yang | Ying Qin
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents the submissions of Huawei Translate Services Center (HW-TSC) to the WMT 2022 General Machine Translation Shared Task. We participate in 6 language pairs: Zh↔En, Ru↔En, Uk↔En, Hr↔En, Uk↔Cs and Liv↔En. We use the Transformer architecture and obtain our best performance with multiple variants that have larger parameter sizes. We perform fine-grained pre-processing and filtering on the provided large-scale bilingual and monolingual datasets. For medium- and high-resource languages, we mainly use data augmentation strategies, including Back Translation, Self Training, Ensemble Knowledge Distillation, and Multilingual training. For low-resource languages such as Liv, we use pre-trained machine translation models and then continue training with Regularized Dropout (R-Drop), in addition to the data augmentation methods mentioned above. Our submissions obtain competitive results in the final evaluation.
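R-Drop, mentioned above for the low-resource Liv↔En setting, runs the same input through the model twice with different dropout masks and adds a symmetric KL term that pushes the two output distributions to agree. The following is a minimal Python sketch of that loss for a single output distribution (for example, one target token). The probability vectors `p1` and `p2` stand in for the two forward passes, and `alpha` is a hypothetical weight; this is an illustration of the general technique, not HW-TSC's training code.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(p1, p2, target, alpha=1.0):
    """R-Drop loss for one prediction.

    p1, p2: output distributions from two dropout-perturbed forward passes
            of the same model on the same input (assumed strictly positive).
    target: index of the reference token/class.
    alpha:  weight of the consistency term (hypothetical default).
    """
    # Negative log-likelihood of the reference under both passes.
    nll = -math.log(p1[target]) - math.log(p2[target])
    # Symmetric KL divergence between the two passes.
    sym_kl = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return nll + alpha * sym_kl
```

When both passes produce the same distribution, the KL term vanishes and the loss reduces to the summed negative log-likelihood. When they disagree, the extra term penalizes the inconsistency, which acts as a regularizer.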
HW-TSC Translation Systems for the WMT22 Biomedical Translation Task
Zhanglin Wu | Jinlong Yang | Zhiqiang Rao | Zhengzhe Yu | Daimeng Wei | Xiaoyu Chen | Zongyao Li | Hengchao Shang | Shaojun Li | Ming Zhu | Yuanchang Luo | Yuhao Xie | Miaomiao Ma | Ting Zhu | Lizhi Lei | Song Peng | Hao Yang | Ying Qin
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes the translation systems trained by Huawei Translation Services Center (HW-TSC) for the WMT22 biomedical translation task in five language pairs: English↔German (en↔de), English↔French (en↔fr), English↔Chinese (en↔zh), English↔Russian (en↔ru) and Spanish→English (es→en). Our primary systems are built on a deep Transformer with a large filter size. We also use R-Drop, data diversification, forward translation, back translation, data selection, fine-tuning and ensembling to improve system performance. According to the official evaluation results on OCELoT or CodaLab, our unconstrained systems for en→de, de→en, en→fr, fr→en, en→zh and es→en (clinical terminology sub-track) achieve the highest BLEU scores among all submissions to the WMT22 biomedical translation task.