Xuankai Chang


2024

pdf
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
Rongjie Huang | Chunlei Zhang | Yongqi Wang | Dongchao Yang | Jinchuan Tian | Zhenhui Ye | Luping Liu | Zehan Wang | Ziyue Jiang | Xuankai Chang | Jiatong Shi | Chao Weng | Zhou Zhao | Dong Yu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have successfully served as a general-purpose interface across multiple tasks and languages, while the adaptation of voice LLMs is mostly designed for specific purposes (either single-task or monolingual), where the advantages of LLMs especially for low-resource language processing and zero-shot task generalization are less exploited in the audio community. To bridge the gap, we introduce Make-A-Voice as a multi-modal voice LLM and conduct a comprehensive study on its capability to deal with multiple tasks/languages. When trained on ~200K hours of 6-language data for 4 voice generation applications, Make-A-Voice emerges notable advantages: 1) as scalable learners to improve performance with end-to-end local and global multiscale transformers; and 2) as multitask learners by adjusting prompts to share common knowledge across modalities (speech/singing) and present in-context learning abilities by generalizing to unseen tasks not explicitly train on; 3) as multilingual learners to alleviate data scarcity of low-resource languages by including rich-resource language training data. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models in monolingual/cross-lingual voice generation. Audio samples are available at https://M-Voice.github.io

2022

pdf
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
Hsiang-Sheng Tsai | Heng-Jui Chang | Wen-Chin Huang | Zili Huang | Kushal Lakhotia | Shu-wen Yang | Shuyan Dong | Andy Liu | Cheng-I Lai | Jiatong Shi | Xuankai Chang | Phil Hall | Hsuan-Jui Chen | Shang-Wen Li | Shinji Watanabe | Abdelrahman Mohamed | Hung-yi Lee
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focusing on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.

2021

pdf
Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation
Jiatong Shi | Jonathan D. Amith | Xuankai Chang | Siddharth Dalmia | Brian Yan | Shinji Watanabe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.