William Chen


2024

Evaluating Self-Supervised Speech Representations for Indigenous American Languages
Chih-Chen Chen | William Chen | Rodolfo Joel Zevallos | John E. Ortega
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In this work, we benchmark the efficacy of large SSL models on low-resource ASR for 6 indigenous American languages: Quechua, Guarani, Bribri, Kotiria, Wa’ikhana, and Totonac. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.

2023

Findings of the CoCo4MT 2023 Shared Task on Corpus Construction for Machine Translation
Ananya Ganesh | Marine Carpuat | William Chen | Katharina Kann | Constantine Lignos | John E. Ortega | Jonne Saleva | Shabnam Tafreshi | Rodolfo Zevallos
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation

This paper provides an overview of the first shared task on choosing beneficial instances for machine translation, conducted as part of the CoCo4MT 2023 Workshop at MTSummit. This shared task was motivated by the need to make the data annotation process for machine translation more efficient, particularly for low-resource languages for which collecting human translations may be difficult or expensive. The task involved developing methods for selecting the most beneficial instances for training a machine translation system without access to an existing parallel dataset in the target language, such that the best selected instances can then be manually translated. Two teams participated in the shared task, namely the Williams team and the AST team. Submissions were evaluated by training a machine translation model on each submission’s chosen instances and comparing the resulting models using the chrF++ score. The first-ranked system, from the Williams team, finds representative instances by clustering the training data.

Findings of the IWSLT 2023 Evaluation Campaign
Milind Agarwal | Sweta Agrawal | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | Mingda Chen | William Chen | Khalid Choukri | Alexandra Chronopoulou | Anna Currey | Thierry Declerck | Qianqian Dong | Kevin Duh | Yannick Estève | Marcello Federico | Souhir Gahbiche | Barry Haddow | Benjamin Hsu | Phu Mon Htut | Hirofumi Inaguma | Dávid Javorský | John Judge | Yasumasa Kano | Tom Ko | Rishu Kumar | Pengwei Li | Xutai Ma | Prashant Mathur | Evgeny Matusov | Paul McNamee | John P. McCrae | Kenton Murray | Maria Nadejde | Satoshi Nakamura | Matteo Negri | Ha Nguyen | Jan Niehues | Xing Niu | Atul Kr. Ojha | John E. Ortega | Proyag Pal | Juan Pino | Lonneke van der Plas | Peter Polák | Elijah Rippeth | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Yun Tang | Brian Thompson | Kevin Tran | Marco Turchi | Alex Waibel | Mingxuan Wang | Shinji Watanabe | Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

CMU’s IWSLT 2023 Simultaneous Speech Translation System
Brian Yan | Jiatong Shi | Soumi Maiti | William Chen | Xinjian Li | Yifan Peng | Siddhant Arora | Shinji Watanabe
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper describes CMU’s submission to the IWSLT 2023 simultaneous speech translation shared task for translating English speech to both German text and speech in a streaming fashion. We first build offline speech-to-text (ST) models using the joint CTC/attention framework. These models also use WavLM front-end features and mBART decoder initialization. We adapt our offline ST models for simultaneous speech-to-text translation (SST) by 1) incrementally encoding chunks of input speech, re-computing encoder states for each new chunk and 2) incrementally decoding output text, pruning beam search hypotheses to 1-best after processing each chunk. We then build text-to-speech (TTS) models using the VITS framework and achieve simultaneous speech-to-speech translation (SS2ST) by cascading our SST and TTS models.
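
The two adaptation steps in this abstract (re-encoding all audio received so far whenever a chunk arrives, then keeping only the 1-best hypothesis before the next chunk) can be illustrated with a minimal sketch. This is not the CMU system: `simultaneous_translate`, `toy_encode`, and `toy_decode_best` are hypothetical names, and the toy encoder and decoder are simple stand-ins for a real speech encoder and a beam search that has already been pruned to its top hypothesis.

```python
from typing import Callable, List, Sequence


def simultaneous_translate(
    samples: Sequence[float],
    chunk_size: int,
    encode: Callable[[Sequence[float]], List[float]],
    decode_best: Callable[[List[float], List[str]], List[str]],
) -> List[List[str]]:
    """Return the 1-best hypothesis kept after each incoming chunk."""
    committed: List[str] = []  # single hypothesis carried between chunks
    history: List[List[str]] = []
    for end in range(chunk_size, len(samples) + chunk_size, chunk_size):
        seen = samples[:end]          # all audio received so far
        states = encode(seen)         # encoder states re-computed from scratch
        committed = decode_best(states, committed)  # extend prefix, keep 1-best
        history.append(list(committed))
    return history


def toy_encode(seen: Sequence[float]) -> List[float]:
    # Hypothetical stand-in: one "state" per pair of input samples.
    return [sum(seen[i:i + 2]) for i in range(0, len(seen), 2)]


def toy_decode_best(states: List[float], prefix: List[str]) -> List[str]:
    # Hypothetical stand-in: keep the prefix, add one token per new state.
    return prefix + [f"tok{i}" for i in range(len(prefix), len(states))]


history = simultaneous_translate([0.1] * 6, 2, toy_encode, toy_decode_best)
for step, hypothesis in enumerate(history, start=1):
    print(step, hypothesis)
```

In this sketch each chunk triggers a full re-encode, which costs more computation than caching encoder states but lets the states for earlier audio reflect the newly arrived context.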

QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks
John E. Ortega | Rodolfo Zevallos | William Chen
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This article describes the QUESPA team speech translation (ST) submissions for the Quechua to Spanish (QUE–SPA) track featured in the Evaluation Campaign of IWSLT 2023: low-resource and dialect speech translation. Two main submission types were supported in the campaign: constrained and unconstrained. We submitted six systems in total. Our best (primary) constrained system consisted of an ST model based on the Fairseq S2T framework, where the audio representations were created using log mel-scale filter banks as features and the translations were performed using a transformer. The best (primary) unconstrained system used a pipeline approach which combined automatic speech recognition (ASR) with machine translation (MT). The ASR transcriptions for the best unconstrained system were computed using a pre-trained XLS-R-based model along with a fine-tuned language model. Transcriptions were translated using an MT system based on a fine-tuned, pre-trained language model (PLM). The four other submissions (2 constrained and 2 unconstrained) are presented in this article for comparison because they consist of various architectures. Our results show that direct ST (ASR and MT combined together) can be more effective than a PLM in a low-resource (constrained) setting for Quechua to Spanish. On the other hand, we show that fine-tuning of any type on both the ASR and MT system is worthwhile, resulting in nearly 16 BLEU for the unconstrained task.

2022

Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)
John E. Ortega | Marine Carpuat | William Chen | Katharina Kann | Constantine Lignos | Maja Popovic | Shabnam Tafreshi
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua
Rodolfo Zevallos | John Ortega | William Chen | Richard Castro | Núria Bel | Cesar Toshio | Renzo Venturas | Hilario Aradiel | Nelsi Melgarejo
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chatbots, and more. The scarcity of digital resources for a language has a greater impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of Quechua, an indigenous South American low-resource language spoken by millions. Specifically, our curated corpus is created from text gathered from the southern region of Peru, where a dialect of Quechua is spoken that has not traditionally been used as a target dialect for digital systems. In order to make our work repeatable by others, we also offer a public, pre-trained BERT model called QuBERT, which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks, (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging, using state-of-the-art techniques, and achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT, which is on par with other state-of-the-art multilingual models for natural language processing, achieving between 71% and 74% F1 score on NER and 84–87% on POS tasks.

2021

Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages
William Chen | Brett Fazio
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

Neural Machine Translation (NMT) for Low Resource Languages (LRL) is often limited by the lack of available training data, making it necessary to explore additional techniques to improve translation quality. We propose the use of the Prefix-Root-Postfix-Encoding (PRPE) subword segmentation algorithm to improve translation quality for LRLs, using two agglutinative languages as case studies: Quechua and Indonesian. During the course of our experiments, we reintroduce a parallel corpus for Quechua-Spanish translation that was previously unavailable for NMT. Our experiments show the importance of appropriate subword segmentation, which can go as far as improving translation quality over systems trained on much larger quantities of data. We show this by achieving state-of-the-art results for both languages, obtaining higher BLEU scores than large pre-trained models with much smaller amounts of data.

The UCF Systems for the LoResMT 2021 Machine Translation Shared Task
William Chen | Brett Fazio
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

We present the University of Central Florida systems for the LoResMT 2021 Shared Task, participating in the English-Irish and English-Marathi translation pairs. We focused our efforts on the constrained track of the task, using transfer learning and subword segmentation to enhance our models given small amounts of training data. Our models achieved the highest BLEU scores on the fully constrained tracks of English-Irish, Irish-English, and Marathi-English, with scores of 13.5, 21.3, and 17.9, respectively.