Martin Bär
In many languages, non-standardized varieties make the development of NLP models challenging. This paper explores various fine-tuning techniques and data setups for training Swiss German to Standard German speech-to-text translation models. While fine-tuning on all available Swiss German data yields the best results, ASR pre-training lowers performance by 1.48 BLEU points, and jointly training on Swiss and Standard German data reduces it by 2.29 BLEU. Our dialect transfer experiments suggest that an equivalent of the Curse of Multilinguality (Conneau et al., 2020) exists in dialectal speech processing, as training on multiple dialects jointly tends to decrease single-dialect performance. However, introducing small amounts of dialectal variability can improve the performance for low-resource dialects.
We present the LCT-LAP proposal for the shared task on Translation into Low-Resource Languages of Spain at WMT24 within the constrained submission category. Our work harnesses encoder-decoder models pretrained on higher-resource Iberian languages to facilitate MT model training for Asturian, Aranese and Aragonese. Furthermore, we explore the robustness of these models when fine-tuned on datasets with varying levels of alignment noise. We fine-tuned a Spanish-Galician model using Asturian data filtered by BLEU score thresholds of 5, 15, 30 and 60, identifying BLEU 15 as the most effective. This threshold was then applied to the Aranese and Aragonese datasets. Our findings indicate that filtering the corpora reduces computational costs and improves performance compared to using nearly raw data or data filtered with language identification. However, it still falls short of the performance achieved by the rule-based system Apertium in Aranese and Aragonese.
For the 2023 IWSLT Maltese Speech Translation Task, UM-DFKI jointly presents a cascade solution which achieves 0.6 BLEU. While this is the first time that a Maltese speech translation task has been released by IWSLT, this paper explores previous solutions for other speech translation tasks, focusing primarily on low-resource scenarios. Moreover, we present our method of fine-tuning XLS-R models for Maltese ASR on a collection of multilingual speech corpora, as well as our fine-tuning of the mBART model for Maltese-to-English machine translation.