Aarón Galiano-Jiménez
Also published as:
Aaron Galiano Jimenez
This paper delves into sequence-level knowledge distillation (KD) of multilingual pre-trained translation models. We posit that, beyond the approximated mode obtained via beam search, the whole output distribution of the teacher contains valuable insights for students. We explore the potential of n-best lists from beam search to guide the student’s learning and then investigate alternative decoding methods to address observed issues like low variability and under-representation of infrequent tokens. Our research in data-limited scenarios reveals that although sampling methods can slightly compromise the translation quality of the teacher output compared to beam-search-based methods, they enrich the generated corpora with increased variability and lexical richness, ultimately enhancing student model performance and reducing the gender bias amplification commonly associated with KD.
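A minimal sketch of the kind of teacher decoding contrasted here: generating a sequence-level KD corpus either with beam search (the approximated mode) or with sampling for more varied outputs. The Hugging Face teacher checkpoint, language codes, and generation settings below are illustrative assumptions, not the paper's actual configuration:

```python
# Illustrative sketch (not the paper's exact setup): generate a sequence-level KD
# corpus from a multilingual teacher, contrasting beam search with sampling.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "facebook/nllb-200-distilled-600M"  # assumed teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="spa_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def teacher_translate(sentences, sample=False, num_outputs=1):
    """Return teacher translations: the beam-search mode or sampled hypotheses."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = teacher.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("ast_Latn"),
        do_sample=sample,              # True -> sampling, False -> beam search
        num_beams=1 if sample else 5,
        num_return_sequences=num_outputs,
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

src = ["El tiempo es oro."]
mode_output = teacher_translate(src)                                   # beam-search mode
sampled_outputs = teacher_translate(src, sample=True, num_outputs=4)   # varied hypotheses
```

Keeping several sampled hypotheses per source sentence is, roughly, what injects the extra variability and lexical richness the abstract refers to, at the cost of occasionally lower-quality individual translations.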
The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, where adequate linguistic resources are scarce. The project started in September 2022 and will run till August 2025.
In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.
This paper presents the results of the Ninth Conference on Machine Translation (WMT24) Shared Task “Translation into Low-Resource Languages of Spain”. The task focused on the development of machine translation systems for three language pairs: Spanish-Aragonese, Spanish-Aranese, and Spanish-Asturian. Seventeen teams participated in the shared task with a total of 87 submissions. The baseline system for all language pairs was Apertium, a rule-based machine translation system that still performs competitively, even in an era dominated by more advanced non-symbolic approaches. We report and discuss the results of the submitted systems, highlighting the strengths of both neural and rule-based approaches.
This paper describes the submissions of the Transducens group of the Universitat d’Alacant to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain; in particular, the task focuses on translation from Spanish into Aragonese, Aranese, and Asturian. Our submissions use parallel and monolingual data to fine-tune the NLLB-1.3B model and to investigate the effectiveness of synthetic corpora and transfer learning between related languages such as Catalan, Galician, and Valencian. We also present a many-to-many multilingual neural machine translation model focused on the Romance languages of Spain.
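As a rough illustration of fine-tuning an NLLB checkpoint on a parallel corpus for one of these pairs, the sketch below uses the Hugging Face Seq2SeqTrainer. The data file, field names, and hyperparameters are assumptions for illustration and do not reproduce the submission's setup; Spanish-Asturian is used here because both language codes exist in NLLB-200:

```python
# Illustrative sketch (assumed data file, field names and hyperparameters):
# fine-tune an NLLB checkpoint on a Spanish-Asturian parallel corpus.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          src_lang="spa_Latn", tgt_lang="ast_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical JSON-lines file with {"es": ..., "ast": ...} sentence pairs.
raw = load_dataset("json", data_files={"train": "spa-ast.train.jsonl"})

def preprocess(examples):
    return tokenizer(examples["es"], text_target=examples["ast"],
                     truncation=True, max_length=256)

train = raw["train"].map(preprocess, batched=True,
                         remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="nllb-spa-ast",
                                  per_device_train_batch_size=8,
                                  learning_rate=1e-4,
                                  num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```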
Pre-trained models have drastically changed the field of natural language processing by providing a way to leverage large-scale language representations for a variety of tasks. Some pre-trained models offer general-purpose representations, while others are specialized in particular tasks, like neural machine translation (NMT). Multilingual NMT-targeted systems are often fine-tuned for specific language pairs, but there is a lack of evidence-based best-practice recommendations to guide this process. Moreover, the trend towards ever larger pre-trained models has made it challenging to deploy them in the computationally restrictive environments typically found in developing regions where low-resource languages are usually spoken. We propose a pipeline to tune the mBART50 pre-trained model to 8 diverse low-resource language pairs, and then distil the resulting system to obtain lightweight and more sustainable models. Our pipeline conveniently exploits back-translation, synthetic corpus filtering, and knowledge distillation to deliver efficient yet powerful bilingual translation models that are 13 times smaller than the original pre-trained ones, with close performance in terms of BLEU.
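The back-translation and synthetic-corpus-filtering steps of such a pipeline can be sketched as follows. The mBART50 checkpoint, the language codes (placeholders standing in for the actual low-resource pair), and the crude length-ratio filter are assumptions for illustration, not the filtering used in the paper:

```python
# Illustrative sketch (placeholder model, language codes and thresholds):
# back-translate target-side monolingual text, then filter the synthetic pairs.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

bt_name = "facebook/mbart-large-50-many-to-many-mmt"  # assumed back-translation model
tok = AutoTokenizer.from_pretrained(bt_name, src_lang="es_XX", tgt_lang="en_XX")
bt_model = AutoModelForSeq2SeqLM.from_pretrained(bt_name)

def back_translate(target_sentences):
    """Translate target-language monolingual text into the source language."""
    batch = tok(target_sentences, return_tensors="pt", padding=True, truncation=True)
    out = bt_model.generate(**batch,
                            forced_bos_token_id=tok.convert_tokens_to_ids("en_XX"),
                            num_beams=5, max_new_tokens=128)
    return tok.batch_decode(out, skip_special_tokens=True)

def keep_pair(src, tgt, min_ratio=0.5, max_ratio=2.0):
    """Crude filter: drop empty sides and pairs with extreme length ratios."""
    if not src.strip() or not tgt.strip():
        return False
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    return min_ratio <= ratio <= max_ratio

monolingual_tgt = ["Una frase monolingüe en la lengua de destino."]
synthetic_src = back_translate(monolingual_tgt)
filtered = [(s, t) for s, t in zip(synthetic_src, monolingual_tgt) if keep_pair(s, t)]
```

The surviving (synthetic source, real target) pairs would then feed the student training and knowledge-distillation stages described above.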
We present the most relevant results from the second year of the MaCoCu project (Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages). To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.