Abstract
We present the LCT-EHU submission to the AmericasNLP 2023 low-resource machine translation shared task. We focus on the Spanish-Quechua language pair and explore several approaches: (1) obtaining new parallel corpora from the literary and legal domains, (2) comparing a high-resource Spanish-English pre-trained MT model with a Spanish-Finnish pre-trained model (Finnish being chosen as a target language due to its morphological similarity to Quechua), and (3) exploring additional techniques such as a copied corpus and back-translation. Overall, we show that the Spanish-Finnish pre-trained model outperforms the other setups, while low-quality synthetic data reduces performance.
- Anthology ID:
- 2023.americasnlp-1.16
- Volume:
- Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Manuel Mager, Abteen Ebrahimi, Arturo Oncevay, Enora Rice, Shruti Rijhwani, Alexis Palmer, Katharina Kann
- Venue:
- AmericasNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 156–162
- URL:
- https://aclanthology.org/2023.americasnlp-1.16
- DOI:
- 10.18653/v1/2023.americasnlp-1.16
- Cite (ACL):
- Nouman Ahmed, Natalia Flechas Manrique, and Antonije Petrović. 2023. Enhancing Spanish-Quechua Machine Translation with Pre-Trained Models and Diverse Data Sources: LCT-EHU at AmericasNLP Shared Task. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 156–162, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Enhancing Spanish-Quechua Machine Translation with Pre-Trained Models and Diverse Data Sources: LCT-EHU at AmericasNLP Shared Task (Ahmed et al., AmericasNLP 2023)
- PDF:
- https://preview.aclanthology.org/jeptaln-2024-ingestion/2023.americasnlp-1.16.pdf