Awajun-OP: Multi-domain dataset for Spanish–Awajun Machine Translation

Oscar Moreno, Yanua Atamain, Arturo Oncevay


Abstract
We introduce a Spanish-Awajun parallel dataset of 22k high-quality sentence pairs, built with the help of the journalistic organization Company C. The dataset consists of parallel data obtained from various web sources and covers poems, stories, laws, protocols, guidelines, handbooks, the Bible, and news published by Company C. The study also analyzes Spanish-Awajun translation performance on this dataset using a Transformer architecture with transfer learning from a parent model, with Spanish-English and Spanish-Finnish serving as the high-resource language pairs. As far as we know, this is the first Spanish-Awajun machine translation study, and we hope that this work will serve as a starting point for future research on this neglected Peruvian language.
Anthology ID:
2024.americasnlp-1.12
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
112–120
URL:
https://aclanthology.org/2024.americasnlp-1.12
Cite (ACL):
Oscar Moreno, Yanua Atamain, and Arturo Oncevay. 2024. Awajun-OP: Multi-domain dataset for Spanish–Awajun Machine Translation. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 112–120, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Awajun-OP: Multi-domain dataset for Spanish–Awajun Machine Translation (Moreno et al., AmericasNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.americasnlp-1.12.pdf
Supplementary material:
 2024.americasnlp-1.12.SupplementaryMaterial.zip