Yliana Rodríguez
2023
Initial Experiments for Building a Guarani WordNet
Luis Chiruzzo
|
Marvin Agüero-Torales
|
Aldo Alvarez
|
Yliana Rodríguez
Proceedings of the 12th Global Wordnet Conference
This paper presents a work in progress about creating a Guarani version of the WordNet database. Guarani is an indigenous South American language and is a low-resource language from the NLP perspective. Following the expand approach, we aim to find Guarani lemmas that correspond to the concepts defined in WordNet. We do this through three strategies that try to select the correct lemmas from Guarani-Spanish datasets. We ran them through three different bilingual dictionaries and had native speakers assess the results. This procedure found Guarani lemmas for about 6.5 thousand synsets, including 27% of the base WordNet concepts. However, more work on the quality of the selected words will be needed in order to create a final version of the dataset.
2022
Jojajovai: A Parallel Guarani-Spanish Corpus for MT Benchmarking
Luis Chiruzzo
|
Santiago Góngora
|
Aldo Alvarez
|
Gustavo Giménez-Lugo
|
Marvin Agüero-Torales
|
Yliana Rodríguez
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This work presents a parallel corpus of Guarani-Spanish text aligned at sentence level. The corpus contains about 30,000 sentence pairs, and is structured as a collection of subsets from different sources, further split into training, development and test sets. A sample of sentences from the test set was manually annotated by native speakers in order to incorporate meta-linguistic annotations about the Guarani dialects present in the corpus and also the correctness of the alignment and translation. We also present some baseline MT experiments and analyze the results in terms of the subsets. We hope this corpus can be used as a benchmark for testing Guarani-Spanish MT systems, and aim to expand and improve the quality of the corpus in future iterations.