Experiments on a Guarani Corpus of News and Social Media

Santiago Góngora, Nicolás Giossa, Luis Chiruzzo


Abstract
Although Guarani is widely spoken in South America, obtaining large amounts of Guarani text from the web is hard. We describe how we built a Guarani corpus composed of a parallel Guarani-Spanish set of news articles and a monolingual set of tweets. We run word embedding experiments to evaluate the quality of the Guarani split of the corpus; the results are encouraging, but they suggest that more diversity in text domains might be needed for further improvement.
Anthology ID:
2021.americasnlp-1.16
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venue:
AmericasNLP
Publisher:
Association for Computational Linguistics
Pages:
153–158
URL:
https://aclanthology.org/2021.americasnlp-1.16
DOI:
10.18653/v1/2021.americasnlp-1.16
Cite (ACL):
Santiago Góngora, Nicolás Giossa, and Luis Chiruzzo. 2021. Experiments on a Guarani Corpus of News and Social Media. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 153–158, Online. Association for Computational Linguistics.
Cite (Informal):
Experiments on a Guarani Corpus of News and Social Media (Góngora et al., AmericasNLP 2021)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2021.americasnlp-1.16.pdf