A Language Model Trained on Uruguayan Spanish News Text
Juan Pablo Filevich, Gonzalo Marco, Santiago Castro, Luis Chiruzzo, Aiala Rosá
Abstract
This paper presents a language model trained from scratch exclusively on a brand-new corpus of about 6 GiB of Uruguayan newspaper text. We trained the model for 30 days on a single Nvidia P100 using the RoBERTa-base architecture, but with considerably fewer parameters than other standard RoBERTa models. We evaluated the model on two NLP tasks and found that it outperforms BETO, the widely used Spanish BERT pre-trained model. We also compared our model on the masked-word prediction task with two popular multilingual BERT-based models, Multilingual BERT and XLM-RoBERTa, obtaining outstanding results on sentences from the Uruguayan press domain. Our experiments show that training a language model on a domain-specific corpus can significantly improve performance even when the model is smaller and was trained with far less data than more standard pre-trained models.
- Anthology ID:
- 2024.tdle-1.5
- Volume:
- Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Federico Gaspari, Joss Moorkens, Itziar Aldabe, Aritz Farwell, Begoña Altuna, Stelios Piperidis, Georg Rehm, German Rigau
- Venues:
- TDLE | WS
- Publisher:
- ELRA and ICCL
- Pages:
- 53–60
- URL:
- https://aclanthology.org/2024.tdle-1.5
- Cite (ACL):
- Juan Pablo Filevich, Gonzalo Marco, Santiago Castro, Luis Chiruzzo, and Aiala Rosá. 2024. A Language Model Trained on Uruguayan Spanish News Text. In Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024, pages 53–60, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- A Language Model Trained on Uruguayan Spanish News Text (Filevich et al., TDLE-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.tdle-1.5.pdf