Abstract
In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that improves on smaller existing pretraining datasets for the language in both scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models on three benchmark datasets, with an average gain of 4.47% test accuracy across three classification tasks of varying difficulty.
- Anthology ID:
- 2022.lrec-1.703
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 6548–6555
- URL:
- https://aclanthology.org/2022.lrec-1.703
- Cite (ACL):
- Jan Christian Blaise Cruz and Charibeth Cheng. 2022. Improving Large-scale Language Models and Resources for Filipino. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6548–6555, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Improving Large-scale Language Models and Resources for Filipino (Cruz & Cheng, LREC 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2022.lrec-1.703.pdf
- Data
- CCAligned, NewsPH-NLI, WikiText-TL-39