Abstract
This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.

- Anthology ID: 2023.ranlp-1.77
- Volume: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
- Month: September
- Year: 2023
- Address: Varna, Bulgaria
- Editors: Ruslan Mitkov, Galia Angelova
- Venue: RANLP
- Publisher: INCOMA Ltd., Shoumen, Bulgaria
- Pages: 712–720
- URL: https://aclanthology.org/2023.ranlp-1.77
- Cite (ACL): Iva Marinova, Kiril Simov, and Petya Osenova. 2023. Transformer-Based Language Models for Bulgarian. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 712–720, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
- Cite (Informal): Transformer-Based Language Models for Bulgarian (Marinova et al., RANLP 2023)
- PDF: https://preview.aclanthology.org/naacl24-info/2023.ranlp-1.77.pdf
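The lexicon-based removal of inappropriate language mentioned in the abstract can be sketched as follows. This is a minimal illustration only: the lexicon contents and the function name are hypothetical placeholders, not resources or code from the paper.

```python
# Minimal sketch of lexicon-based sentence filtering for pre-training data.
# INAPPROPRIATE_LEXICON is a hypothetical placeholder, not the paper's resource.
INAPPROPRIATE_LEXICON = {"badword1", "badword2"}

def filter_sentences(sentences, lexicon=INAPPROPRIATE_LEXICON):
    """Keep only sentences containing no lexicon term (case-insensitive)."""
    kept = []
    for sentence in sentences:
        # Strip common punctuation and lowercase each token before matching.
        tokens = {tok.strip('.,!?;:"').lower() for tok in sentence.split()}
        if tokens.isdisjoint(lexicon):
            kept.append(sentence)
    return kept

corpus = ["A clean sentence.", "This contains badword1 here."]
print(filter_sentences(corpus))  # → ['A clean sentence.']
```

In practice such a filter would run over tokenized corpus shards before pre-training, with a curated Bulgarian lexicon in place of the placeholder set.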