NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo
Abstract
We present NusaBERT, a multilingual model built on IndoBERT and tailored for Indonesia’s diverse languages. By expanding vocabulary and pre-training on a regional corpus, NusaBERT achieves state-of-the-art performance on Indonesian NLU benchmarks, enhancing IndoBERT’s multilingual capability. This study also addresses NusaBERT’s limitations and encourages further research on Indonesia’s underrepresented languages.
- Anthology ID: 2025.sealp-1.2
- Volume: Proceedings of the Second Workshop in South East Asian Language Processing
- Month: January
- Year: 2025
- Address: Online
- Editors: Derry Wijaya, Alham Fikri Aji, Clara Vania, Genta Indra Winata, Ayu Purwarianti
- Venues: sealp | WS
- Publisher: Association for Computational Linguistics
- Pages: 10–26
- URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.sealp-1.2/
- Cite (ACL): Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo. 2025. NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. In Proceedings of the Second Workshop in South East Asian Language Processing, pages 10–26, Online. Association for Computational Linguistics.
- Cite (Informal): NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural (Wongso et al., sealp 2025)
- PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.sealp-1.2.pdf