Abstract
Although the Indonesian language is spoken by almost 200 million people and is the 10th most-spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset, comprising seven tasks for the Indonesian language that span morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.
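Since IndoBERT follows the standard BERT architecture, it can be used as a drop-in encoder for downstream tasks. Below is a minimal sketch, assuming the released checkpoint is available on the HuggingFace Hub under the ID `indolem/indobert-base-uncased` (an assumption; check the IndoLEM release for the authoritative name), of how one might load it and extract contextual embeddings:

```python
# Minimal sketch (not from the paper) of loading IndoBERT via the
# HuggingFace transformers library. The model ID below is assumed.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

# Encode an Indonesian sentence
# ("IndoBERT is a language model for Indonesian.").
inputs = tokenizer(
    "IndoBERT adalah model bahasa untuk bahasa Indonesia.",
    return_tensors="pt",
)
outputs = model(**inputs)

# Contextual embeddings: one 768-dim vector per subword token,
# as in any BERT-base encoder.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```

For the IndoLEM tasks, a task-specific head (e.g. a classification or tagging layer) would be placed on top of these encoder outputs and fine-tuned in the usual way.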
- Anthology ID: 2020.coling-main.66
- Volume: Proceedings of the 28th International Conference on Computational Linguistics
- Month: December
- Year: 2020
- Address: Barcelona, Spain (Online)
- Venue: COLING
- Publisher: International Committee on Computational Linguistics
- Pages: 757–770
- URL: https://aclanthology.org/2020.coling-main.66
- DOI: 10.18653/v1/2020.coling-main.66
- Cite (ACL): Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics, pages 757–770, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal): IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP (Koto et al., COLING 2020)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.66.pdf
- Data: GLUE, IndoSum, SuperGLUE, XGLUE