Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking
Mingyu Lee, Jun-Hyung Park, Junho Kim, Kang-Min Kim, SangKeun Lee
Abstract
Self-supervised pre-training has achieved remarkable success in a wide range of natural language processing tasks. Masked language modeling (MLM) has been widely used for pre-training effective bidirectional representations but comes at a substantial training cost. In this paper, we propose a novel concept-based curriculum masking (CCM) method to efficiently pre-train a language model. CCM has two key differences from existing curriculum learning approaches that allow it to effectively reflect the nature of MLM. First, we introduce a carefully designed linguistic difficulty criterion that evaluates the MLM difficulty of each token. Second, we construct a curriculum that masks easy words and phrases first and gradually masks words related to the previously masked ones, based on a knowledge graph. Experimental results show that CCM significantly improves pre-training efficiency. Specifically, the model trained with CCM achieves performance comparable to the original BERT on the General Language Understanding Evaluation (GLUE) benchmark at half of the training cost.
- Anthology ID:
- 2022.emnlp-main.502
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Editors:
- Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7417–7427
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.emnlp-main.502/
- DOI:
- 10.18653/v1/2022.emnlp-main.502
- Cite (ACL):
- Mingyu Lee, Jun-Hyung Park, Junho Kim, Kang-Min Kim, and SangKeun Lee. 2022. Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7417–7427, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking (Lee et al., EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.emnlp-main.502.pdf
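Based on the abstract's description, the sketch below illustrates one way a concept-based masking curriculum could be organized: concepts are ordered from easy seeds outward over a knowledge graph, and only tokens belonging to currently unlocked concepts are eligible for masking. The function names, graph format, and masking probability are illustrative assumptions, not the authors' released implementation.

```python
import math
import random
from collections import deque

def build_curriculum(knowledge_graph, seed_concepts, num_stages):
    """Order concepts from easy to hard by breadth-first expansion over the
    knowledge graph, starting from easy seed concepts, then split into stages."""
    visited = set(seed_concepts)
    frontier = deque(seed_concepts)
    ordered = list(seed_concepts)
    while frontier:
        concept = frontier.popleft()
        for neighbor in knowledge_graph.get(concept, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
                ordered.append(neighbor)
    stage_size = math.ceil(len(ordered) / num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

def curriculum_mask(tokens, maskable_concepts, mask_token="[MASK]", mask_prob=0.15):
    """Mask only tokens that belong to concepts unlocked at the current stage."""
    return [
        mask_token if token in maskable_concepts and random.random() < mask_prob
        else token
        for token in tokens
    ]

# Toy example: a small knowledge graph linking related concepts.
kg = {
    "dog": ["animal", "bark"],
    "animal": ["organism"],
    "bark": ["sound"],
}
stages = build_curriculum(kg, seed_concepts=["dog"], num_stages=2)
unlocked = set()
for stage in stages:
    unlocked.update(stage)  # gradually enlarge the set of maskable concepts
    print(curriculum_mask("the dog can bark".split(), unlocked))
```

In this sketch, later training stages admit more concepts into the maskable set, so masking starts with easy seed concepts and gradually extends to graph-related ones, mirroring the curriculum described in the abstract.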