Abstract
Pretrained language models require consistent segmentation (e.g., subword- or character-level segmentation) between pretraining and finetuning. In NLP, many tasks are better modeled with subword-level segmentation than with character-level segmentation, but several tasks require character-level segmentation because of their format. Thus, to tackle both types of NLP tasks, language models must be pretrained independently for subword- and character-level segmentation, which is an inefficient and costly procedure. Instead, this paper proposes a method for training a language model with unified segmentation, so that the trained model can be finetuned on both subword- and character-level segmentation. The principle of the method is to apply the subword regularization technique to generate a mixture of subword- and character-level segmentations. Through experiments on BERT models, we demonstrate that our method can halve the computational cost of pretraining.
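As a rough illustration of the idea summarized above (not the authors' released implementation), the following Python sketch shows one way to sample a mixture of subword- and character-level segmentations using SentencePiece's subword-regularization sampling; the mixing probability `char_prob`, the smoothing parameter `alpha`, and the model path `tokenizer.model` are assumed values for illustration.

```python
# Illustrative sketch only: mix character-level splits with sampled subword
# segmentations, in the spirit of subword regularization. Names and values
# (char_prob, alpha, tokenizer.model) are hypothetical.
import random
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed path

def mixed_segmentation(text: str, char_prob: float = 0.5, alpha: float = 0.1):
    """Return either a character-level or a sampled subword segmentation."""
    if random.random() < char_prob:
        # Simplified character-level split; whitespace is mapped to
        # SentencePiece's word-boundary symbol '▁' for consistency.
        return list(text.replace(" ", "\u2581"))
    # Subword regularization: sample one segmentation from the unigram LM
    # (nbest_size=-1 samples over all hypotheses; alpha controls smoothing).
    return sp.encode(text, out_type=str, enable_sampling=True,
                     nbest_size=-1, alpha=alpha)
```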
- Anthology ID:
- 2023.ranlp-1.62
- Volume:
- Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
- Month:
- September
- Year:
- 2023
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd., Shoumen, Bulgaria
- Pages:
- 568–577
- URL:
- https://aclanthology.org/2023.ranlp-1.62
- Cite (ACL):
- Shun Kiyono, Sho Takase, Shengzhe Li, and Toshinori Sato. 2023. Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 568–577, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
- Cite (Informal):
- Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models (Kiyono et al., RANLP 2023)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2023.ranlp-1.62.pdf