Abstract
We present a subword regularization method for WordPiece, which tokenizes text with a maximum matching algorithm. The proposed method, MaxMatch-Dropout, randomly drops vocabulary words during the maximum matching search. It enables fine-tuning with subword regularization for popular pretrained language models such as BERT-base. Experimental results demonstrate that MaxMatch-Dropout improves performance on text classification and machine translation tasks, just as other subword regularization methods do. Moreover, we provide a comparative analysis of three subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
- Anthology ID:
- 2022.coling-1.430
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 4864–4872
- URL:
- https://aclanthology.org/2022.coling-1.430
- Cite (ACL):
- Tatsuya Hiraoka. 2022. MaxMatch-Dropout: Subword Regularization for WordPiece. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4864–4872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- MaxMatch-Dropout: Subword Regularization for WordPiece (Hiraoka, COLING 2022)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2022.coling-1.430.pdf
- Code
- tathi/maxmatch_dropout
- Data
- GLUE, KLUE, QNLI, SST, SST-2
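
The idea stated in the abstract, randomly dropping vocabulary words during a maximum matching search, can be illustrated with a minimal sketch. This is not the author's implementation (see the linked code repository for that); it is a simplified greedy longest-match loop over a toy vocabulary. The function name `maxmatch_dropout`, the toy `VOCAB`, the `##` continuation prefix handling, and the choice to never drop single-character matches (so that a word can always be segmented) are all assumptions made for illustration.

```python
import random

# Toy WordPiece-style vocabulary (hypothetical, for illustration only).
# Non-initial pieces carry the "##" continuation prefix.
VOCAB = {
    "un", "##aff", "##able",
    "u", "##n", "##a", "##f", "##b", "##l", "##e",
}


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Greedy longest-match tokenization with random dropping of matches.

    At each position, candidates are tried from longest to shortest.
    A candidate found in the vocabulary is skipped with probability p,
    which forces the search to fall back to a shorter piece.
    Assumption of this sketch: single-character matches are never dropped.
    """
    rng = rng or random
    tokens = []
    start = 0
    while start < len(word):
        match = None
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece not in vocab:
                continue
            is_single_char = (end - start) == 1
            if not is_single_char and rng.random() < p:
                continue  # drop this vocabulary word and try a shorter one
            match = (piece, end)
            break
        if match is None:
            return [unk]  # no piece covers this position
        tokens.append(match[0])
        start = match[1]
    return tokens


def detokenize(tokens):
    """Join pieces back into a word by stripping the '##' prefix."""
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(maxmatch_dropout("unaffable", VOCAB, p=0.5, rng=rng))
```

With `p=0.0` the sketch reduces to ordinary deterministic maximum matching, and with `p=1.0` it falls back to single characters; values in between yield varying segmentations of the same word, which is what makes the tokenization usable as a regularizer during fine-tuning.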