Atsushi Keyaki
2021
Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
Optimizing Word Segmentation for Downstream Task
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: EMNLP 2020
In traditional NLP, tokenization is performed as a preprocessing step, so the resulting tokenization is unrelated to the target downstream task. To address this mismatch, we propose a novel method that explores a tokenization appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be applied to any downstream task that uses a vector representation of a sentence, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we incorporate OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect.
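The core idea in the abstract — scoring candidate tokenizations so that the downstream loss can prefer useful segmentations — can be illustrated with a toy sketch. This is not the authors' implementation: the candidate list, the `logp` scores, and the fixed sentence vectors are all hypothetical stand-ins. It only shows the weighted-mixture step, where each candidate tokenization's sentence vector is weighted by a softmax over tokenization scores and summed, so a loss on the mixed vector could (in a real model) flow back into those scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of log-scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_sentence_vectors(candidates):
    """Weight each candidate tokenization's sentence vector by its
    tokenization probability and sum them into one mixed vector.
    In a trained model the downstream loss on this mixed vector would
    push probability mass toward task-appropriate tokenizations."""
    weights = softmax([c["logp"] for c in candidates])
    dim = len(candidates[0]["vec"])
    mixed = [0.0] * dim
    for w, c in zip(weights, candidates):
        for i, v in enumerate(c["vec"]):
            mixed[i] += w * v
    return mixed, weights

# Toy example: two hypothetical candidate tokenizations of one sentence,
# each with a made-up log-probability and sentence vector.
candidates = [
    {"tokens": ["op", "tok"], "logp": -1.0, "vec": [1.0, 0.0]},
    {"tokens": ["o", "p", "tok"], "logp": -2.0, "vec": [0.0, 1.0]},
]
mixed, weights = mix_sentence_vectors(candidates)
```

Because the mixture is differentiable in the tokenization scores, gradient descent on the downstream loss can reshape the weights, which is the mechanism the abstract describes at a high level.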