Optimizing Word Segmentation for Downstream Task
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
Abstract
In traditional NLP, a given sentence is tokenized as preprocessing, so the tokenization is independent of the target downstream task. To address this issue, we propose a novel method to explore a tokenization that is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task's loss. OpTok can be used for any downstream task that uses a vector representation of a sentence, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect.
- Anthology ID:
- 2020.findings-emnlp.120
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1341–1351
- URL:
- https://aclanthology.org/2020.findings-emnlp.120
- DOI:
- 10.18653/v1/2020.findings-emnlp.120
- Cite (ACL):
- Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, and Naoaki Okazaki. 2020. Optimizing Word Segmentation for Downstream Task. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1341–1351, Online. Association for Computational Linguistics.
- Cite (Informal):
- Optimizing Word Segmentation for Downstream Task (Hiraoka et al., Findings 2020)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2020.findings-emnlp.120.pdf
- Code
- tatHi/optok
- Data
- SNLI
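The abstract describes weighting candidate tokenizations of a sentence by probabilities learned from the downstream loss. The following is a minimal, hypothetical sketch of that idea (not the authors' implementation from tatHi/optok): it brute-forces all tokenizations of a string under a toy vocabulary, scores each with unigram log-probabilities, and mixes per-tokenization sentence vectors with the softmax of those scores. The vocabulary, scores, and embeddings here are invented for illustration; in the paper, the mixture weights and embeddings would be trained end-to-end against the task loss.

```python
import math

def tokenizations(s, vocab):
    """Enumerate all segmentations of s whose pieces are all in vocab."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            for rest in tokenizations(s[i:], vocab):
                yield [piece] + rest

def sentence_vector(s, vocab, lm_logprob, embed, dim):
    """Mix per-tokenization sentence vectors, weighted by a softmax over
    unigram LM scores of each candidate tokenization."""
    cands = list(tokenizations(s, vocab))
    scores = [sum(lm_logprob[t] for t in c) for c in cands]
    m = max(scores)  # subtract max for numerical stability
    ws = [math.exp(x - m) for x in scores]
    z = sum(ws)
    probs = [w / z for w in ws]
    vec = [0.0] * dim
    for p, c in zip(probs, cands):
        # sentence vector of one candidate = mean of its token embeddings
        cv = [sum(embed[t][d] for t in c) / len(c) for d in range(dim)]
        vec = [v + p * x for v, x in zip(vec, cv)]
    return vec, list(zip(cands, probs))
```

For example, with a vocabulary `{"a", "b", "c", "ab", "abc"}`, the string `"abc"` has three candidate tokenizations (`["abc"]`, `["ab", "c"]`, `["a", "b", "c"]`), and the returned vector is their probability-weighted mixture. Because the mixture is differentiable in the LM scores, a task loss computed on the vector can push probability mass toward task-appropriate tokenizations.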