Tatsuya Hiraoka


2021

pdf bib
Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Optimizing Word Segmentation for Downstream Task
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: EMNLP 2020

In traditional NLP, we tokenize a given sentence as a preprocessing, and thus the tokenization is unrelated to a target downstream task. To address this issue, we propose a novel method to explore a tokenization which is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task which uses a vector representation of a sentence such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, the state-of-the-art contextualized embeddings and report a positive effect.

2019

pdf bib
Stochastic Tokenization with a Language Model for Neural Text Classification
Tatsuya Hiraoka | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

For unsegmented languages such as Japanese and Chinese, tokenization of a sentence has a significant impact on the performance of text classification. Sentences are usually segmented with words or subwords by a morphological analyzer or byte pair encoding and then encoded with word (or subword) representations for neural networks. However, segmentation is potentially ambiguous, and it is unclear whether the segmented tokens achieve the best performance for the target task. In this paper, we propose a method to simultaneously learn tokenization and text classification to address these problems. Our model incorporates a language model for unsupervised tokenization into a text classifier and then trains both models simultaneously. To make the model robust against infrequent tokens, we sampled segmentation for each sentence stochastically during training, which resulted in improved performance of text classification. We conducted experiments on sentiment analysis as a text classification task and show that our method achieves better performance than previous methods.