Japanese Text Normalization with Encoder-Decoder Model

Taishi Ikeda; Hiroyuki Shindo; Yuji Matsumoto

Abstract

Text normalization is the task of transforming lexical variants to their canonical forms. We model the problem of text normalization as a character-level sequence-to-sequence learning problem and present a neural encoder-decoder model for solving it. Training an encoder-decoder model generally requires many sentence pairs. However, parallel corpora that pair Japanese non-standard forms with their canonical forms are scarce. To address this issue, we propose a data augmentation method that increases the data size by converting existing resources into synthesized non-standard forms using handcrafted rules. We conducted an experiment demonstrating that the synthesized corpus contributes to stable training of the encoder-decoder model and improves the performance of Japanese text normalization.
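
The augmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two rewrite rules, the function names `synthesize` and `make_pairs`, and the probability parameter `p` are all assumptions made up for illustration, since the paper's actual handcrafted rules are not listed on this page. The sketch applies rules to canonical sentences and emits character-level (non-standard, canonical) pairs of the kind a character-level encoder-decoder could be trained on.

```python
import random

# Hypothetical rewrite rules (NOT the paper's rules): each maps a canonical
# substring to one plausible non-standard spelling.
RULES = [
    ("すごい", "すげえ"),
    ("ない", "ねえ"),
]


def synthesize(canonical, rules=RULES, p=1.0, rng=None):
    """Apply each matching rule with probability p and return the noisy text."""
    rng = rng or random.Random(0)
    noisy = canonical
    for standard, variant in rules:
        if standard in noisy and rng.random() < p:
            noisy = noisy.replace(standard, variant)
    return noisy


def make_pairs(sentences, rules=RULES, p=1.0, seed=0):
    """Build character-level (source, target) pairs from canonical sentences.

    Sentences that no rule changes are skipped, because they would add
    no non-standard form to the synthesized corpus.
    """
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        noisy = synthesize(sentence, rules, p, rng)
        if noisy != sentence:
            # Source is the synthesized non-standard form, target the original.
            pairs.append((list(noisy), list(sentence)))
    return pairs


pairs = make_pairs(["この映画はすごい", "今日は晴れです"])
```

Here only the first sentence matches a rule, so `pairs` holds a single entry whose source characters spell the synthesized variant and whose target characters spell the original sentence. Setting `p` below 1.0 would leave some matches untouched, which is one possible way to vary the synthesized data.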

Anthology ID: W16-3918
Volume: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Bo Han, Alan Ritter, Leon Derczynski, Wei Xu, Tim Baldwin
Venue: WNUT
Publisher: The COLING 2016 Organizing Committee
Pages: 129–137
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/W16-3918/
Cite (ACL): Taishi Ikeda, Hiroyuki Shindo, and Yuji Matsumoto. 2016. Japanese Text Normalization with Encoder-Decoder Model. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 129–137, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Japanese Text Normalization with Encoder-Decoder Model (Ikeda et al., WNUT 2016)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/W16-3918.pdf
