Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models
Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, Mamoru Komachi
Abstract
Encoder-decoder models typically employ only words that are used frequently in the training corpus, both to limit computational cost and to exclude noisy words. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting words that are more suitable for learning encoders by utilizing not only frequency but also co-occurrence information, which we capture using the HITS algorithm. The proposed method is applied to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, this method achieved a BLEU score 0.56 points higher than that of a baseline. It also outperformed the baseline method on English grammatical error correction, with an F-measure 1.48 points higher.
- Anthology ID: P18-3016
- Volume: Proceedings of ACL 2018, Student Research Workshop
- Month: July
- Year: 2018
- Address: Melbourne, Australia
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 112–119
- URL: https://aclanthology.org/P18-3016
- DOI: 10.18653/v1/P18-3016
- Cite (ACL): Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, and Mamoru Komachi. 2018. Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models. In Proceedings of ACL 2018, Student Research Workshop, pages 112–119, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal): Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models (Katsumata et al., ACL 2018)
- PDF: https://preview.aclanthology.org/ingestion-script-update/P18-3016.pdf
- Code: Katsumata420/HITS_Ranking
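
The linked Katsumata420/HITS_Ranking repository contains the authors' code. As a rough illustration of the idea described in the abstract, the minimal sketch below runs HITS-style hub/authority updates over a directed word co-occurrence graph and keeps the top-ranked words as the vocabulary. It is not the paper's implementation: the graph construction (adjacent-word edges), iteration count, and vocabulary cutoff are illustrative assumptions.

```python
from collections import defaultdict

def hits_scores(edges, iterations=50):
    """Run HITS on a directed word co-occurrence graph.

    edges: iterable of (source_word, target_word) pairs, e.g. adjacent words
    in a sentence (this pairing scheme is an illustrative assumption).
    Returns (authority, hub) score dictionaries.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, tgt in edges:
        out_links[src].add(tgt)
        in_links[tgt].add(src)

    hub = {w: 1.0 for w in out_links}
    auth = {w: 1.0 for w in in_links}

    for _ in range(iterations):
        # Authority update: a word pointed to by strong hubs gets a high score.
        auth = {w: sum(hub[s] for s in in_links[w]) for w in in_links}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {w: v / norm for w, v in auth.items()}
        # Hub update: a word that points to strong authorities gets a high score.
        hub = {w: sum(auth[t] for t in out_links[w]) for w in out_links}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {w: v / norm for w, v in hub.items()}

    return auth, hub

# Example: build edges from adjacent-word co-occurrences, then keep the
# top-N words by authority score as the vocabulary (N = 30000 is arbitrary here).
sentences = [["she", "plays", "tennis"], ["he", "plays", "soccer"]]
edges = [(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)]
auth, hub = hits_scores(edges)
vocabulary = [w for w, _ in sorted(auth.items(), key=lambda kv: -kv[1])][:30000]
print(vocabulary)
```

The point of ranking by HITS scores rather than raw counts is that a word's usefulness is judged by the words it co-occurs with, so frequent but weakly connected (noisy) words can be filtered out of the vocabulary.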