Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction
Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, Anna Korhonen
Abstract
Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.
- Anthology ID: Q18-1032
- Volume: Transactions of the Association for Computational Linguistics, Volume 6
- Year: 2018
- Address: Cambridge, MA
- Editors: Lillian Lee, Mark Johnson, Kristina Toutanova, Brian Roark
- Venue: TACL
- Publisher: MIT Press
- Pages: 451–465
- URL: https://preview.aclanthology.org/add_missing_videos/Q18-1032/
- DOI: 10.1162/tacl_a_00032
- Cite (ACL): Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018. Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction. Transactions of the Association for Computational Linguistics, 6:451–465.
- Cite (Informal): Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction (Gerz et al., TACL 2018)
- PDF: https://preview.aclanthology.org/add_missing_videos/Q18-1032.pdf