Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, Lane Schwartz


Abstract
Abstract Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.1 We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.
Anthology ID:
2021.tacl-1.16
Volume:
Transactions of the Association for Computational Linguistics, Volume 9
Month:
Year:
2021
Address:
Cambridge, MA
Venue:
TACL
SIG:
Publisher:
MIT Press
Note:
Pages:
261–276
Language:
URL:
https://aclanthology.org/2021.tacl-1.16
DOI:
10.1162/tacl_a_00365
Bibkey:
Cite (ACL):
Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology Matters: A Multilingual Language Modeling Analysis. Transactions of the Association for Computational Linguistics, 9:261–276.
Cite (Informal):
Morphology Matters: A Multilingual Language Modeling Analysis (Park et al., TACL 2021)
Copy Citation:
PDF:
https://preview.aclanthology.org/auto-file-uploads/2021.tacl-1.16.pdf
Video:
 https://preview.aclanthology.org/auto-file-uploads/2021.tacl-1.16.mp4