Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, Lane Schwartz


Abstract
Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.
Anthology ID:
2021.tacl-1.16
Volume:
Transactions of the Association for Computational Linguistics, Volume 9
Year:
2021
Address:
Cambridge, MA
Editors:
Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
261–276
URL:
https://preview.aclanthology.org/icon-24-ingestion/2021.tacl-1.16/
DOI:
10.1162/tacl_a_00365
Cite (ACL):
Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology Matters: A Multilingual Language Modeling Analysis. Transactions of the Association for Computational Linguistics, 9:261–276.
Cite (Informal):
Morphology Matters: A Multilingual Language Modeling Analysis (Park et al., TACL 2021)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2021.tacl-1.16.pdf
Video:
https://preview.aclanthology.org/icon-24-ingestion/2021.tacl-1.16.mp4