Unsupervised Disambiguation of Syncretism in Inflected Lexicons

Ryan Cotterell, Christo Kirov, Sabrina J. Mielke, Jason Eisner


Abstract
Lexical ambiguity makes it difficult to compute useful statistics of a corpus, because a given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (for example, expectation-maximization, EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token’s context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among the different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on five languages.
Anthology ID:
N18-2087
Original:
N18-2087v1
Version 2:
N18-2087v2
Volume:
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Editors:
Marilyn Walker, Heng Ji, Amanda Stent
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
548–553
URL:
https://aclanthology.org/N18-2087
DOI:
10.18653/v1/N18-2087
Cite (ACL):
Ryan Cotterell, Christo Kirov, Sabrina J. Mielke, and Jason Eisner. 2018. Unsupervised Disambiguation of Syncretism in Inflected Lexicons. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 548–553, New Orleans, Louisiana. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Disambiguation of Syncretism in Inflected Lexicons (Cotterell et al., NAACL 2018)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/N18-2087.pdf
Poster:
N18-2087.Poster.pdf
Data:
Universal Dependencies