Improving Lemmatization of Non-Standard Languages with Joint Learning

Enrique Manjavacas; Ákos Kádár; Mike Kestemont

doi:10.18653/v1/N19-1153

Abstract

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an Encoder-Decoder architecture, which we enrich with sentence information using a hierarchical sentence encoder. We show significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. We also test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than both a model without fine-tuned sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, which is based on openly accessible sources.
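
The joint objective mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the boundary symbols, the function names, and the `lm_weight` value are assumptions made for illustration, and the abstract does not state how the two losses are weighted. The sketch shows how bidirectional language-model targets could be derived from a sentence (next token for the forward direction, previous token for the backward direction) and how an auxiliary LM loss might be added to the lemmatization loss.

```python
from typing import List, Tuple

# Hypothetical sentence-boundary symbols (not taken from the paper).
BOS, EOS = "<bos>", "<eos>"


def bilm_targets(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """For each token, return (token, forward_target, backward_target).

    A forward LM predicts the next token from the left context;
    a backward LM predicts the previous token from the right context.
    """
    padded = [BOS] + tokens + [EOS]
    return [
        (padded[i], padded[i + 1], padded[i - 1])
        for i in range(1, len(padded) - 1)
    ]


def joint_loss(lemma_loss: float, fwd_lm_loss: float, bwd_lm_loss: float,
               lm_weight: float = 0.2) -> float:
    """Add the averaged bidirectional LM loss to the lemmatization loss.

    lm_weight is an assumed hyperparameter, chosen here only for illustration.
    """
    return lemma_loss + lm_weight * 0.5 * (fwd_lm_loss + bwd_lm_loss)
```

In a real model the three loss values would come from the lemma decoder and from two prediction layers on top of the sentence encoder; here they are plain floats so that the combination step stays visible.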

Anthology ID: N19-1153
Volume: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month: June
Year: 2019
Address: Minneapolis, Minnesota
Editors: Jill Burstein, Christy Doran, Thamar Solorio
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 1493–1503
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/N19-1153/
DOI: 10.18653/v1/N19-1153
Cite (ACL): Enrique Manjavacas, Ákos Kádár, and Mike Kestemont. 2019. Improving Lemmatization of Non-Standard Languages with Joint Learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1493–1503, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal): Improving Lemmatization of Non-Standard Languages with Joint Learning (Manjavacas et al., NAACL 2019)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/N19-1153.pdf
Code: emanjavacas/pie + additional community code
Data: Universal Dependencies
