Abstract
Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an Encoder-Decoder architecture which we enrich with sentence information using a hierarchical sentence encoder. We show significant improvements over the state-of-the-art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. We also test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than a model without fine-tuned sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, which is based on openly accessible sources.
 - Anthology ID:
 - N19-1153
 - Volume:
 - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
 - Month:
 - June
 - Year:
 - 2019
 - Address:
 - Minneapolis, Minnesota
 - Venue:
 - NAACL
 - Publisher:
 - Association for Computational Linguistics
 - Pages:
 - 1493–1503
 - URL:
 - https://aclanthology.org/N19-1153
 - DOI:
 - 10.18653/v1/N19-1153
 - Cite (ACL):
 - Enrique Manjavacas, Ákos Kádár, and Mike Kestemont. 2019. Improving Lemmatization of Non-Standard Languages with Joint Learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1493–1503, Minneapolis, Minnesota. Association for Computational Linguistics.
 - Cite (Informal):
 - Improving Lemmatization of Non-Standard Languages with Joint Learning (Manjavacas et al., NAACL 2019)
 - PDF:
 - https://preview.aclanthology.org/ingestion-script-update/N19-1153.pdf
 - Code:
 - emanjavacas/pie
 - Data:
 - Universal Dependencies