How low is too low? A monolingual take on lemmatisation in Indian languages

Kumar Saunack, Kumar Saurav, Pushpak Bhattacharyya


Abstract
Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. Most prior work on ML-based lemmatization has focused on high-resource languages, where data sets (word forms) are readily available. For languages which have no linguistic work available, especially on morphology, or for languages where the computational realization of linguistic rules is complex and cumbersome, machine-learning-based lemmatizers are the way to go. In this paper, we devote our attention to lemmatisation for low-resource, morphologically rich scheduled Indian languages using neural methods. Here, low resource means only a small number of word forms are available. We perform tests to analyse the variance in monolingual models’ performance on varying the corpus size and contextual morphological tag data for training. We show that monolingual approaches with data augmentation can give competitive accuracy even in the low-resource setting, which augurs well for NLP in low-resource settings.
Anthology ID:
2021.naacl-main.322
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4088–4094
URL:
https://aclanthology.org/2021.naacl-main.322
DOI:
10.18653/v1/2021.naacl-main.322
Cite (ACL):
Kumar Saunack, Kumar Saurav, and Pushpak Bhattacharyya. 2021. How low is too low? A monolingual take on lemmatisation in Indian languages. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4088–4094, Online. Association for Computational Linguistics.
Cite (Informal):
How low is too low? A monolingual take on lemmatisation in Indian languages (Saunack et al., NAACL 2021)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2021.naacl-main.322.pdf
Video:
https://preview.aclanthology.org/naacl-24-ws-corrections/2021.naacl-main.322.mp4