Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data

Olia Toporkov, Alan Akbik, Rodrigo Agerri


Abstract
Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are at contextual lemmatization. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches fine-tuned out-of-domain and (ii) cross-lingual methods against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided with just a few examples. Data and code will be made available upon publication.
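
As a concrete illustration of the few-shot setting described in the abstract, the minimal Python sketch below assembles an in-context lemmatization prompt of the kind an LLM would be given. The instruction wording, the example sentence/lemma pairs, and the build_prompt helper are illustrative assumptions, not the paper's exact prompts or data.

# Minimal sketch of few-shot in-context lemmatization prompting.
# The template and demonstration pairs below are hypothetical,
# not the configuration used in the paper.

FEW_SHOT = [
    ("The cats were running home .", "the cat be run home ."),
    ("She has written three letters .", "she have write three letter ."),
]

def build_prompt(sentence: str) -> str:
    """Assemble a few-shot prompt asking for one lemma per input token."""
    lines = [
        "Lemmatize each token of the sentence. "
        "Return the lemmas in order, space-separated."
    ]
    for src, lemmas in FEW_SHOT:
        lines.append(f"Sentence: {src}")
        lines.append(f"Lemmas: {lemmas}")
    # The target sentence goes last; the model completes the "Lemmas:" line.
    lines.append(f"Sentence: {sentence}")
    lines.append("Lemmas:")
    return "\n".join(lines)

print(build_prompt("The dogs barked loudly ."))

The resulting string would be sent to the LLM as-is; no fine-tuning or language-specific training data is involved, which is the contrast the paper draws against supervised encoder-based lemmatizers.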
Anthology ID:
2025.findings-emnlp.988
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
18219–18232
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.988/
DOI:
10.18653/v1/2025.findings-emnlp.988
Cite (ACL):
Olia Toporkov, Alan Akbik, and Rodrigo Agerri. 2025. Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 18219–18232, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data (Toporkov et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.988.pdf
Checklist:
2025.findings-emnlp.988.checklist.pdf