Lemmatization of Cuneiform Languages Using the ByT5 Model

Pengxiu Lu, Yonglong Huang, Jing Xu, Minxuan Feng, Chao Xu


Abstract
Lemmatization of cuneiform languages presents a unique challenge due to their complex writing system, which combines syllabic and logographic elements. In this study, we investigate the effectiveness of the ByT5 model in addressing this challenge by developing and evaluating a ByT5-based lemmatization system. Experimental results demonstrate that ByT5 outperforms mT5 in this task, achieving an accuracy of 80.55% on raw lemmas and 82.59% on generalized lemmas, where sense numbers are removed. These findings highlight the potential of ByT5 for lemmatizing cuneiform languages and provide useful insights for future work on ancient text lemmatization.
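The two accuracy figures in the abstract differ only in how lemmas are normalized before comparison. The following is a minimal sketch of that evaluation, not the authors' code. It assumes that a sense number is a trailing Roman numeral or digit string separated from the lemma by whitespace (for example "banû II"), and that accuracy means exact string match. The names `generalize` and `accuracy` and the toy data are hypothetical.

```python
import re

# Assumption: a sense number is a trailing Roman numeral (I, II, ...) or a
# digit string, separated from the lemma by whitespace, e.g. "banû II".
_SENSE_SUFFIX = re.compile(r"\s+(?:[IVX]+|\d+)$")


def generalize(lemma: str) -> str:
    """Strip a trailing sense number, if present (hypothetical convention)."""
    return _SENSE_SUFFIX.sub("", lemma.strip())


def accuracy(predictions, references, normalize=lambda s: s):
    """Exact-match accuracy after applying `normalize` to both sides."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data invented for illustration only; not taken from the paper.
preds = ["banû I", "šarru", "awīlu"]
golds = ["banû II", "šarru", "amēlu"]

raw_acc = accuracy(preds, golds)                        # raw lemmas
gen_acc = accuracy(preds, golds, normalize=generalize)  # sense numbers removed
```

Under these assumptions, a prediction that has the right lemma but the wrong sense number counts as an error on raw lemmas and as correct on generalized lemmas, so the generalized score can never be lower than the raw score.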
Anthology ID:
2025.alp-1.26
Volume:
Proceedings of the Second Workshop on Ancient Language Processing
Month:
May
Year:
2025
Address:
The Albuquerque Convention Center, Laguna
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti, Rachele Sprugnoli
Venues:
ALP | WS
Publisher:
Association for Computational Linguistics
Pages:
197–205
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.26/
Cite (ACL):
Pengxiu Lu, Yonglong Huang, Jing Xu, Minxuan Feng, and Chao Xu. 2025. Lemmatization of Cuneiform Languages Using the ByT5 Model. In Proceedings of the Second Workshop on Ancient Language Processing, pages 197–205, The Albuquerque Convention Center, Laguna. Association for Computational Linguistics.
Cite (Informal):
Lemmatization of Cuneiform Languages Using the ByT5 Model (Lu et al., ALP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.26.pdf