At the Crossroad of Cuneiform and NLP: Challenges for Fine-grained Part-of-speech Tagging

Gustav Ryberg Smidt, Els Lefever, Katrien de Graef


Abstract
The study of ancient Middle Eastern cultures is dominated by the vast number of cuneiform texts. Multiple languages and language families were expressed in cuneiform. The dominant language written in cuneiform is the Semitic language Akkadian, which is the focus of this paper. We focus specifically on letters written in the dialect used in the area from modern-day Baghdad south towards the Persian Gulf during the Old Babylonian period (c. 2000–1600 B.C.E.). The Akkadian language was rediscovered in the 19th century and is now being scrutinised with Natural Language Processing (NLP) methods. However, existing Akkadian text publications are not always suitable for digital editions. We therefore risk applying NLP methods to renderings of Akkadian that are unfit for the purpose. In this paper we investigate the input material and try to initiate a discussion about best practices at the crossroads where NLP meets cuneiform studies. Specifically, we question the use of pre-trained embeddings, sentence segmentation and the type of cuneiform input used to fine-tune language models for the task of fine-grained part-of-speech tagging. We examine these issues through theoretical and practical approaches, in a way that we hope spurs discussions relevant to the automatic processing of other ancient languages.
Anthology ID:
2024.lrec-main.154
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
1745–1755
URL:
https://aclanthology.org/2024.lrec-main.154
Cite (ACL):
Gustav Ryberg Smidt, Els Lefever, and Katrien de Graef. 2024. At the Crossroad of Cuneiform and NLP: Challenges for Fine-grained Part-of-speech Tagging. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1745–1755, Torino, Italia. ELRA and ICCL.
Cite (Informal):
At the Crossroad of Cuneiform and NLP: Challenges for Fine-grained Part-of-speech Tagging (Smidt et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.154.pdf