HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model

Xuheng Cai, Erica Zhang


Abstract
Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but they are often blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) they cannot handle severely damaged or completely missing hieroglyphs, and (ii) they make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach that models hieroglyph recovery as a next word prediction task and uses language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at https://github.com/Rick-Cai/HieroLM/.
Anthology ID:
2025.latechclfl-1.4
Volume:
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Anna Kazantseva, Stan Szpakowicz, Stefania Degaetano-Ortlieb, Yuri Bizzoni, Janis Pagel
Venues:
LaTeCHCLfL | WS
Publisher:
Association for Computational Linguistics
Pages:
25–31
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.4/
Cite (ACL):
Xuheng Cai and Erica Zhang. 2025. HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 25–31, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model (Cai & Zhang, LaTeCHCLfL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.4.pdf