Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

Koren Lazar, Benny Saret, Asaf Yehudai, Wayne Horowitz, Nathan Wasserman, Gabriel Stanovsky


Abstract
We present models that complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE – 100 CE). Because the tablets have deteriorated, scholars often rely on contextual cues to fill in missing parts of the text manually, a subjective and time-consuming process. We observe that this challenge can be formulated as a masked language modelling task, an objective used mostly for pretraining contextualized language models. We then develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that despite data scarcity (1M tokens), we can achieve state-of-the-art performance on missing-token prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
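The abstract reports results as hit@5 without defining the metric on this page. A minimal sketch, assuming the common definition (a prediction counts as a hit when the gold token appears among the model's top k ranked candidates), might look like the following. The function name `hit_at_k` and the placeholder tokens are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def hit_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_tokens: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of masked positions whose gold token is in the top-k candidates.

    Assumed definition of hit@k; the paper may compute it differently.
    """
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("need one ranked candidate list per gold token")
    if not gold_tokens:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_tokens)
        if gold in candidates[:k]  # only the k best-ranked candidates count
    )
    return hits / len(gold_tokens)


# Hypothetical placeholder data: two masked positions, candidates ranked best-first.
predictions = [["a", "b", "c"], ["d", "e", "f"]]
gold = ["b", "x"]
score = hit_at_k(predictions, gold, k=2)
```

Under this definition, a score of 0.89 at k=5 would mean the correct token was among the five highest-ranked suggestions for 89% of the masked positions.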
Anthology ID:
2021.emnlp-main.384
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4682–4691
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.384/
DOI:
10.18653/v1/2021.emnlp-main.384
Cite (ACL):
Koren Lazar, Benny Saret, Asaf Yehudai, Wayne Horowitz, Nathan Wasserman, and Gabriel Stanovsky. 2021. Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4682–4691, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach (Lazar et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.384.pdf
Video:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.384.mp4