An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them

Creston Brooks, Johannes Haubold, Charlie Cowen-Breen, Jay White, Desmond DeVaul, Frederick Riemenschneider, Karthik R Narasimhan, Barbara Graziosi


Abstract
As premodern texts are passed down over centuries, errors inevitably accrue. These errors can be challenging to identify, as some have survived undetected for so long precisely because they are so elusive. While prior work has evaluated error detection methods on artificially generated errors, we introduce the first dataset of real errors in premodern Greek, enabling the evaluation of error detection methods on errors that genuinely accumulated at some stage in the centuries-long copying process. To create this dataset, we use metrics derived from BERT conditionals to sample 1,000 words more likely to contain errors, which are then annotated and labeled by a domain expert as errors or not. We then propose and evaluate new error detection methods and find that our discriminator-based detector outperforms all other methods, improving the true positive rate for classifying real errors by 5%. We additionally observe that scribal errors are more difficult to detect than print or digitization errors. Our dataset enables the evaluation of error detection methods on real errors in premodern texts for the first time, providing a benchmark for developing more effective error detection algorithms to assist scholars in restoring premodern works.
Anthology ID:
2025.findings-naacl.401
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7188–7202
URL:
https://preview.aclanthology.org/moar-dois/2025.findings-naacl.401/
DOI:
10.18653/v1/2025.findings-naacl.401
Cite (ACL):
Creston Brooks, Johannes Haubold, Charlie Cowen-Breen, Jay White, Desmond DeVaul, Frederick Riemenschneider, Karthik R Narasimhan, and Barbara Graziosi. 2025. An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7188–7202, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them (Brooks et al., Findings 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.findings-naacl.401.pdf