Abstract
This article proposes a method that uses the Tesseract OCR engine to aid palaeographic analysis and scribal identification. It repurposes the confidence score reported by the OCR engine and applies several visualization methods to surface differences between font families, script types and manuscript hands.
- Anthology ID:
- 2023.nlp4dh-1.20
- Volume:
- Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
- Month:
- December
- Year:
- 2023
- Address:
- Tokyo, Japan
- Editors:
- Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
- Venues:
- NLP4DH | IWCLUL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 162–171
- URL:
- https://aclanthology.org/2023.nlp4dh-1.20
- Cite (ACL):
- Antonia Karaisl. 2023. A Question of Confidence: Using OCR Technology for Script analysis. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 162–171, Tokyo, Japan. Association for Computational Linguistics.
- Cite (Informal):
- A Question of Confidence: Using OCR Technology for Script analysis (Karaisl, NLP4DH-IWCLUL 2023)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/2023.nlp4dh-1.20.pdf