Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering

Logan Born; M. Willis Monroe; Kathryn Kelley; Anoop Sarkar

doi:10.18653/v1/2023.cawl-1.11

Abstract

A crucial step in deciphering a text is to identify which set of characters was used to write it. This requires grouping character tokens according to visual and contextual features, which can be challenging for human analysts when the number of tokens or underlying types is large. Prior work has shown that this process can be automated by clustering dense representations of character images, in a task which we call “script clustering”. In this work, we present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering. We evaluate on a range of modern and ancient scripts, and find that our models produce representations which are more effective for script recovery than the current state-of-the-art, despite using just ~2% as many parameters. Our analysis fruitfully applies these models to assess hypotheses about the character inventory of the partially deciphered proto-Elamite script.

Anthology ID: 2023.cawl-1.11
Volume: Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Kyle Gorman, Richard Sproat, Brian Roark
Venue: CAWL
Publisher: Association for Computational Linguistics
Pages: 92–104
URL: https://aclanthology.org/2023.cawl-1.11
DOI: 10.18653/v1/2023.cawl-1.11
Cite (ACL): Logan Born, M. Willis Monroe, Kathryn Kelley, and Anoop Sarkar. 2023. Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023), pages 92–104, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering (Born et al., CAWL 2023)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2023.cawl-1.11.pdf
