Abstract
A crucial step in deciphering a text is to identify which set of characters was used to write it. This requires grouping character tokens according to visual and contextual features, which can be challenging for human analysts when the number of tokens or underlying types is large. Prior work has shown that this process can be automated by clustering dense representations of character images, in a task which we call “script clustering”. In this work, we present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering. We evaluate on a range of modern and ancient scripts, and find that our models produce representations which are more effective for script recovery than the current state of the art, despite using just ~2% as many parameters. Our analysis fruitfully applies these models to assess hypotheses about the character inventory of the partially deciphered proto-Elamite script.
- Anthology ID: 2023.cawl-1.11
- Volume: Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Venue: CAWL
- Publisher: Association for Computational Linguistics
- Pages: 92–104
- URL: https://aclanthology.org/2023.cawl-1.11
- DOI: 10.18653/v1/2023.cawl-1.11
- Cite (ACL): Logan Born, M. Willis Monroe, Kathryn Kelley, and Anoop Sarkar. 2023. Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023), pages 92–104, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering (Born et al., CAWL 2023)
- PDF: https://preview.aclanthology.org/remove-xml-comments/2023.cawl-1.11.pdf