Joshua Otten
2025
Towards Ancient Meroitic Decipherment: A Computational Approach
Joshua Otten
|
Antonios Anastasopoulos
Proceedings of the Second Workshop on Ancient Language Processing
The discovery of the Rosetta Stone was one of the keys that helped unlock the secrets of Ancient Egypt and its hieroglyphic lan-guage. But what about languages with no such “Rosetta Stone?” Meroitic is an ancient lan-guage from what is now present-day Sudan, but even though it is connected to Egyptian in many ways, much of its grammar and vocabu-lary remains undeciphered. In this work, we in-troduce the challenge of Meroitic decipherment as a computational task, and present the first Meroitic machine-readable corpus. We then train embeddings and perform intrinsic evalu-ations, as well as cross-lingual alignment ex-periments between Meroitic and Late-Egyptian. We conclude by outlining open problems and potential research directions.
Script-Agnosticism and its Impact on Language Identification for Dravidian Languages
Milind Agarwal
|
Joshua Otten
|
Antonios Anastasopoulos
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.