Script-Agnosticism and its Impact on Language Identification for Dravidian Languages

Milind Agarwal, Joshua Otten, Antonios Anastasopoulos


Abstract
Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing), focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification, and that models trained this way remain competitive on naturally occurring text.
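To make the idea of word-level script randomization concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a rough transliteration can be obtained by shifting code points between the Tamil, Telugu, Kannada, and Malayalam Unicode blocks, which share a parallel layout; the function names (`script_of`, `shift_word`, `randomize_scripts`) are hypothetical, and the paper may use a different transliteration tool and mixing scheme.

```python
import random

# Start of each script's Unicode block. The four blocks share a parallel
# layout, so a code point offset gives an approximate transliteration.
# Assumption for illustration only, not the paper's transliteration method.
BLOCK_START = {
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def script_of(ch):
    """Return the script whose Unicode block contains ch, or None."""
    cp = ord(ch)
    for name, start in BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def shift_word(word, target):
    """Move every character from one of the four blocks into the target block."""
    out = []
    for ch in word:
        src = script_of(ch)
        if src is None:
            out.append(ch)  # punctuation, digits, Latin text: left untouched
        else:
            out.append(chr(ord(ch) - BLOCK_START[src] + BLOCK_START[target]))
    return "".join(out)


def randomize_scripts(sentence, rng):
    """Choose a target script independently for each whitespace-separated word."""
    scripts = list(BLOCK_START)
    return " ".join(shift_word(w, rng.choice(scripts)) for w in sentence.split())


if __name__ == "__main__":
    # The Tamil word for "Tamil", written with escapes to avoid font issues.
    sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
    print(randomize_scripts(sample, random.Random(0)))
```

The offset mapping is only approximate: the blocks are not perfectly aligned, so some shifted code points land on unassigned positions or on letters the target script does not normally use. A production pipeline would rely on a proper transliteration library instead.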
Anthology ID:
2025.naacl-long.377
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
7364–7384
URL:
https://preview.aclanthology.org/landing_page/2025.naacl-long.377/
Cite (ACL):
Milind Agarwal, Joshua Otten, and Antonios Anastasopoulos. 2025. Script-Agnosticism and its Impact on Language Identification for Dravidian Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7364–7384, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Script-Agnosticism and its Impact on Language Identification for Dravidian Languages (Agarwal et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.naacl-long.377.pdf