Kabilan Prasanna


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2024

pdf bib
IruMozhi: Automatically classifying diglossia in Tamil
Kabilan Prasanna | Aryaman Arora
Findings of the Association for Computational Linguistics: NAACL 2024

Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-studied in modern NLP systems compared to Literary Tamil written in the Tamil script, as evidenced by a lack of datasets explicitly targetting the Spoken variety. In this paper, we release IruMozhi, a human-translated dataset of parallel text in Literary and Spoken Tamil. Using IruMozhi, we train classifiers on the task of identifying which Tamil variety a text belongs to. We use these models to gauge the availability of pretraining data in Spoken Tamil, to audit the composition of existing labelled datasets for Tamil, and to encourage future work on the variety.