Abstract
Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-studied in modern NLP systems compared to Literary Tamil written in the Tamil script, as evidenced by a lack of datasets explicitly targeting the Spoken variety. In this paper, we release IruMozhi, a human-translated dataset of parallel text in Literary and Spoken Tamil. Using IruMozhi, we train classifiers on the task of identifying which Tamil variety a text belongs to. We use these models to gauge the availability of pretraining data in Spoken Tamil, to audit the composition of existing labelled datasets for Tamil, and to encourage future work on the variety.
- Anthology ID:
- 2024.findings-naacl.195
- Volume:
- Findings of the Association for Computational Linguistics: NAACL 2024
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Kevin Duh, Helena Gomez, Steven Bethard
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3096–3103
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-naacl.195/
- DOI:
- 10.18653/v1/2024.findings-naacl.195
- Cite (ACL):
- Kabilan Prasanna and Aryaman Arora. 2024. IruMozhi: Automatically classifying diglossia in Tamil. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3096–3103, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- IruMozhi: Automatically classifying diglossia in Tamil (Prasanna & Arora, Findings 2024)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-naacl.195.pdf