TSD: Towards Computational Processing of Tamil Similes - A Tamil Simile Dataset

Aathavan Nithiyananthan, Jathushan Raveendra, Uthayasanker Thayasivam


Abstract
A simile is a powerful figure of speech that makes a comparison between two different things via shared properties, often using words like “like” or “as” to create vivid imagery, convey emotions, and enhance understanding. However, computational research on similes is limited in low-resource languages like Tamil due to the lack of simile datasets. This work introduces a manually annotated Tamil Simile Dataset (TSD) comprising around 1.5k simile sentences drawn from various sources. Our data annotation guidelines ensure that all the simile sentences are annotated with the three components, namely tenor, vehicle, and context. We benchmark our dataset for simile interpretation and simile generation tasks using chosen pre-trained language models (PLMs) and present the results. Our findings highlight the challenges of simile tasks in Tamil, suggesting areas for further improvement. We believe that TSD will drive progress in computational simile processing for Tamil and other low-resource languages, further advancing simile related tasks in Natural Language Processing.
Anthology ID:
2025.dravidianlangtech-1.99
Volume:
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Month:
May
Year:
2025
Address:
Acoma, The Albuquerque Convention Center, Albuquerque, New Mexico
Editors:
Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar Madasamy, Sajeetha Thavareesan, Elizabeth Sherly, Saranya Rajiakodi, Balasubramanian Palani, Malliga Subramanian, Subalalitha Cn, Dhivya Chinnappa
Venues:
DravidianLangTech | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
573–579
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.dravidianlangtech-1.99/
DOI:
Bibkey:
Cite (ACL):
Aathavan Nithiyananthan, Jathushan Raveendra, and Uthayasanker Thayasivam. 2025. TSD: Towards Computational Processing of Tamil Similes - A Tamil Simile Dataset. In Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, pages 573–579, Acoma, The Albuquerque Convention Center, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
TSD: Towards Computational Processing of Tamil Similes - A Tamil Simile Dataset (Nithiyananthan et al., DravidianLangTech 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.dravidianlangtech-1.99.pdf