Building and Analysis of Tamil Lyric Corpus with Semantic Representation

Karthika Ranganathan, Geetha T V


Abstract
In the era of modern technology, the cloud has become a library for many things, including entertainment such as song lyrics. To create awareness of the language and to increase interest in Tamil film lyrics, a machine-readable Tamil lyric corpus is necessary for mining lyric documents. In this paper, we collect a Tamil lyric corpus from various books and lyric websites, and we address the challenges faced while building it. The corpus contains 15,286 documents, with all the lyric information stored in XML format. We also explain the Universal Networking Language (UNL) semantic representation, which helps to represent a document in a language- and domain-independent way. We evaluate the corpus by performing simple statistical analyses of characters and words, along with a few rhetorical effect analyses. We also compare our semantic representation with existing work, and the results are encouraging.
Anthology ID:
2022.amta-coco4mt.3
Volume:
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)
Month:
September
Year:
2022
Editors:
John E. Ortega, Marine Carpuat, William Chen, Katharina Kann, Constantine Lignos, Maja Popovic, Shabnam Tafreshi
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
18–27
URL:
https://aclanthology.org/2022.amta-coco4mt.3
Cite (ACL):
Karthika Ranganathan and Geetha T V. 2022. Building and Analysis of Tamil Lyric Corpus with Semantic Representation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation), pages 18–27. Association for Machine Translation in the Americas.
Cite (Informal):
Building and Analysis of Tamil Lyric Corpus with Semantic Representation (Ranganathan & T V, AMTA 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.amta-coco4mt.3.pdf