Abstract
In this work, we explore generating morphologically enhanced word embeddings for Tamil, a highly agglutinative South Indian language with rich morphology that remains low-resource with regard to NLP tasks. We present the first word analogy dataset for Tamil, consisting of 4499 hand-curated word tetrads across 10 semantic and 13 morphological relation types. Using a rule-based segmenter to capture morphology, together with meta-embedding techniques, we train meta-embeddings that outperform existing baselines by 16% on our analogy task and appear to mitigate a previously observed trade-off between semantic and morphological accuracy.
- Anthology ID:
- 2021.naacl-srw.13
- Volume:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Esin Durmus, Vivek Gupta, Nelson Liu, Nanyun Peng, Yu Su
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 94–111
- URL:
- https://aclanthology.org/2021.naacl-srw.13
- DOI:
- 10.18653/v1/2021.naacl-srw.13
- Cite (ACL):
- Arjun Sai Krishnan and Seyoon Ragavan. 2021. Morphology-Aware Meta-Embeddings for Tamil. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 94–111, Online. Association for Computational Linguistics.
- Cite (Informal):
- Morphology-Aware Meta-Embeddings for Tamil (Krishnan & Ragavan, NAACL 2021)
- PDF:
- https://preview.aclanthology.org/aacl-23-doi-ingestion/2021.naacl-srw.13.pdf
- Code:
- arjun-sai-krishnan/tamil-morpho-embeddings