The Vedic Compound Dataset

Sven Sellmer; Oliver Hellwig

Abstract

This paper introduces the Vedic Compound Dataset (VCD), the first resource providing annotated compounds from Vedic Sanskrit, a South Asian Indo-European language used from ca. 1500 to 500 BCE. The VCD aims to facilitate the study of language change in early Indo-Iranian and offers comparative material for quantitative cross-linguistic research on compounds. Annotating Vedic compounds is complex because they include five of the six basic types of compounds defined by Scalise & Bisetto (2005), which are, however, not consistently marked in morphosyntax, making their automatic classification a significant challenge. The paper details the process of collecting and preprocessing the relevant data, with a particular focus on the question of how to distinguish exocentric from endocentric usage. It further discusses experiments with a simple ML classifier that uses compound-internal syntactic relations, outlines the composition of the dataset, and sketches directions for future research.

Anthology ID: 2024.mwe-1.8
Volume: Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
Venues: MWE | UDW | WS
SIGs: SIGPARSE | SIGLEX
Publisher: ELRA and ICCL
Pages: 50–55
URL: https://aclanthology.org/2024.mwe-1.8
Cite (ACL): Sven Sellmer and Oliver Hellwig. 2024. The Vedic Compound Dataset. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 50–55, Torino, Italia. ELRA and ICCL.
Cite (Informal): The Vedic Compound Dataset (Sellmer & Hellwig, MWE-UDW-WS 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.mwe-1.8.pdf
