Abstract
This paper introduces the Vedic Compound Dataset (VCD), the first resource providing annotated compounds from Vedic Sanskrit, a South Asian Indo-European language used from ca. 1500 to 500 BCE. The VCD aims at facilitating the study of language change in early Indo-Iranian and offers comparative material for quantitative cross-linguistic research on compounds. The process of annotating Vedic compounds is complex as they contain five of the six basic types of compounds defined by Scalise & Bisetto (2005), which are, however, not consistently marked in morphosyntax, making their automatic classification a significant challenge. The paper details the process of collecting and preprocessing the relevant data, with a particular focus on the question of how to distinguish exocentric from endocentric usage. It further discusses experiments with a simple ML classifier that uses compound-internal syntactic relations, outlines the composition of the dataset, and sketches directions for future research.
- Anthology ID: 2024.mwe-1.8
- Volume: Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
- Month: May
- Year: 2024
- Address: Torino, Italia
- Editors: Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
- Venues: MWE | UDW | WS
- SIGs: SIGPARSE | SIGLEX
- Publisher: ELRA and ICCL
- Pages: 50–55
- URL: https://aclanthology.org/2024.mwe-1.8
- Cite (ACL): Sven Sellmer and Oliver Hellwig. 2024. The Vedic Compound Dataset. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 50–55, Torino, Italia. ELRA and ICCL.
- Cite (Informal): The Vedic Compound Dataset (Sellmer & Hellwig, MWE-UDW-WS 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.mwe-1.8.pdf