Abstract
This paper introduces a new Universal Dependencies treebank for the Tatar language named NMCTT. A significant feature of the corpus is that it includes code-switching (CS) information at a morpheme level, given the fact that Tatar texts contain intra-word CS between Tatar and Russian. We first outline NMCTT with a focus on differences from other treebanks of Turkic languages. Then, to evaluate the merit of the CS annotation, this study concisely reports the results of a language identification task implemented with Conditional Random Fields that considers POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the proposed annotation scheme introduced in NMCTT can improve the performance of the subword-level language identification. This annotation scheme for CS is not only universally applicable to languages with CS, but also shows a possibility to employ morphosyntactic information for CS-related downstream tasks.
- Anthology ID:
- 2022.eurali-1.17
- Volume:
- Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Atul Kr. Ojha, Sina Ahmadi, Chao-Hong Liu, John P. McCrae
- Venue:
- EURALI
- Publisher:
- European Language Resources Association
- Pages:
- 95–104
- URL:
- https://aclanthology.org/2022.eurali-1.17
- Cite (ACL):
- Chihiro Taguchi, Sei Iwata, and Taro Watanabe. 2022. Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, pages 95–104, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information (Taguchi et al., EURALI 2022)
- PDF:
- https://preview.aclanthology.org/add_acl24_videos/2022.eurali-1.17.pdf