Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information

Chihiro Taguchi, Sei Iwata, Taro Watanabe


Abstract
This paper introduces NMCTT, a new Universal Dependencies treebank for the Tatar language. A significant feature of the corpus is that it includes code-switching (CS) information at the morpheme level, because Tatar texts contain intra-word CS between Tatar and Russian. We first outline NMCTT, focusing on its differences from other treebanks of Turkic languages. Then, to evaluate the merit of the CS annotation, we briefly report the results of a language identification task implemented with Conditional Random Fields that take into account POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the annotation scheme introduced in NMCTT can improve the performance of subword-level language identification. This annotation scheme for CS is not only universally applicable to languages with CS, but also suggests that morphosyntactic information can be employed for CS-related downstream tasks.
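The abstract mentions feeding POS tag information from CoNLL-U treebanks into a Conditional Random Field for language identification. As a rough illustration only, and not the authors' implementation, the sketch below reads token rows from a CoNLL-U string and builds per-token feature dictionaries of the kind a CRF toolkit typically consumes. The sample tokens are placeholders, and the MISC key `Lang`, the function names `parse_conllu` and `token_features`, and the chosen feature set are all assumptions made for this example; the actual attribute names used in NMCTT and SAGT may differ.

```python
# Minimal sketch: CoNLL-U rows -> feature dicts with POS context for a CRF.
# ASSUMPTIONS: the MISC key "Lang", the placeholder tokens, and the feature
# set are hypothetical; they are not taken from NMCTT or SAGT.

SAMPLE = """\
# sent_id = example-1
1\ttokena\tlemmaa\tNOUN\t_\t_\t2\tnsubj\t_\tLang=tt
2\ttokenb\tlemmab\tVERB\t_\t_\t0\troot\t_\tLang=ru
"""


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep only plain word rows; skip ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
        current.append({"form": cols[1], "upos": cols[3], "lang": misc.get("Lang")})
    if current:
        sentences.append(current)
    return sentences


def token_features(sentence, i):
    """Build surface and POS-context features for the i-th token."""
    token = sentence[i]
    return {
        "lower": token["form"].lower(),
        "suffix3": token["form"][-3:],
        "upos": token["upos"],
        "prev_upos": sentence[i - 1]["upos"] if i > 0 else "BOS",
        "next_upos": sentence[i + 1]["upos"] if i < len(sentence) - 1 else "EOS",
    }


sentences = parse_conllu(SAMPLE)
features = [token_features(sentences[0], i) for i in range(len(sentences[0]))]
labels = [token["lang"] for token in sentences[0]]
```

The resulting `features` and `labels` lists are aligned per token, which is the input shape most sequence-labeling CRF libraries expect. A subword-level variant would emit one row per morpheme rather than per token.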
Anthology ID:
2022.eurali-1.17
Volume:
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Atul Kr. Ojha, Sina Ahmadi, Chao-Hong Liu, John P. McCrae
Venue:
EURALI
Publisher:
European Language Resources Association
Pages:
95–104
URL:
https://aclanthology.org/2022.eurali-1.17
Cite (ACL):
Chihiro Taguchi, Sei Iwata, and Taro Watanabe. 2022. Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, pages 95–104, Marseille, France. European Language Resources Association.
Cite (Informal):
Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information (Taguchi et al., EURALI 2022)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2022.eurali-1.17.pdf