Christian Faggionato
2022
NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties
Christian Faggionato | Nathan Hill | Marieke Meelen
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
In this paper we present our work-in-progress on a fully-implemented pipeline to create deeply-annotated corpora of a number of historical and contemporary Tibetan and Newar varieties. Our off-the-shelf tools allow researchers to create corpora with five different layers of annotation, ranging from morphosyntactic to information-structural annotation. We build on and optimise existing tools (in line with FAIR principles), as well as develop new ones, and show how they can be adapted to other Tibetan and Newar languages, most notably modern endangered languages that are both extremely low-resourced and under-researched.
2019
Developing the Old Tibetan Treebank
Christian Faggionato | Marieke Meelen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
This paper presents a full procedure for the development of a segmented, POS-tagged and chunk-parsed corpus of Old Tibetan. As an extremely low-resource language, Old Tibetan poses non-trivial problems in every step towards the development of a searchable treebank. We demonstrate, however, that a carefully developed, semi-supervised method of optimising and extending existing tools for Classical Tibetan, as well as creating specific ones for Old Tibetan, can address these issues. We thus also present the very first Tibetan Treebank in a variety of formats to facilitate research in the fields of NLP, historical linguistics and Tibetan Studies.