William Chandra Tjhi


2025

The Thai Universal Dependency Treebank
Panyut Sriwirote | Wei Qi Leong | Charin Polpanumas | Santhawat Thanyawong | William Chandra Tjhi | Wirote Aroonmanakun | Attapol T. Rutherford
Transactions of the Association for Computational Linguistics, Volume 13

Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we addressed these gaps by introducing the Thai Universal Dependency Treebank (TUD), a new Thai treebank consisting of 3,627 trees annotated according to the Universal Dependencies (UD) framework. We then benchmarked 92 dependency parsing models that incorporate pretrained transformers on Thai-PUD and our TUD, achieving state-of-the-art results and shedding light on the optimal model components for Thai dependency parsing. Our error analysis of the models also reveals that polyfunctional words, serial verb construction, and lack of rich morphosyntactic features present main challenges for Thai dependency parsing.

2024

Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
Jann Railey Montalan | Jian Gang Ngui | Wei Qi Leong | Yosephine Susanto | Hamsawardhini Rengarajan | Alham Fikri Aji | William Chandra Tjhi
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language
A M Abirami | Wei Qi Leong | Hamsawardhini Rengarajan | D Anitha | R Suganya | Himanshu Singh | Kengatharaiyer Sarveswaran | William Chandra Tjhi | Rajiv Ratn Shah
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Tamil is a relatively low-resource language in the field of Natural Language Processing (NLP). Recent years have seen a growth in Tamil NLP datasets in Natural Language Understanding (NLU) or Natural Language Generation (NLG) tasks, but high-quality linguistic resources remain scarce. In order to alleviate this gap in resources, this paper introduces Aalamaram, a treebank with rich linguistic annotations for the Tamil language. It is hitherto the largest publicly available Tamil treebank with almost 10,000 sentences from diverse sources and is annotated for the tasks of Part-of-speech (POS) tagging, Named Entity Recognition (NER), Morphological Parsing and Dependency Parsing. Close attention has also been paid to multi-word segmentation, especially in the context of Tamil clitics. Although the treebank is based largely on the Universal Dependencies (UD) specifications, significant effort has been made to adjust the annotation rules according to the idiosyncrasies and complexities of the Tamil language, thereby providing a valuable resource for linguistic research and NLP developments.