Santhawat Thanyawong
2025
The Thai Universal Dependency Treebank
Panyut Sriwirote | Wei Qi Leong | Charin Polpanumas | Santhawat Thanyawong | William Chandra Tjhi | Wirote Aroonmanakun | Attapol T. Rutherford
Transactions of the Association for Computational Linguistics, Volume 13
Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we address these gaps by introducing the Thai Universal Dependency Treebank (TUD), a new Thai treebank consisting of 3,627 trees annotated according to the Universal Dependencies (UD) framework. We then benchmark 92 dependency parsing models that incorporate pretrained transformers on Thai-PUD and our TUD, achieving state-of-the-art results and shedding light on the optimal model components for Thai dependency parsing. Our error analysis of the models also reveals that polyfunctional words, serial verb constructions, and the lack of rich morphosyntactic features present the main challenges for Thai dependency parsing.
2019
Written on Leaves or in Stones?: Computational Evidence for the Era of Authorship of Old Thai Prose
Attapol Rutherford | Santhawat Thanyawong
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change
We aim to provide computational evidence for the era of authorship of two important old Thai texts: Traiphumikatha and Pumratchatham. The era of authorship of these two books is still debated among Thai literature scholars. Analysis of old Thai texts presents a challenge for standard natural language processing techniques, due to the lack of corpora necessary for building old Thai word and syllable segmentation. We propose an accurate and interpretable model to classify each segment as one of the three eras of authorship (Sukhothai, Ayuddhya, or Rattanakosin) without sophisticated linguistic preprocessing. Contrary to previous hypotheses, our model suggests that both books were written during the Sukhothai era. Moreover, the second half of the Pumratchatham is uncharacteristic of the Sukhothai era, which may have confounded literary scholars in the past. Further, our model reveals that the most indicative linguistic changes stem from unidirectional grammaticalized words and polyfunctional words, which show up as the most dominant features in the model.