2025
WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat | Pume Tuchinda | Lalita Lowphansirikul | Surapon Nonesung | Panuthep Tasawong | Alham Fikri Aji | Can Udomcharoenchaikit | Sarana Nutanong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
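To make the evaluation setting concrete, the sketch below shows how a single Thai instruction example could be formatted into a chat prompt and answered zero-shot with an off-the-shelf instruction-tuned model via Hugging Face Transformers. The model name, the instruction/input field names, and the Thai example are placeholders for illustration, not the paper's exact protocol or data schema.

# Minimal zero-shot sketch, assuming a generic instruction/input record layout.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder: any instruction-tuned chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

example = {
    "instruction": "สรุปข้อความต่อไปนี้",  # "Summarize the following text"
    "input": "...",                      # Thai source text goes here
}

# Build a chat-style prompt from one dataset record.
messages = [{"role": "user",
             "content": f"{example['instruction']}\n\n{example['input']}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))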
2024
McCrolin: Multi-consistency Cross-lingual Training for Retrieval Question Answering
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Potsawee Manakul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2024
Automated question answering (QA) systems are increasingly relying on robust cross-lingual retrieval to identify and utilize information from multilingual sources, ensuring comprehensive and contextually accurate responses. Existing approaches often struggle with consistency across multiple languages and multi-size input scenarios. To address these challenges, we propose McCrolin, a Multi-consistency Cross-lingual training framework, leveraging multi-task learning to enhance cross-lingual consistency, ranking stability, and input-size robustness. Experimental results demonstrate that McCrolin achieves state-of-the-art performance on standard cross-lingual retrieval QA datasets. Furthermore, McCrolin outperforms competitors when dealing with various input sizes on downstream tasks. In terms of generalizability, results from further analysis show that our method is effective for various encoder architectures and sizes.
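As a rough illustration of cross-lingual consistency (not McCrolin's actual training objectives), the sketch below penalizes disagreement between the retrieval score distributions produced by the same query encoded in two different languages, using a symmetric KL divergence over a shared candidate-passage set; the tensor shapes and temperature are assumptions.

# Generic cross-lingual consistency loss sketch for retrieval QA.
import torch
import torch.nn.functional as F

def consistency_loss(q_lang1, q_lang2, passages, temperature=0.05):
    """q_lang1, q_lang2: (d,) query embeddings; passages: (n, d) passage embeddings."""
    q1 = F.normalize(q_lang1, dim=-1)
    q2 = F.normalize(q_lang2, dim=-1)
    p = F.normalize(passages, dim=-1)

    scores1 = (p @ q1) / temperature        # (n,) similarity scores, language 1
    scores2 = (p @ q2) / temperature        # (n,) similarity scores, language 2

    logp1 = F.log_softmax(scores1, dim=-1)  # ranking distributions over passages
    logp2 = F.log_softmax(scores2, dim=-1)

    # Symmetric KL: the two languages should rank the passages the same way.
    kl = F.kl_div(logp1, logp2, log_target=True, reduction="sum")
    kl_rev = F.kl_div(logp2, logp1, log_target=True, reduction="sum")
    return 0.5 * (kl + kl_rev)

# Toy usage with random embeddings.
loss = consistency_loss(torch.randn(768), torch.randn(768), torch.randn(16, 768))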
2023
PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun | Korakot Chaovavanich | Charin Polpanumas | Arthit Suriyawongkul | Lalita Lowphansirikul | Pattarawat Chormai | Peerat Limkonchotiwat | Thanathip Suntorntip | Can Udomcharoenchaikit
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first provide a brief historical context of tools for the Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provides as well as its datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
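A short tour of a few core PyThaiNLP functions (word segmentation, part-of-speech tagging, and romanization); the example sentence and the commented outputs are illustrative.

# Quick tour of core PyThaiNLP functions (pip install pythainlp).
from pythainlp import word_tokenize
from pythainlp.tag import pos_tag
from pythainlp.transliterate import romanize

text = "ผมรักภาษาไทย"  # "I love the Thai language"

tokens = word_tokenize(text, engine="newmm")  # dictionary-based word segmentation
print(tokens)                                 # e.g. ['ผม', 'รัก', 'ภาษาไทย']

print(pos_tag(tokens))      # part-of-speech tags, one (word, tag) pair per token
print(romanize("ภาษาไทย"))  # Thai-to-Latin transliteration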
An Efficient Self-Supervised Cross-View Training For Sentence Embedding
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Transactions of the Association for Computational Linguistics, Volume 11
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to five baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using five PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
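For context, the sketch below shows the standard STS evaluation protocol the abstract refers to: encode each sentence pair, score it with cosine similarity, and correlate the predictions with human ratings using Spearman's rank correlation. The checkpoint name, sentence pairs, and gold scores are placeholders, not SCT artifacts.

# Standard STS evaluation sketch with sentence-transformers and scipy.
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder PLM

pairs = [("A man is playing a guitar.", "A person plays an instrument."),
         ("A child is riding a bicycle.", "A kid rides a bike."),
         ("A dog runs in the park.", "The stock market fell today.")]
gold = [4.2, 4.8, 0.1]  # illustrative human similarity ratings (0-5 scale)

emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)
pred = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()  # cosine score per pair

print("Spearman:", spearmanr(pred, gold).correlation)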
2022
ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2022
Sentence representations are essential in many NLP tasks operating at the sentence level. Recently, research attention has shifted towards learning how to represent sentences without any annotations, i.e., unsupervised representation learning. Despite the benefit of training without supervised data, there is still a performance penalty compared to supervised methods. Furthermore, the supervised-unsupervised performance gap widens as we reduce the model size. In this paper, we propose an unsupervised sentence representation method to reduce the supervised-unsupervised performance gap, especially for smaller models. Utilizing the concept of knowledge distillation, we derive a distillation framework comprising two training objectives, control and generalize, called ConGen. Experiments on semantic textual similarity (STS), text classification (transfer), and natural language inference (NLI) tasks show that ConGen is on par with supervised training even on smaller models. Furthermore, our method consistently outperforms competitors on multilingual STS. The code and models are available at https://github.com/KornWtp/ConGen.
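As a rough sketch of the knowledge-distillation setup the abstract describes (not ConGen's actual control and generalize objectives, which are in the linked repository), a frozen teacher encoder can guide a smaller trainable student by aligning their sentence embeddings with a cosine distillation loss; both model names are placeholders.

# Generic sentence-embedding distillation sketch (teacher-to-student alignment).
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("all-MiniLM-L12-v2")  # frozen teacher (384-dim output)
student = SentenceTransformer("all-MiniLM-L6-v2")   # smaller trainable student (384-dim)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

sentences = ["The cat sits on the mat.", "Thai NLP tools are improving quickly."]

with torch.no_grad():
    t_emb = teacher.encode(sentences, convert_to_tensor=True)  # teacher targets

# Forward pass through the student with gradients enabled.
features = student.tokenize(sentences)
features = {k: v.to(student.device) for k, v in features.items()}
s_emb = student(features)["sentence_embedding"]

# Align student embeddings with the teacher's (cosine distillation loss).
loss = 1 - F.cosine_similarity(s_emb, t_emb, dim=-1).mean()
loss.backward()
optimizer.step()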