Pakawat Nakwijit
2026
OasisSimp: An Open-source Asian-English Sentence Simplification Dataset
Hannah Liu | Murphy Tian | Iqra Ali | Haonan Gao | Qiaoyiwen Wu | Blair Yang | Uthayasanker Thayasivam | Annie En-Shiun Lee | Pakawat Nakwijit | Surangika Ranathunga | Ravi Shekhar
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Text simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce OasisSimp, a multilingual dataset for sentence-level text simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created through direct human annotation, where trained annotators followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on OasisSimp and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. OasisSimp thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource text simplification. The dataset will be open-sourced upon acceptance.
2023
Lexicools at SemEval-2023 Task 10: Sexism Lexicon Construction via XAI
Pakawat Nakwijit | Mahmoud Samir | Matthew Purver
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper presents our work on SemEval-2023 Task 10, Explainable Detection of Online Sexism (EDOS), using lexicon-based models. Our approach consists of three main steps: lexicon construction based on Pointwise Mutual Information (PMI) and Shapley values, lexicon augmentation using an unannotated corpus and Large Language Models (LLMs), and, lastly, lexical incorporation for Bag-of-Words (BoW) logistic regression and fine-tuning LLMs. Our results demonstrate that our Shapley approach effectively produces a high-quality lexicon. We also show that simply counting the presence of certain words in our lexicons and comparing the counts can outperform BoW logistic regression in tasks B and C and fine-tuned BERT in task C. In the end, our classifier achieved F1-scores of 53.34% and 27.31% on the official blind test sets for tasks B and C, respectively. We additionally provide an in-depth analysis highlighting model limitations and bias. We also present our attempts to understand the model’s behaviour based on our constructed lexicons. Our code and the resulting lexicons are open-sourced in our GitHub repository: https://github.com/SirBadr/SemEval2022-Task10.
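As a rough illustration of the PMI-based lexicon construction and the count-based classification described in the abstract above, the following is a minimal sketch, not the authors' implementation. The function names `pmi_lexicon` and `classify_by_count`, the whitespace tokenisation, the document-level presence counts, and the toy data are all assumptions made for this example; the Shapley-based scoring and the lexicon augmentation step are not covered.

```python
import math
from collections import Counter


def pmi_lexicon(docs, labels, target, top_k=2, min_count=2):
    """Return the top_k words ranked by PMI with the target label.

    PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated here from
    document-level presence counts (an assumption of this sketch).
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    word_df = Counter()         # number of documents containing each word
    word_target_df = Counter()  # same, restricted to target-label documents
    for text, y in zip(docs, labels):
        for w in set(text.lower().split()):
            word_df[w] += 1
            if y == target:
                word_target_df[w] += 1

    scores = {}
    for w, df in word_df.items():
        joint = word_target_df[w]
        # Skip rare words and words never seen with the target label.
        if df < min_count or joint == 0:
            continue
        p_joint = joint / n_docs
        p_word = df / n_docs
        p_label = n_target / n_docs
        scores[w] = math.log(p_joint / (p_word * p_label))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def classify_by_count(text, lexicons, default):
    """Pick the label whose lexicon matches the most tokens in the text.

    Falls back to `default` when no lexicon word occurs at all.
    """
    tokens = text.lower().split()
    counts = {label: sum(t in lex for t in tokens)
              for label, lex in lexicons.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default
```

With a toy corpus such as `["red apple pie", "red apple tart", "green pear pie", "green pear tart"]` labelled `["x", "x", "y", "y"]`, the words that occur only with one label receive the highest PMI for that label, while words shared evenly across labels score zero. A text is then assigned to whichever label's lexicon it matches most often.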
2022
Misspelling Semantics in Thai
Pakawat Nakwijit | Matthew Purver
Proceedings of the Thirteenth Language Resources and Evaluation Conference
User-generated content is full of misspellings. Rather than being just random noise, we hypothesise that many misspellings contain hidden semantics that can be leveraged for language understanding tasks. This paper presents a fine-grained annotated corpus of misspellings in Thai, together with an analysis of misspelling intention and its possible semantics, to get a better understanding of the misspelling patterns observed in the corpus. In addition, we introduce two approaches to incorporate the semantics of misspelling: Misspelling Average Embedding (MAE) and Misspelling Semantic Tokens (MST). Experiments on a sentiment analysis task confirm our overall hypothesis: additional semantics from misspelling can boost the micro F1 score by 0.4-2%, while blindly normalising misspelling is harmful and suboptimal.