2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya
|
Holy Lovenia
|
Joel Ruben Antony Moniz
|
Tack Hwa Wong
|
Mohammad Rifqi Farhansyah
|
Thant Thiri Maung
|
Frederikus Hudi
|
David Anugraha
|
Muhammad Ravi Shulthan Habibi
|
Muhammad Reza Qorib
|
Amit Agarwal
|
Joseph Marvin Imperial
|
Hitesh Laxmichand Patel
|
Vicky Feliren
|
Bahrul Ilmi Nasution
|
Manuel Antonio Rufino
|
Genta Indra Winata
|
Rian Adam Rajagede
|
Carlos Rafael Catalan
|
Mohamed Fazli Mohamed Imam
|
Priyaranjan Pattnayak
|
Salsabila Zahirah Pranida
|
Kevin Pratama
|
Yeshil Bangera
|
Adisai Na-Thalang
|
Patricia Nicole Monderin
|
Yueqi Song
|
Christian Simon
|
Lynnette Hui Xian Ng
|
Richardy Lobo Sapan
|
Taki Hasan Rafi
|
Bin Wang
|
Supryadi
|
Kanyakorn Veerakanjana
|
Piyalitt Ittichaiwong
|
Matthew Theodore Roque
|
Karissa Vincentio
|
Takdanai Kreangphet
|
Phakphum Artkaew
|
Kadek Hendrawan Palgunadi
|
Yanzhi Yu
|
Rochana Prih Hastuti
|
William Nixon
|
Mithil Bangera
|
Adrian Xuan Wei Lim
|
Aye Hninn Khine
|
Hanif Muhammad Zhafran
|
Teddy Ferdinan
|
Audra Aurora Izzani
|
Ayushman Singh
|
Evan Evan
|
Jauza Akbar Krito
|
Michael Anugraha
|
Fenal Ashokbhai Ilasariya
|
Haochen Li
|
John Amadeo Daniswara
|
Filbert Aurelian Tjiaranata
|
Eryawan Presma Yulianrifat
|
Can Udomcharoenchaikit
|
Fadil Risdian Ansori
|
Mahardika Krisna Ihsani
|
Giang Nguyen
|
Anab Maulana Barik
|
Dan John Velasco
|
Rifo Ahmad Genadi
|
Saptarshi Saha
|
Chengwei Wei
|
Isaiah Edri W. Flores
|
Kenneth Chen Ko Han
|
Anjela Gail D. Santos
|
Wan Shen Lim
|
Kaung Si Phyo
|
Tim Santos
|
Meisyarah Dwiastuti
|
Jiayun Luo
|
Jan Christian Blaise Cruz
|
Ming Shan Hee
|
Ikhlasul Akmal Hanif
|
M.Alif Al Hakim
|
Muhammad Rizky Sya’ban
|
Kun Kerdthaisong
|
Lester James Validad Miranda
|
Fajri Koto
|
Tirana Noor Fatyanosa
|
Alham Fikri Aji
|
Jostin Jerico Rosal
|
Jun Kevin
|
Robert Wijaya
|
Onno P. Kampman
|
Ruochen Zhang
|
Börje F. Karlsson
|
Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant, high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation, and we evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million culturally relevant SEA images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Shivalika Singh
|
Angelika Romanou
|
Clémentine Fourrier
|
David Ifeoluwa Adelani
|
Jian Gang Ngui
|
Daniel Vila-Suero
|
Peerat Limkonchotiwat
|
Kelly Marchisio
|
Wei Qi Leong
|
Yosephine Susanto
|
Raymond Ng
|
Shayne Longpre
|
Sebastian Ruder
|
Wei-Yin Ko
|
Antoine Bosselut
|
Alice Oh
|
Andre Martins
|
Leshem Choshen
|
Daphne Ippolito
|
Enzo Ferrante
|
Marzieh Fadaee
|
Beyza Ermis
|
Sara Hooker
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve. A common practice to fill this gap is to machine-translate English evaluation sets. However, translation introduces language bias and carries over cultural and regional assumptions from the original questions, often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices. Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when evaluated on a subset of questions explicitly marked as culturally sensitive. We release Global MMLU, a multilingual extension of MMLU across 42 languages, featuring improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic to enable a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts.
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Yosephine Susanto
|
Adithya Venkatadri Hulagadri
|
Jann Railey Montalan
|
Jian Gang Ngui
|
Xianbin Yong
|
Wei Qi Leong
|
Hamsawardhini Rengarajan
|
Peerat Limkonchotiwat
|
Yifan Mai
|
William Chandra Tjhi
Findings of the Association for Computational Linguistics: ACL 2025
With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous, integrated multilingual and multicultural benchmarks has become more pronounced. Though existing LLM benchmarks can evaluate specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and culturally representative evaluation suite for SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasises SEA languages, comprising five core pillars: (1) NLP CLASSICS, (2) LLM-SPECIFICS, (3) SEA LINGUISTICS, (4) SEA CULTURE, (5) SAFETY. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models’ multilingual and multicultural performance in a systematic and user-friendly manner. We make the SEA-HELM evaluation code publicly available.
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee
|
Pume Tuchinda
|
Jinheon Baek
|
Samuel Cahyawijaya
|
Can Udomcharoenchaikit
|
Potsawee Manakul
|
Peerat Limkonchotiwat
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2025
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
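To make the evaluation setup concrete, here is a minimal sketch of the PoT split between reasoning (program generation) and execution; `call_llm` is a hypothetical stand-in for an LLM client, and the prompt wording is illustrative rather than taken from the paper.

```python
# Minimal Program-of-Thought sketch: the model reasons by writing Python,
# and correctness is checked by executing that program.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client."""
    raise NotImplementedError

def generate_program(question: str) -> str:
    # Reasoning step: ask the model to write code that sets `answer`.
    prompt = (
        "Solve the question by writing Python code that stores the final "
        f"result in a variable named `answer`.\n\nQuestion: {question}\n"
    )
    return call_llm(prompt)

def execute_program(program: str):
    # Execution step: run the generated code and read back `answer`.
    namespace = {}
    exec(program, namespace)  # caution: sandbox untrusted code in practice
    return namespace.get("answer")

def pot_accuracy(examples):
    correct = 0
    for ex in examples:  # each ex is {"question": ..., "answer": ...}
        try:
            pred = execute_program(generate_program(ex["question"]))
        except Exception:
            pred = None
        correct += int(pred == ex["answer"])
    return correct / len(examples)
```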
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata
|
Frederikus Hudi
|
Patrick Amadeus Irawan
|
David Anugraha
|
Rifki Afina Putri
|
Wang Yutong
|
Adam Nohejl
|
Ubaidillah Ariq Prathama
|
Nedjma Ousidhoum
|
Afifa Amriani
|
Anar Rzayev
|
Anirban Das
|
Ashmari Pramodya
|
Aulia Adila
|
Bryan Wilie
|
Candy Olivia Mawalim
|
Cheng Ching Lam
|
Daud Abolade
|
Emmanuele Chersoni
|
Enrico Santus
|
Fariz Ikhwantri
|
Garry Kuwanto
|
Hanyang Zhao
|
Haryo Akbarianto Wibowo
|
Holy Lovenia
|
Jan Christian Blaise Cruz
|
Jan Wira Gotama Putra
|
Junho Myung
|
Lucky Susanto
|
Maria Angelica Riera Machin
|
Marina Zhukova
|
Michael Anugraha
|
Muhammad Farid Adilazuarda
|
Natasha Christabelle Santosa
|
Peerat Limkonchotiwat
|
Raj Dabre
|
Rio Alexander Audino
|
Samuel Cahyawijaya
|
Shi-Xiong Zhang
|
Stephanie Yulia Salim
|
Yi Zhou
|
Yinxuan Gui
|
David Ifeoluwa Adelani
|
En-Shiun Annie Lee
|
Shogo Okada
|
Ayu Purwarianti
|
Alham Fikri Aji
|
Taro Watanabe
|
Derry Tanti Wijaya
|
Alice Oh
|
Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
2024
Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai
Parinthapat Pengpun
|
Can Udomcharoenchaikit
|
Weerayut Buaphet
|
Peerat Limkonchotiwat
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.
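As a rough illustration of the three-stage pipeline described in the abstract (topic generation, Wikipedia retrieval, instruction creation), the sketch below shows the overall flow; `call_llm` and `retrieve_wikipedia` are hypothetical stand-ins, and the prompts are not the paper's templates.

```python
# Schematic seed-free generation loop: LLM-generated topics -> retrieved
# context -> task-specific instruction-response pairs.

import json

TASKS = ["question answering", "summarization", "conversation"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real LLM client

def retrieve_wikipedia(topic: str, lang: str = "th") -> str:
    raise NotImplementedError  # replace with a Wikipedia (or other) retriever

def generate_dataset(n_topics: int = 10):
    topics = json.loads(
        call_llm(f"List {n_topics} diverse Thai topics as a JSON array of strings.")
    )
    dataset = []
    for topic in topics:
        context = retrieve_wikipedia(topic)          # grounds fluency and cultural context
        for task in TASKS:
            record = call_llm(
                f"Using this Thai context, write one {task} instruction and its response "
                f"as JSON with keys 'instruction' and 'response'.\n\nContext: {context}"
            )
            dataset.append(json.loads(record))
    return dataset
```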
SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering
Norawit Urailertprasert
|
Peerat Limkonchotiwat
|
Supasorn Suwajanakorn
|
Sarana Nutanong
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Visual Question Answering (VQA) is a critical task that requires the simultaneous understanding of visual and textual information. While significant advancements have been made with multilingual datasets, these often lack cultural specificity, especially in the context of Southeast Asia (SEA). In this paper, we introduce SEA-VQA, aiming to highlight the challenges and gaps in existing VQA models when confronted with culturally specific content. Our dataset includes images from eight SEA countries, curated from the UNESCO Cultural Heritage collection. Our evaluation, comparing GPT-4 and GEMINI models, demonstrates substantial performance drops on culture-centric questions compared to the A-OKVQA dataset, a commonsense and world-knowledge VQA benchmark comprising approximately 25,000 questions. Our findings underscore the importance of cultural diversity in VQA datasets and reveal substantial gaps in the ability of current VQA models to handle culturally rich contexts. SEA-VQA serves as a crucial benchmark for identifying these gaps and guiding future improvements in VQA systems.
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
|
Rahmad Mahendra
|
Salsabil Maulana Akbar
|
Lester James V. Miranda
|
Jennifer Santoso
|
Elyanah Aco
|
Akhdan Fadhilah
|
Jonibek Mansurov
|
Joseph Marvin Imperial
|
Onno P. Kampman
|
Joel Ruben Antony Moniz
|
Muhammad Ravi Shulthan Habibi
|
Frederikus Hudi
|
Railey Montalan
|
Ryan Ignatius
|
Joanito Agili Lopo
|
William Nixon
|
Börje F. Karlsson
|
James Jaya
|
Ryandito Diandaru
|
Yuze Gao
|
Patrick Amadeus
|
Bin Wang
|
Jan Christian Blaise Cruz
|
Chenxi Whitehouse
|
Ivan Halim Parmonangan
|
Maria Khelli
|
Wenyu Zhang
|
Lucky Susanto
|
Reynard Adha Ryanda
|
Sonny Lazuardi Hermawan
|
Dan John Velasco
|
Muhammad Dehan Al Kautsar
|
Willy Fitra Hendria
|
Yasmin Moslem
|
Noah Flynn
|
Muhammad Farid Adilazuarda
|
Haochen Li
|
Johanes Lee
|
R. Damanhuri
|
Shuo Sun
|
Muhammad Reza Qorib
|
Amirbek Djanibekov
|
Wei Qi Leong
|
Quyet V. Do
|
Niklas Muennighoff
|
Tanrada Pansuwan
|
Ilham Firdausi Putra
|
Yan Xu
|
Tai Ngee Chia
|
Ayu Purwarianti
|
Sebastian Ruder
|
William Tjhi
|
Peerat Limkonchotiwat
|
Alham Fikri Aji
|
Sedrick Keh
|
Genta Indra Winata
|
Ruochen Zhang
|
Fajri Koto
|
Zheng-Xin Yong
|
Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.
An Empirical Study of Multilingual Reasoning Distillation for Question Answering
Patomporn Payoungkhamdee
|
Peerat Limkonchotiwat
|
Jinheon Baek
|
Potsawee Manakul
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reasoning is one crucial capability in Large Language Models (LLMs), allowing them to perform complex tasks such as solving math problems and multi-step planning. While reasoning capability can emerge in larger models, smaller ones usually have to rely on distillation to transfer this capability from a larger model. However, recent efforts to distill reasoning capabilities have focused mainly on English, leaving multilingual distillation underexplored. To address this gap, this paper examines existing English reasoning distillation methods that utilize a variety of positive rationales in multilingual settings and proposes d-CoT-nR, a novel approach that incorporates incorrect rationales as additional guidance. Empirical results from multilingual high-school examinations show that d-CoT-nR significantly surpasses the baseline, improving accuracy in unseen languages and correctness in step-by-step reasoning.
Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning
Panuthep Tasawong
|
Peerat Limkonchotiwat
|
Potsawee Manakul
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Entity disambiguation (ED) is crucial in natural language processing (NLP) for tasks such as question-answering and information extraction. A major challenge in ED is handling overshadowed entities—uncommon entities sharing mention surfaces with common entities. The current approach to enhance performance on these entities involves reasoning over facts in a knowledge base (KB), increasing computational overhead during inference. We argue that the ED performance on overshadowed entities can be enhanced during training by addressing shortcut learning, which does not add computational overhead at inference. We propose a simple yet effective debiasing technique to prevent models from shortcut learning during training. Experiments on a range of ED datasets show that our method achieves state-of-the-art performance without compromising inference speed. Our findings suggest a new research direction for improving entity disambiguation via shortcut learning mitigation.
Space Decomposition for Sentence Embedding
Wuttikorn Ponwitayarat
|
Peerat Limkonchotiwat
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2024
Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is sentence embedding, typically evaluated on a continuous semantic textual similarity (STS) scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that a score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach that treats the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP, which utilizes a Mixture of Specialized Projectors designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP significantly reduces the representation overlap between the upper-range and lower-range classes while outperforming competitors on STS and zero-shot benchmarks.
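A simplified PyTorch reading of the mixture-of-specialized-projectors idea; the router, projector shapes, and weighting scheme are assumptions for illustration, not the authors' implementation.

```python
# Sketch: a router softly assigns each sentence embedding to an upper-range
# or lower-range projector, and the outputs are mixed before scoring.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixSPSketch(nn.Module):
    def __init__(self, dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))
        self.upper_proj = nn.Linear(dim, proj_dim)  # intended for STS scores in [4, 5]
        self.lower_proj = nn.Linear(dim, proj_dim)  # intended for STS scores in [0, 4)

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        weights = self.router(sent_emb)                              # (batch, 2)
        projections = torch.stack(
            [self.upper_proj(sent_emb), self.lower_proj(sent_emb)], dim=1
        )                                                            # (batch, 2, proj_dim)
        return (weights.unsqueeze(-1) * projections).sum(dim=1)

model = MixSPSketch()
a, b = torch.randn(4, 768), torch.randn(4, 768)
similarity = F.cosine_similarity(model(a), model(b))                 # pair similarity
```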
Identifying and Mitigating Annotation Bias in Natural Language Understanding using Causal Mediation Analysis
Sitiporn Sae Lim
|
Can Udomcharoenchaikit
|
Peerat Limkonchotiwat
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2024
NLU models have achieved promising results on standard benchmarks. Despite state-of-the-art accuracy, analysis reveals that many models make predictions using annotation bias rather than the properties we intend the model to learn. Consequently, these models perform poorly on out-of-distribution datasets. Recent advances in bias mitigation show that annotation bias can be alleviated through fine-tuning with debiasing objectives. In this paper, we apply causal mediation analysis to gauge how much each model component mediates annotation biases. Using the knowledge from the causal analysis, we improve the model’s robustness against annotation bias through two bias mitigation methods: causal-grounded masking and gradient unlearning. The causal analysis reveals that biases are concentrated in specific components, even after employing other training-time debiasing techniques. Manipulating these components, by masking out neurons’ activations or updating specific weight blocks, demonstrably improves robustness against annotation artifacts.
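As an illustration of the masking intervention described above, the snippet below zeroes selected neuron activations in a chosen component via a PyTorch forward hook; the module and neuron indices are hypothetical, and the causal-mediation procedure for selecting them is not shown.

```python
# Illustrative only: mask out selected neuron activations in one component
# of a network (e.g. an MLP block) using a forward hook. Which neurons to
# mask would come from the causal mediation analysis, omitted here.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
biased_neurons = [10, 42, 77]                  # hypothetical indices

def mask_activations(module, inputs, output):
    output = output.clone()
    output[..., biased_neurons] = 0.0          # zero out the suspect neurons
    return output                              # the hook's return value replaces the output

handle = model[0].register_forward_hook(mask_activations)
_ = model(torch.randn(2, 768))                 # forward pass runs with the masked component
handle.remove()
```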
McCrolin: Multi-consistency Cross-lingual Training for Retrieval Question Answering
Peerat Limkonchotiwat
|
Wuttikorn Ponwitayarat
|
Lalita Lowphansirikul
|
Potsawee Manakul
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2024
Automated question answering (QA) systems are increasingly relying on robust cross-lingual retrieval to identify and utilize information from multilingual sources, ensuring comprehensive and contextually accurate responses. Existing approaches often struggle with consistency across multiple languages and multi-size input scenarios. To address these challenges, we propose McCrolin, a Multi-consistency Cross-lingual training framework, leveraging multi-task learning to enhance cross-lingual consistency, ranking stability, and input-size robustness. Experimental results demonstrate that McCrolin achieves state-of-the-art performance on standard cross-lingual retrieval QA datasets. Furthermore, McCrolin outperforms competitors when dealing with various input sizes on downstream tasks. In terms of generalizability, results from further analysis show that our method is effective for various encoder architectures and sizes.
On Creating an English-Thai Code-switched Machine Translation in Medical Domain
Parinthapat Pengpun
|
Krittamate Tiankanon
|
Amrest Chinkamol
|
Jiramet Kinchagawat
|
Pitchaya Chairuengjitjaras
|
Pasit Supholkhan
|
Pubordee Aussavavirojekul
|
Chiraphat Boonnag
|
Kanyakorn Veerakanjana
|
Hirunkul Phimsiri
|
Boonthicha Sae-jia
|
Nattawach Sataudom
|
Piyalitt Ittichaiwong
|
Peerat Limkonchotiwat
Findings of the Association for Computational Linguistics: EMNLP 2024
Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation result also shows that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even if it slightly compromises fluency. Our code and test set are publicly available https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024.
CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
Wannaphong Phatthiyaphaibun
|
Surapon Nonesung
|
Peerat Limkonchotiwat
|
Can Udomcharoenchaikit
|
Jitkapat Sawatphol
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture nuanced and diverse responses. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE, incorporating the aspects of Correctness, Helpfulness, Irrelevancy, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.
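To illustrate the binary multi-aspect scheme, here is a schematic judge loop; the aspect phrasings and `call_llm` are placeholders rather than the CHIE prompts.

```python
# Schematic multi-aspect binary judging: each aspect gets a Yes/No verdict
# instead of a continuous rating.

ASPECTS = {
    "correctness": "Does the answer correctly address the question given the context?",
    "helpfulness": "Does the answer add information that is genuinely helpful?",
    "irrelevancy": "Does the answer include information irrelevant to the question?",
    "extraneousness": "Does the answer include content not grounded in the context?",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real LLM judge

def judge(context: str, question: str, answer: str) -> dict:
    verdicts = {}
    for aspect, criterion in ASPECTS.items():
        prompt = (
            f"Context: {context}\nQuestion: {question}\nAnswer: {answer}\n"
            f"{criterion} Reply with exactly 'Yes' or 'No'."
        )
        verdicts[aspect] = call_llm(prompt).strip().lower().startswith("yes")
    return verdicts
```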
2023
Typo-Robust Representation Learning for Dense Retrieval
Panuthep Tasawong
|
Wuttikorn Ponwitayarat
|
Peerat Limkonchotiwat
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach to handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.
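A minimal sketch of the kind of training signal described above: an in-batch objective that aligns each misspelled query with its pristine version while contrasting it against the surrounding queries. The formulation is an assumption for illustration, not necessarily the paper's exact loss.

```python
# Sketch: diagonal terms align misspelled/pristine pairs; off-diagonal terms
# contrast each misspelled query against the other queries in the batch.

import torch
import torch.nn.functional as F

def typo_robust_loss(pristine_emb, misspelled_emb, temperature: float = 0.05):
    # Both inputs: (batch, dim) embeddings from the same encoder.
    p = F.normalize(pristine_emb, dim=-1)
    m = F.normalize(misspelled_emb, dim=-1)
    logits = m @ p.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(p.size(0))             # index of the matching pristine query
    return F.cross_entropy(logits, targets)

loss = typo_robust_loss(torch.randn(8, 768), torch.randn(8, 768))
```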
mReFinED: An Efficient End-to-End Multilingual Entity Linking System
Peerat Limkonchotiwat
|
Weiwei Cheng
|
Christos Christodoulopoulos
|
Amir Saffari
|
Jens Lehmann
Findings of the Association for Computational Linguistics: EMNLP 2023
End-to-end multilingual entity linking (MEL) is concerned with identifying multilingual entity mentions and their corresponding entity IDs in a knowledge base. Existing works assumed that entity mentions were given and skipped the entity mention detection step due to a lack of high-quality multilingual training corpora. To overcome this limitation, we propose mReFinED, the first end-to-end multilingual entity linking system. Additionally, we propose a bootstrapping mention detection framework that enhances the quality of training corpora. Our experimental results demonstrated that mReFinED outperformed the best existing work in the end-to-end MEL task while being 44 times faster.
Cross-Lingual Data Augmentation For Thai Question-Answering
Parinthapat Pengpun
|
Can Udomcharoenchaikit
|
Weerayut Buaphet
|
Peerat Limkonchotiwat
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
This paper presents an innovative data augmentation framework with data quality control designed to enhance the robustness of Question Answering (QA) models in low-resource languages, particularly Thai. Recognizing the challenges posed by the scarcity and quality of training data, we leverage data augmentation techniques in both monolingual and cross-lingual settings. Our approach augments and enriches the original dataset, thereby increasing its linguistic diversity and robustness. We evaluate the robustness of our framework on Machine Reading Comprehension, and the experimental results illustrate the potential of data augmentation to effectively increase training data and improve model generalization in low-resource language settings, offering a promising direction for future work on data augmentation.
PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun
|
Korakot Chaovavanich
|
Charin Polpanumas
|
Arthit Suriyawongkul
|
Lalita Lowphansirikul
|
Pattarawat Chormai
|
Peerat Limkonchotiwat
|
Thanathip Suntorntip
|
Can Udomcharoenchaikit
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first provide a brief historical context of tools for the Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provides, as well as its datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
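A short usage example, assuming `pip install pythainlp`; available engines and defaults may differ across versions.

```python
# Tokenize a Thai sentence with PyThaiNLP's dictionary-based segmenter.

from pythainlp.tokenize import word_tokenize

text = "ผมรักภาษาไทย"                        # "I love the Thai language"
tokens = word_tokenize(text, engine="newmm")  # "newmm" is the default engine
print(tokens)                                 # e.g. ['ผม', 'รัก', 'ภาษาไทย']
```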
SEA-LION (Southeast Asian Languages In One Network): A Family of Southeast Asian Language Models
David Ong
|
Peerat Limkonchotiwat
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
An Efficient Self-Supervised Cross-View Training For Sentence Embedding
Peerat Limkonchotiwat
|
Wuttikorn Ponwitayarat
|
Lalita Lowphansirikul
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Transactions of the Association for Computational Linguistics, Volume 11
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
2022
Thai Nested Named Entity Recognition Corpus
Weerayut Buaphet
|
Can Udomcharoenchaikit
|
Peerat Limkonchotiwat
|
Attapol Rutherford
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2022
This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from 4,894 documents in the domains of news articles and restaurant reviews. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtained two key findings. First, all models produced poor F1 scores in the tail region of the class distribution. Second, these models provide little or no performance improvement over the baseline methods on our Thai dataset. These findings suggest that further investigation is required to build a multilingual N-NER solution that works well across different languages.
CL-ReLKT: Cross-lingual Language Knowledge Transfer for Multilingual Retrieval Question Answering
Peerat Limkonchotiwat
|
Wuttikorn Ponwitayarat
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: NAACL 2022
Cross-Lingual Retrieval Question Answering (CL-ReQA) is concerned with retrieving answer documents or passages for a question written in a different language. A common approach to CL-ReQA is to create a multilingual sentence embedding space such that question-answer pairs across different languages are close to each other. In this paper, we propose a novel CL-ReQA method utilizing the concept of language knowledge transfer and a new cross-lingual consistency training technique to create a multilingual embedding space for ReQA. To assess the effectiveness of our work, we conducted comprehensive experiments on CL-ReQA and a downstream task, machine reading QA. We compared our proposed method with the current state-of-the-art solutions across three public CL-ReQA corpora. Our method outperforms competitors in 19 out of 21 settings of CL-ReQA. When used with a downstream machine reading QA task, our method outperforms the best existing language-model-based method by 10% in F1 while being 10 times faster in sentence embedding computation. The code and models are available at https://github.com/mrpeerat/CL-ReLKT.
ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation
Peerat Limkonchotiwat
|
Wuttikorn Ponwitayarat
|
Lalita Lowphansirikul
|
Can Udomcharoenchaikit
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2022
Sentence representations are essential in many NLP tasks operating at the sentence level. Recently, research attention has shifted towards learning how to represent sentences without any annotations, i.e., unsupervised representation learning. Despite the benefit of training without supervised data, there is still a performance penalty compared to supervised methods. Furthermore, the supervised-unsupervised performance gap widens as we reduce the model size. In this paper, we propose an unsupervised sentence representation method to reduce the supervised-unsupervised performance gap, especially for smaller models. Utilizing the concept of knowledge distillation, we derive a distillation framework comprising two training objectives, control and generalize, called ConGen. Experiments on semantic textual similarity (STS), text classification (transfer), and natural language inference (NLI) tasks show that ConGen is on par with supervised training even on smaller models. Furthermore, our method consistently outperforms competitors on multilingual STS. The code and models are available at https://github.com/KornWtp/ConGen.
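One possible, heavily simplified reading of the two objectives: match a frozen teacher on the original sentence ("control") and stay consistent on an augmented view of it ("generalize"). This is an assumption for illustration, not the ConGen implementation.

```python
# Sketch: the student is trained to point at the teacher embedding of its own
# sentence, both from the original text and from an augmented view of it.

import torch
import torch.nn.functional as F

def distill_loss(teacher_orig, student_orig, student_aug, temperature: float = 0.05):
    # All inputs: (batch, dim) sentence embeddings.
    t = F.normalize(teacher_orig, dim=-1)
    s = F.normalize(student_orig, dim=-1)
    a = F.normalize(student_aug, dim=-1)
    targets = torch.arange(t.size(0))
    control = F.cross_entropy(s @ t.t() / temperature, targets)      # match the teacher
    generalize = F.cross_entropy(a @ t.t() / temperature, targets)   # stay robust to augmentation
    return control + generalize

loss = distill_loss(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768))
```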
2021
Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation
Peerat Limkonchotiwat
|
Wannaphong Phatthiyaphaibun
|
Raheem Sarwar
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval
Nattapol Trijakwanich
|
Peerat Limkonchotiwat
|
Raheem Sarwar
|
Wannaphong Phatthiyaphaibun
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2021
Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.
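A schematic of fragment-level scoring under assumed details: split each sentence into fixed-size fragments, embed them with any multilingual encoder (e.g. m-USE, LASER, or LaBSE), and aggregate the best fragment-to-fragment matches.

```python
# Sketch: retrieval granularity moves from whole sentences to fragments.

import torch.nn.functional as F

def to_fragments(sentence: str, size: int = 4):
    words = sentence.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [sentence]

def fragment_similarity(src: str, tgt: str, encode) -> float:
    # `encode` maps a list of strings to an (n, dim) tensor of embeddings.
    src_emb = F.normalize(encode(to_fragments(src)), dim=-1)
    tgt_emb = F.normalize(encode(to_fragments(tgt)), dim=-1)
    sims = src_emb @ tgt_emb.t()                   # fragment-by-fragment cosine
    return sims.max(dim=1).values.mean().item()    # average of the best matches
```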
2020
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
Peerat Limkonchotiwat
|
Wannaphong Phatthiyaphaibun
|
Raheem Sarwar
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and has a similar performance as the transfer learning method.
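A schematic filter-and-refine wrapper consistent with the black-box setting described above; the filter, refiner, and merging policy here are illustrative placeholders, not the paper's stacked-ensemble architecture.

```python
# Sketch: keep segmentation boundaries the filter trusts; merge the rest and
# hand them to a refiner trained on the target domain.

def filter_and_refine(text, base_segment, boundary_filter, refine):
    """
    base_segment:    black-box segmenter, str -> list of tokens
    boundary_filter: (left_token, right_token) -> True if the boundary looks reliable
    refine:          str -> list of tokens, re-segments uncertain spans
    """
    tokens = base_segment(text)
    output, buffer = [], tokens[:1]
    for prev, cur in zip(tokens, tokens[1:]):
        if boundary_filter(prev, cur):
            output.extend(refine("".join(buffer)) if len(buffer) > 1 else buffer)
            buffer = [cur]
        else:
            buffer.append(cur)     # unreliable boundary: merge and re-segment later
    output.extend(refine("".join(buffer)) if len(buffer) > 1 else buffer)
    return output
```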