2025
pdf
bib
abs
FineReason: Evaluating and Improving LLMs’ Deliberate Reasoning through Reflective Puzzle Solving
Guizhen Chen
|
Weiwen Xu
|
Hao Zhang
|
Hou Pong Chan
|
Chaoqun Liu
|
Lidong Bing
|
Deli Zhao
|
Anh Tuan Luu
|
Yu Rong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the “System 1” way of quick reactions to the “System 2” style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model’s intermediate reasoning steps unexamined. This fails to assess the model’s ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for systematic evaluation of LLMs’ reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing general reasoning. We show that models trained on our state checking and transition data demonstrate gains in mathematical reasoning by up to 5.1%.
pdf
bib
abs
COF: Adaptive Chain of Feedback for Comparative Opinion Quintuple Extraction
Qingting Xu
|
Kaisong Song
|
Chaoqun Liu
|
Yangyang Kang
|
Xiabing Zhou
|
Jun Lin
|
Yu Hong
Proceedings of the 31st International Conference on Computational Linguistics
Comparative Opinion Quintuple Extraction (COQE) aims to extract all comparative sentiment quintuples from product review text. Each quintuple comprises five elements: subject, object, aspect, opinion and preference. With the rise of Large Language Models (LLMs), existing work primarily focuses on enhancing the performance of COQE task through data augmentation, supervised fine-tuning and instruction tuning. Instead of the above pre-modeling and in-modeling design techniques, we focus on innovation in the post-processing. We introduce a model-unaware adaptive chain-of-feedback (COF) method from the perspective of inference feedback and extraction revision. This method comprises three core modules: dynamic example selection, self-critique and self-revision. By integrating LLMs, COF enables dynamic iterative self-optimization, making it applicable across different baselines. To validate the effectiveness of our approach, we utilize the outputs of two distinct baselines as inputs for COF: frozen parameters few-shot learning and the SOTA supervised fine-tuned model. We evaluate our approach on three benchmarks: Camera, Car and Ele. Experimental results show that, compared to the few-shot learning method, our approach achieves F1 score improvements of 3.51%, 2.65% and 5.28% for exact matching on the respective dataset. Even more impressively, our method further boosts performance, surpassing the current SOTA results, with additional gains of 0.76%, 6.54%, and 2.36% across the three datasets.
pdf
bib
abs
Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels
Chaoqun Liu
|
Qin Chao
|
Wenxuan Zhang
|
Xiaobao Wu
|
Boyang Li
|
Anh Tuan Luu
|
Lidong Bing
Proceedings of the 31st International Conference on Computational Linguistics
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. However, this paradigm is limited by the availability of gold labels, while in certain scenarios, LLMs may need to perform tasks that are too complex for humans to provide such labels. To tackle this challenge, this study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization. We iteratively prompt LLMs to annotate unlabeled data and retain high-quality labels by filtering. Surprisingly, we obverse that this iterative process gradually unlocks LLMs’ potential on downstream tasks. Our experiments on extensive classification and reasoning tasks confirm the effectiveness of our proposed framework. Our analysis indicates that this paradigm is effective for both in-context learning and fine-tuning, and for various model sizes.
pdf
bib
abs
SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
Chaoqun Liu
|
Wenxuan Zhang
|
Jiahao Ying
|
Mahani Aljunied
|
Anh Tuan Luu
|
Lidong Bing
Findings of the Association for Computational Linguistics: NAACL 2025
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
pdf
bib
abs
Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models
Chaoqun Liu
|
Wenxuan Zhang
|
Yiran Zhao
|
Anh Tuan Luu
|
Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated multilingual capabilities, yet they are mostly English-centric due to the imbalanced training corpora. While prior works have leveraged this bias to enhance multilingual performance through translation, they have been largely limited to natural language processing (NLP) tasks. In this work, we extend the evaluation to real-world user queries and non-English-centric LLMs, offering a broader examination of multilingual performance. Our key contribution lies in demonstrating that while translation into English can boost the performance of English-centric LLMs on NLP tasks, it is not universally optimal. For culture-related tasks that need deep language understanding, prompting in the native language proves more effective as it better captures the nuances of culture and language. Our experiments expose varied behaviors across LLMs and tasks in the multilingual context, underscoring the need for a more comprehensive approach to multilingual evaluation. Therefore, we call for greater efforts in developing and evaluating LLMs that go beyond English-centric paradigms.
pdf
bib
abs
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang
|
Hou Pong Chan
|
Yiran Zhao
|
Mahani Aljunied
|
Jianyu Wang
|
Chaoqun Liu
|
Yue Deng
|
Zhiqiang Hu
|
Weiwen Xu
|
Yew Ken Chia
|
Xin Li
|
Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.
2024
pdf
bib
abs
SeaLLMs - Large Language Models for Southeast Asia
Xuan-Phi Nguyen
|
Wenxuan Zhang
|
Xin Li
|
Mahani Aljunied
|
Zhiqiang Hu
|
Chenhui Shen
|
Yew Ken Chia
|
Xingxuan Li
|
Jianyu Wang
|
Qingyu Tan
|
Liying Cheng
|
Guanzheng Chen
|
Yue Deng
|
Sen Yang
|
Chaoqun Liu
|
Hang Zhang
|
Lidong Bing
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon popular English-centric models through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
2023
pdf
bib
abs
Zero-Shot Text Classification via Self-Supervised Tuning
Chaoqun Liu
|
Wenxuan Zhang
|
Guizhen Chen
|
Xiaobao Wu
|
Anh Tuan Luu
|
Chip Hong Chang
|
Lidong Bing
Findings of the Association for Computational Linguistics: ACL 2023
Existing solutions to zero-shot text classification either conduct prompting with pre-trained language models, which is sensitive to the choices of templates, or rely on large-scale annotated data of relevant tasks for meta-tuning. In this work, we propose a new paradigm based on self-supervised learning to solve zero-shot text classification tasks by tuning the language models with unlabeled data, called self-supervised tuning. By exploring the inherent structure of free texts, we propose a new learning objective called first sentence prediction to bridge the gap between unlabeled data and text classification tasks. After tuning the model to learn to predict the first sentence in a paragraph based on the rest, the model is able to conduct zero-shot inference on unseen tasks such as topic classification and sentiment analysis. Experimental results show that our model outperforms the state-of-the-art baselines on 7 out of 10 tasks. Moreover, the analysis reveals that our model is less sensitive to the prompt design. Our code and pre-trained models are publicly available at 
https://github.com/DAMO-NLP-SG/SSTuning.