2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya
|
Holy Lovenia
|
Joel Ruben Antony Moniz
|
Tack Hwa Wong
|
Mohammad Rifqi Farhansyah
|
Thant Thiri Maung
|
Frederikus Hudi
|
David Anugraha
|
Muhammad Ravi Shulthan Habibi
|
Muhammad Reza Qorib
|
Amit Agarwal
|
Joseph Marvin Imperial
|
Hitesh Laxmichand Patel
|
Vicky Feliren
|
Bahrul Ilmi Nasution
|
Manuel Antonio Rufino
|
Genta Indra Winata
|
Rian Adam Rajagede
|
Carlos Rafael Catalan
|
Mohamed Fazli Mohamed Imam
|
Priyaranjan Pattnayak
|
Salsabila Zahirah Pranida
|
Kevin Pratama
|
Yeshil Bangera
|
Adisai Na-Thalang
|
Patricia Nicole Monderin
|
Yueqi Song
|
Christian Simon
|
Lynnette Hui Xian Ng
|
Richardy Lobo Sapan
|
Taki Hasan Rafi
|
Bin Wang
|
Supryadi
|
Kanyakorn Veerakanjana
|
Piyalitt Ittichaiwong
|
Matthew Theodore Roque
|
Karissa Vincentio
|
Takdanai Kreangphet
|
Phakphum Artkaew
|
Kadek Hendrawan Palgunadi
|
Yanzhi Yu
|
Rochana Prih Hastuti
|
William Nixon
|
Mithil Bangera
|
Adrian Xuan Wei Lim
|
Aye Hninn Khine
|
Hanif Muhammad Zhafran
|
Teddy Ferdinan
|
Audra Aurora Izzani
|
Ayushman Singh
|
Evan Evan
|
Jauza Akbar Krito
|
Michael Anugraha
|
Fenal Ashokbhai Ilasariya
|
Haochen Li
|
John Amadeo Daniswara
|
Filbert Aurelian Tjiaranata
|
Eryawan Presma Yulianrifat
|
Can Udomcharoenchaikit
|
Fadil Risdian Ansori
|
Mahardika Krisna Ihsani
|
Giang Nguyen
|
Anab Maulana Barik
|
Dan John Velasco
|
Rifo Ahmad Genadi
|
Saptarshi Saha
|
Chengwei Wei
|
Isaiah Edri W. Flores
|
Kenneth Chen Ko Han
|
Anjela Gail D. Santos
|
Wan Shen Lim
|
Kaung Si Phyo
|
Tim Santos
|
Meisyarah Dwiastuti
|
Jiayun Luo
|
Jan Christian Blaise Cruz
|
Ming Shan Hee
|
Ikhlasul Akmal Hanif
|
M.Alif Al Hakim
|
Muhammad Rizky Sya’ban
|
Kun Kerdthaisong
|
Lester James Validad Miranda
|
Fajri Koto
|
Tirana Noor Fatyanosa
|
Alham Fikri Aji
|
Jostin Jerico Rosal
|
Jun Kevin
|
Robert Wijaya
|
Onno P. Kampman
|
Ruochen Zhang
|
Börje F. Karlsson
|
Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant, high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation, and we evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million culturally relevant SEA images, a collection more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
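As a concrete illustration of the crawl-and-filter route described above, the sketch below gathers candidate images via culture-specific queries and keeps only those passing an image-text relevance threshold. The keyword list, the `search_images` and `clip_similarity` helpers, and the 0.3 threshold are all hypothetical stand-ins, not the actual SEA-VL pipeline.

```python
# Hypothetical sketch: crawl candidate images with culture-specific keywords,
# then keep only those whose image-caption similarity passes a threshold.
# search_images() and clip_similarity() are placeholder callables supplied by
# the caller; they are not SEA-VL code.

SEA_KEYWORDS = ["wayang kulit", "songkran festival", "jeepney", "kaya toast"]

def collect_candidates(search_images, keywords=SEA_KEYWORDS, per_keyword=100):
    """Gather (url, caption) pairs for each culture-related query."""
    candidates = []
    for kw in keywords:
        candidates.extend(search_images(query=kw, limit=per_keyword))
    return candidates

def filter_culturally_relevant(candidates, clip_similarity, threshold=0.3):
    """Keep images whose caption-image similarity exceeds the threshold."""
    return [
        (url, caption)
        for url, caption in candidates
        if clip_similarity(url, caption) >= threshold
    ]
```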
MERaLiON-AudioLLM: Advancing Speech and Language Understanding for Singapore
Yingxu He
|
Zhuohan Liu
|
Geyu Lin
|
Shuo Sun
|
Bin Wang
|
Wenyu Zhang
|
Xunlong Zou
|
Nancy F. Chen
|
AiTi Aw
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce MERaLiON-AudioLLM, the first general-purpose audio-based large language model designed for multitask learning, with a particular focus on Singlish understanding. Trained on 62 million multimodal instruction samples comprising a total of 260k hours of audio, it exhibits strong generalization across a diverse set of tasks, including, but not limited to, automatic speech recognition, spoken question answering, speech translation, and paralinguistic analysis. Our results show significant improvements in local speech recognition and task-specific understanding, making MERaLiON-AudioLLM a leading solution for region-specific AI applications. An interactive demo has been developed to enable user-friendly interactions, supported by a backend with customized caching and load-balancing mechanisms. We benchmark the model across a broad range of multilingual and multitask scenarios, where it demonstrates competitive performance compared to other open-source models. The demo page, model weights, and videos are publicly accessible.
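The abstract mentions a demo backend with customized caching and load balancing; below is a minimal sketch of one way such a serving layer could be organized, with a response cache in front of round-robin dispatch. The `DemoBackend` class and its interface are hypothetical and do not reflect the actual MERaLiON-AudioLLM implementation.

```python
import hashlib
from itertools import cycle

class DemoBackend:
    """Toy round-robin load balancer with an in-memory response cache.
    Purely illustrative; not the MERaLiON-AudioLLM serving code."""

    def __init__(self, workers):
        # `workers` is a list of callables: worker(audio_bytes, prompt) -> str
        self._workers = cycle(workers)   # round-robin over inference workers
        self._cache = {}                 # request fingerprint -> cached response

    def query(self, audio_bytes, prompt):
        key = hashlib.sha256(audio_bytes + prompt.encode()).hexdigest()
        if key in self._cache:           # serve repeated requests from the cache
            return self._cache[key]
        worker = next(self._workers)     # dispatch to the next worker in turn
        response = worker(audio_bytes, prompt)
        self._cache[key] = response
        return response
```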
CoinMath: Harnessing the Power of Coding Instruction for Math LLM
Chengwei Wei
|
Bin Wang
|
Jung-jae Kim
|
Guimei Liu
|
Nancy F. Chen
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have shown strong performance in solving mathematical problems, with code-based solutions proving particularly effective. However, best practices for leveraging coding instruction data to enhance mathematical reasoning remain underexplored. This study investigates three key questions: (1) How do different coding styles of mathematical code-based rationales impact LLMs’ learning performance? (2) Can general-domain coding instructions improve performance? (3) How does integrating textual rationales with code-based ones during training enhance mathematical reasoning abilities? Our findings reveal that code-based rationales with concise comments, descriptive naming, and hardcoded solutions are beneficial, while improvements from general-domain coding instructions and textual rationales are relatively minor. Based on these insights, we propose CoinMath, a learning strategy designed to enhance mathematical reasoning by diversifying the coding styles of code-based rationales. CoinMath generates a variety of code-based rationales incorporating concise comments, descriptive naming conventions, and hardcoded solutions. Experimental results demonstrate that CoinMath significantly outperforms its baseline model, MAmmoTH, one of the state-of-the-art math LLMs.
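To make the notion of “coding styles of code-based rationales” concrete, the snippet below contrasts two hypothetical rationales for the same word problem: one using the style the abstract reports as beneficial (concise comments, descriptive naming, hardcoded values) and one terse, uncommented version. The problem and functions are illustrative and not drawn from the CoinMath data.

```python
# Problem: "A shop sells pens at $3 each. How much do 12 pens cost?"

# Style reported as helpful: concise comments, descriptive names, hardcoded values.
def total_cost_of_pens():
    price_per_pen = 3                            # dollars per pen
    number_of_pens = 12                          # pens purchased
    return price_per_pen * number_of_pens        # total cost in dollars

# Contrasting style: terse names, no comments.
def f(a, b):
    return a * b

assert total_cost_of_pens() == f(3, 12) == 36
```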
AudioBench: A Universal Benchmark for Audio Large Language Models
Bin Wang
|
Xunlong Zou
|
Geyu Lin
|
Shuo Sun
|
Zhuohan Liu
|
Wenyu Zhang
|
Zhengyuan Liu
|
AiTi Aw
|
Nancy F. Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, 7 of which are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice (paralinguistic) understanding. Despite recent advancements, there is no comprehensive benchmark for evaluating the instruction-following capabilities of AudioLLMs conditioned on audio signals. AudioBench addresses this gap by providing the necessary datasets along with the corresponding evaluation metrics. In addition, we evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model development.
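A benchmark of this kind ultimately reduces to a loop over task-specific datasets and metrics. The sketch below shows that general shape with hypothetical `model` and `datasets` interfaces; it is not the actual AudioBench toolkit API.

```python
# Hypothetical evaluation loop in the spirit of an AudioLLM benchmark harness.
# `datasets` maps a task name to (examples, metric_fn); `model` is a callable
# (audio, instruction) -> text. Neither reflects the real AudioBench API.

def evaluate(model, datasets):
    scores = {}
    for task_name, (examples, metric_fn) in datasets.items():
        predictions, references = [], []
        for example in examples:
            predictions.append(model(example["audio"], example["instruction"]))
            references.append(example["reference"])
        scores[task_name] = metric_fn(predictions, references)
    return scores
```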
2024
CRAFT: Extracting and Tuning Cultural Instructions from the Wild
Bin Wang
|
Geyu Lin
|
Zhengyuan Liu
|
Chengwei Wei
|
Nancy Chen
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
Large language models (LLMs) have rapidly evolved into the foundation of various natural language processing (NLP) applications. Despite their wide range of use cases, their understanding of culturally related concepts and reasoning remains limited. Meanwhile, there is a significant need to enhance these models’ cultural reasoning capabilities, especially for underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction generation. By integrating the extracted data with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
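The extraction idea can be pictured as a two-stage sketch: spot passages that mention culture-related concepts, then wrap each hit into a self-instruct style prompt for an LLM. The concept list, helper names, and prompt wording below are illustrative assumptions, not the CRAFT implementation.

```python
# Hypothetical two-stage sketch of a culture-focused extraction pipeline:
# (1) find passages that mention culture-related concepts, (2) turn each hit
# into a self-instruct style prompt for an LLM to write an instruction pair.

CULTURAL_CONCEPTS = ["hari raya", "hawker centre", "barong tagalog"]

def find_cultural_passages(corpus, concepts=CULTURAL_CONCEPTS):
    """Return passages that mention at least one cultural concept."""
    return [doc for doc in corpus if any(c in doc.lower() for c in concepts)]

def to_instruction_prompt(passage):
    """Wrap a passage into a prompt asking an LLM for a Q/A pair about it."""
    return (
        "Read the passage below and write one question and answer about the "
        "cultural concept it describes.\n\nPassage:\n" + passage
    )
```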
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
|
Rahmad Mahendra
|
Salsabil Maulana Akbar
|
Lester James V. Miranda
|
Jennifer Santoso
|
Elyanah Aco
|
Akhdan Fadhilah
|
Jonibek Mansurov
|
Joseph Marvin Imperial
|
Onno P. Kampman
|
Joel Ruben Antony Moniz
|
Muhammad Ravi Shulthan Habibi
|
Frederikus Hudi
|
Railey Montalan
|
Ryan Ignatius
|
Joanito Agili Lopo
|
William Nixon
|
Börje F. Karlsson
|
James Jaya
|
Ryandito Diandaru
|
Yuze Gao
|
Patrick Amadeus
|
Bin Wang
|
Jan Christian Blaise Cruz
|
Chenxi Whitehouse
|
Ivan Halim Parmonangan
|
Maria Khelli
|
Wenyu Zhang
|
Lucky Susanto
|
Reynard Adha Ryanda
|
Sonny Lazuardi Hermawan
|
Dan John Velasco
|
Muhammad Dehan Al Kautsar
|
Willy Fitra Hendria
|
Yasmin Moslem
|
Noah Flynn
|
Muhammad Farid Adilazuarda
|
Haochen Li
|
Johanes Lee
|
R. Damanhuri
|
Shuo Sun
|
Muhammad Reza Qorib
|
Amirbek Djanibekov
|
Wei Qi Leong
|
Quyet V. Do
|
Niklas Muennighoff
|
Tanrada Pansuwan
|
Ilham Firdausi Putra
|
Yan Xu
|
Tai Ngee Chia
|
Ayu Purwarianti
|
Sebastian Ruder
|
William Tjhi
|
Peerat Limkonchotiwat
|
Alham Fikri Aji
|
Sedrick Keh
|
Genta Indra Winata
|
Ruochen Zhang
|
Fajri Koto
|
Zheng-Xin Yong
|
Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of SEA text, image, and audio datasets, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative movement and comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.
In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models
Ayrton San Joaquin
|
Bin Wang
|
Zhengyuan Liu
|
Nicholas Asher
|
Brian Lim
|
Philippe Muller
|
Nancy F. Chen
Findings of the Association for Computational Linguistics: EMNLP 2024
Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to their extensive parameter counts and the substantial data requirements for model generalization. Access to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model’s internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization that computes influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to the instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meanwhile, using influence functions to analyze a model’s coverage of certain test samples could provide a reliable and interpretable signal of the training set’s coverage of those test points.
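For context, the classic first-order influence-function estimate (in the style of Koh and Liang, 2017) of how up-weighting a training point z changes the loss on a test point z_test is shown below; restricting the gradients and Hessian to a reduced set of layers, as the abstract's efficiency optimization suggests, shrinks every term in this expression. This is the textbook formulation, not necessarily the exact variant used in In2Core.

```latex
% First-order influence of up-weighting a training point z on the test loss,
% with H_{\hat{\theta}} the Hessian of the training loss at the learned parameters.
\mathcal{I}(z, z_{\text{test}})
  \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
        \, H_{\hat{\theta}}^{-1} \,
        \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta^{2} L(z_i, \hat{\theta}).
```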
Resilience of Large Language Models for Noisy Instructions
Bin Wang
|
Chengwei Wei
|
Zhengyuan Liu
|
Geyu Lin
|
Nancy F. Chen
Findings of the Association for Computational Linguistics: EMNLP 2024
In the rapidly advancing domain of natural language processing (NLP), large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks. Nonetheless, the resilience of LLMs in handling text containing inherent errors, stemming from human interactions and collaborative systems, has not been thoroughly explored. Our study investigates the resilience of LLMs against five common types of disruption: 1) ASR (Automatic Speech Recognition) errors, 2) OCR (Optical Character Recognition) errors, 3) grammatical mistakes, 4) typographical errors, and 5) distractive content. We investigate how these models react by deliberately embedding these errors into instructions. Our findings reveal that while some LLMs show a degree of resistance to certain types of noise, their overall performance suffers significantly, emphasizing the importance of further investigation into enhancing model resilience. In response to the observed decline in performance, our study also evaluates a “re-pass” strategy, designed to purify the instructions of noise before the LLMs process them. Our analysis indicates that correcting noisy instructions, particularly for open-source LLMs, presents significant challenges.
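As an illustration of the experimental setup, the sketch below injects character-level typographical noise into an instruction and builds a “re-pass” style cleaning prompt. The noise model, rates, and prompt wording are simplified stand-ins rather than the paper's exact procedure.

```python
import random

def add_typos(instruction, typo_rate=0.05, seed=0):
    """Randomly drop or swap characters to simulate typographical noise."""
    rng = random.Random(seed)
    chars = list(instruction)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < typo_rate:
            if rng.random() < 0.5:
                del chars[i]                                     # dropped character
            else:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swapped pair
        i += 1
    return "".join(chars)

def repass_prompt(noisy_instruction):
    """Ask the model to clean the instruction before it is answered."""
    return (
        "The following instruction may contain typos or ASR/OCR errors. "
        "Rewrite it as a clean, corrected instruction without answering it:\n\n"
        + noisy_instruction
    )
```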
2023
Compounding Geometric Operations for Knowledge Graph Completion
Xiou Ge
|
Yun Cheng Wang
|
Bin Wang
|
C.-C. Jay Kuo
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Geometric transformations, including translation, rotation, and scaling, are commonly used operations in image processing, and some of them have been successfully used in developing effective knowledge graph embedding (KGE) models. Inspired by this synergy, we propose a new KGE model that leverages all three operations. Since the translation, rotation, and scaling operations are cascaded to form a composite one, the new model is named CompoundE. By casting CompoundE in the framework of group theory, we show that quite a few distance-based KGE models are special cases of CompoundE. CompoundE extends simple distance-based scoring functions to relation-dependent compound operations on head and/or tail entities. To demonstrate the effectiveness of CompoundE, we perform three prevalent KG prediction tasks, namely link prediction, path query answering, and entity typing, on a range of datasets. CompoundE consistently outperforms existing models, demonstrating its effectiveness and flexibility.
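To make the compounding idea concrete, one representative scoring form applies relation-specific translation, rotation, and scaling operators to the head entity before measuring the distance to the tail, as sketched below; this is a single variant consistent with the abstract's description rather than the full family of CompoundE scoring functions.

```latex
% One compound (translation-rotation-scaling) scoring variant acting on the head entity:
f_r(h, t) \;=\; \bigl\| \, \mathbf{T}_r \, \mathbf{R}_r \, \mathbf{S}_r \, \mathbf{h} \;-\; \mathbf{t} \, \bigr\|,
% where T_r, R_r, S_r are the relation-specific translation, rotation, and scaling
% operators, and h, t are the head and tail entity embeddings.
```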
GreenKGC: A Lightweight Knowledge Graph Completion Method
Yun Cheng Wang
|
Xiou Ge
|
Bin Wang
|
C.-C. Jay Kuo
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge graph completion (KGC) aims to discover missing relationships between entities in knowledge graphs (KGs). Most prior KGC work focuses on learning embeddings for entities and relations through a simple scoring function. However, a higher-dimensional embedding space is usually required for better reasoning capability, which leads to a larger model size and hinders applicability to real-world problems (e.g., large-scale KGs or mobile/edge computing). A lightweight modularized KGC solution, called GreenKGC, is proposed in this work to address this issue. GreenKGC consists of three modules, representation learning, feature pruning, and decision learning, which extract discriminant KG features and make accurate predictions on missing relationships using classifiers and negative sampling. Experimental results demonstrate that, in low dimensions, GreenKGC can outperform SOTA methods on most datasets. In addition, low-dimensional GreenKGC can achieve competitive or even better performance than high-dimensional models with a much smaller model size.
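The three-module design can be pictured as a small pipeline: reduce pretrained triple features to a low dimension (feature pruning), then fit a binary plausibility classifier on positive triples plus sampled negatives (decision learning). The sketch below uses PCA and logistic regression as illustrative stand-ins; it is not the actual GreenKGC code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def triple_features(ent_emb, rel_emb, triples):
    """Concatenate head, relation, and tail embeddings as triple features."""
    return np.hstack([ent_emb[triples[:, 0]],
                      rel_emb[triples[:, 1]],
                      ent_emb[triples[:, 2]]])

def train_low_dim_classifier(ent_emb, rel_emb, pos_triples, n_entities,
                             dim=32, negatives_per_positive=1, seed=0):
    """Prune features to a low dimension, then fit a plausibility classifier."""
    rng = np.random.default_rng(seed)
    # Negative sampling: corrupt the tail entity of each positive triple.
    neg_triples = np.repeat(pos_triples.copy(), negatives_per_positive, axis=0)
    neg_triples[:, 2] = rng.integers(0, n_entities, size=len(neg_triples))

    X = triple_features(ent_emb, rel_emb, np.vstack([pos_triples, neg_triples]))
    y = np.concatenate([np.ones(len(pos_triples)), np.zeros(len(neg_triples))])

    pruner = PCA(n_components=dim).fit(X)      # stand-in for "feature pruning"
    clf = LogisticRegression(max_iter=1000).fit(pruner.transform(X), y)
    return pruner, clf
```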
2022
Just Rank: Rethinking Evaluation with Word and Sentence Similarities
Bin Wang
|
C.-C. Jay Kuo
|
Haizhou Li
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Word and sentence embeddings are useful feature representations in natural language processing. However, intrinsic evaluation for embeddings lags far behind, and there has been no significant update since the past decade. Word and sentence similarity tasks have become the de facto evaluation method. It leads models to overfit to such evaluations, negatively impacting embedding models’ development. This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations. Further, we propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks. Extensive experiments are conducted based on 60+ models and popular datasets to certify our judgments. Finally, the practical evaluation toolkit is released for future benchmarking purposes.