2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics
Jiarui Liu | Yueqi Song | Yunze Xiao | Mingqian Zheng | Lindia Tjuatja | Jana Schaich Borg | Mona T. Diab | Maarten Sap
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior. We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas. Using a 6-dimensional persona space (age, gender, country, social class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence. Persuasive success varies across traits, with liberal and open personalities reaching higher consensus. While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time. These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning.
Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Jean De Dieu Nyandwi | Yueqi Song | Simran Khanuja | Graham Neubig
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of +5.0% without degrading results on mainstream vision–language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song | Simran Khanuja | Graham Neubig
Findings of the Association for Computational Linguistics: NAACL 2025
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V currently obtain the best performance on this task, while open models lag behind. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis of model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions: a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting the open models LLaVA-v1.5-13B by 13.4%, LLaVA-v1.6-34B by 20.3%, and Qwen-VL by 16.7%, while also slightly improving GPT-4V’s performance.
Beyond Browsing: API-Based Web Agents
Yueqi Song | Frank F. Xu | Shuyan Zhou | Graham Neubig
Findings of the Association for Computational Linguistics: ACL 2025
Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask – *what if we were to take tasks traditionally tackled by Browsing Agents, and give AI agents access to APIs*? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-Based Agents outperform web Browsing Agents. Hybrid Agents outperform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
2024
An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
Simran Khanuja | Sathyanarayanan Ramamoorthy | Yueqi Song | Graham Neubig
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we introduce a new task of translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset – (i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image; and (ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our project webpage is here: https://machine-transcreation.github.io/image-transcreation and our code, data and model outputs can be found here: https://github.com/simran-khanuja/image-transcreation.
2023
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song | Simran Khanuja | Pengfei Liu | Fahim Faisal | Alissa Ostapenko | Genta Winata | Alham Fikri Aji | Samuel Cahyawijaya | Yulia Tsvetkov | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are the ones with a relatively high population, but nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.