Chong-Wah Ngo
2025
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
Burak Satar | Zhixin Ma | Patrick Amadeus Irawan | Wilfried Ariel Mulyawan | Jing Jiang | Ee-Peng Lim | Chong-Wah Ngo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal vision-language models (VLMs) have made substantial progress on tasks that require a combined understanding of visual and textual content, particularly cultural understanding, aided by the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option in multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: options from the same country, from different countries, or a mixed group; in every type, all options are drawn from a single category. Progression to the second stage occurs only after a correct visual option is chosen. SCB comprises 1,065 images capturing 138 cultural artifacts across five categories from seven Southeast Asian countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities of cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. SCB thus serves as a crucial benchmark for identifying these shortcomings and guiding future work on cultural reasoning. Code: https://github.com/buraksatar/SeeingCulture
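The two-stage protocol pairs a multiple-choice accuracy with a segmentation-quality score, where only stage-1-correct items proceed to stage 2. Below is a minimal sketch of that scoring logic; the record fields ("question", "options", "answer", "gt_mask") and the predictor callables are hypothetical placeholders, not SCB's actual interface (see the linked repository for the real evaluation code).

```python
# Minimal sketch of a two-stage SCB-style evaluation. Field names and
# predictor callables are illustrative assumptions, not the benchmark's API.
import numpy as np

def stage1_accuracy(records, predict_option):
    """Stage 1: multiple-choice VQA. Returns accuracy and the correct subset."""
    correct = [r for r in records
               if predict_option(r["question"], r["options"]) == r["answer"]]
    return len(correct) / len(records), correct

def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union between two boolean segmentation masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def stage2_grounding(correct_records, predict_mask):
    """Stage 2: segmentation as evidence, run only on stage-1-correct items."""
    ious = [mask_iou(predict_mask(r["image"], r["question"]), r["gt_mask"])
            for r in correct_records]
    return float(np.mean(ious)) if ious else 0.0
```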
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
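Since the benchmark's core tasks are identifying dish names and their origins across 30 languages, a natural way to report results is exact-match accuracy broken down by language and task. Here is a minimal sketch of such a breakdown, assuming a hypothetical record schema ("lang", "task", "question", "image", "answer"); this is not the benchmark's released evaluation harness.

```python
# Illustrative per-(language, task) exact-match scoring for a
# WorldCuisines-style VQA split; the record schema is an assumption.
from collections import defaultdict

def score_by_language(records, predict):
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:  # r["task"] might be e.g. "dish_name" or "origin"
        key = (r["lang"], r["task"])
        totals[key] += 1
        pred = predict(r["image"], r["question"])
        if pred.strip().lower() == r["answer"].strip().lower():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```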
2023
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Zhijian Hou | Wanjun Zhong | Lei Ji | Difei Gao | Kun Yan | W.K. Chan | Chong-Wah Ngo | Mike Zheng Shou | Nan Duan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in equally high demand but far less explored, bringing new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency, as the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE.
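To make the coarse-to-fine idea concrete, here is a conceptual sketch of a sliding-window pipeline in the spirit of CONE: windows are first ranked cheaply against the query, and an off-the-shelf VTG model is then run only inside the top-ranked windows. The feature shapes, the mean-pooled similarity heuristic, and the vtg_model interface are placeholders, not the released implementation.

```python
# Conceptual coarse-to-fine sliding-window grounding, in the spirit of CONE.
# Shapes and the vtg_model interface are assumptions, not the official code.
import torch

def coarse_to_fine(video_feats, query_feat, vtg_model,
                   window=128, stride=64, top_k=5):
    # 1) Slice the long video into overlapping windows.
    starts = range(0, max(1, video_feats.size(0) - window + 1), stride)
    windows = [(s, video_feats[s:s + window]) for s in starts]
    # 2) Coarse stage: rank windows by mean-pooled query similarity, keep top-k.
    scores = torch.stack([w.mean(dim=0) @ query_feat for _, w in windows])
    keep = scores.topk(min(top_k, len(windows))).indices.tolist()
    # 3) Fine stage: run the VTG model only inside selected windows and
    #    shift local moment boundaries back to global timestamps.
    proposals = []
    for i in keep:
        start, w = windows[i]
        for t0, t1, conf in vtg_model(w, query_feat):  # local (start, end, score)
            proposals.append((start + t0, start + t1, conf))
    return sorted(proposals, key=lambda p: -p[2])
```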
2010
Co-authors
- Patrick Amadeus Irawan 2
- Daud Abolade 1
- David Ifeoluwa Adelani 1
- Aulia Adila 1
- Muhammad Farid Adilazuarda 1
- Alham Fikri Aji 1
- Afifa Amriani 1
- David Anugraha 1
- Michael Anugraha 1
- Rio Alexander Audino 1
- Samuel Cahyawijaya 1
- W.K. Chan 1
- Emmanuele Chersoni 1
- Tat-Seng Chua 1
- Jan Christian Blaise Cruz 1
- Raj Dabre 1
- Anirban Das 1
- Nan Duan 1
- Difei Gao 1
- Yinxuan Gui 1
- Zhijian Hou 1
- Frederikus Hudi 1
- Fariz Ikhwantri 1
- Lei Ji 1
- Jing Jiang 1
- Garry Kuwanto 1
- Cheng Ching Lam 1
- En-Shiun Annie Lee 1
- Ee-Peng Lim 1
- Peerat Limkonchotiwat 1
- Holy Lovenia 1
- Zhixin Ma 1
- Candy Olivia Mawalim 1
- Wilfried Ariel Mulyawan 1
- Junho Myung 1
- Adam Nohejl 1
- Alice Oh 1
- Shogo Okada 1
- Nedjma Ousidhoum 1
- Ashmari Pramodya 1
- Ubaidillah Ariq Prathama 1
- Ayu Purwarianti 1
- Jan Wira Gotama Putra 1
- Rifki Afina Putri 1
- Maria Angelica Riera Machin 1
- Anar Rzayev 1
- Stephanie Yulia Salim 1
- Natasha Christabelle Santosa 1
- Enrico Santus 1
- Burak Satar 1
- Mike Zheng Shou 1
- Lucky Susanto 1
- Gang Wang 1
- YongCheng Wang 1
- Taro Watanabe 1
- Haryo Akbarianto Wibowo 1
- Derry Tanti Wijaya 1
- Bryan Wilie 1
- Genta Indra Winata 1
- Kun Yan 1
- Wang Yutong 1
- Shi-Xiong Zhang 1
- Hanyang Zhao 1
- Wanjun Zhong 1
- Yi Zhou 1
- Marina Zhukova 1