Jauza Akbar Krito
2026
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Search
Fix author
Co-authors
- Salsabila Zahirah Pranida 2
- Manzura Abjalova 1
- Sopuruchi Christian Aboh 1
- Ágnes Abuczki 1
- Amit Agarwal 1
- Maha Tufail Agro 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Alham Fikri Aji 1
- Dina Almassova 1
- Diego Alves 1
- Fadil Risdian Ansori 1
- David Anugraha 1
- Michael Anugraha 1
- Doğukan Arslan 1
- Phakphum Artkaew 1
- Yeshil Bangera 1
- Mithil Bangera 1
- Anab Maulana Barik 1
- Samuel Cahyawijaya 1
- Aida Cardoso 1
- Carlos Rafael Catalan 1
- Maria Chatzigrigoriou 1
- Jan Christian Blaise Cruz 1
- John Amadeo Daniswara 1
- Kaja Dobrovoljc 1
- Meisyarah Dwiastuti 1
- Nilay Erdem Ayyıldız 1
- Doruk Eryiğit 1
- Gülşen Eryiğit 1
- Evan Evan 1
- Mohammad Rifqi Farhansyah 1
- Tirana Noor Fatyanosa 1
- Vicky Feliren 1
- Teddy Ferdinan 1
- Isaiah Edri W. Flores 1
- Radovan Garabik 1
- Rifo Ahmad Genadi 1
- Petra Giommarelli 1
- Voula Giouli 1
- Shahar Golan 1
- Muhammad Ravi Shulthan Habibi 1
- M.Alif Al Hakim 1
- Kenneth Chen Ko Han 1
- Ikhlasul Akmal Hanif 1
- Rochana Prih Hastuti 1
- Isabell Stinessen Haugen 1
- Wei He 1
- Ming Shan Hee 1
- Carlos Manuel Hidalgo-Ternero 1
- Nina Hosseini-Kivanani 1
- Frederikus Hudi 1
- Mahardika Krisna Ihsani 1
- Fenal Ashokbhai Ilasariya 1
- Mohamed Fazli Mohamed Imam 1
- Joseph Marvin Imperial 1
- Piyalitt Ittichaiwong 1
- Audra Aurora Izzani 1
- Shaoxiong Ji 1
- Danka Jokić 1
- Onno P. Kampman 1
- Anna Kanellopoulou 1
- Olha Kanishcheva 1
- Börje F. Karlsson 1
- Kun Kerdthaisong 1
- Jun Kevin 1
- Muhammad Ahsan Riaz Khan 1
- Aye Hninn Khine 1
- Fajri Koto 1
- Takdanai Kreangphet 1
- Alesia Lazarenka 1
- Haochen Li 1
- Chaya Liebeskind 1
- Noémi Ligeti-Nagy 1
- Adrian Xuan Wei Lim 1
- Wan Shen Lim 1
- Peerat Limkonchotiwat 1
- Veronika Lipp 1
- Irina Lobzhanidze 1
- Holy Lovenia 1
- Jiayun Luo 1
- Stella Markantonatou 1
- Jelena M. Marković 1
- Thant Thiri Maung 1
- Amália Mendes 1
- Lester James Validad Miranda 1
- Patricia Nicole Monderin 1
- Joel Ruben Antony Moniz 1
- Johanna Monti 1
- Adisai Na-Thalang 1
- Numaan Naeem 1
- Bahrul Ilmi Nasution 1
- Lynnette Hui Xian Ng 1
- Giang Nguyen 1
- Sanni Nimb 1
- William Nixon 1
- Nathalie Carmen Hau Norman 1
- Sussi Olsen 1
- Daniil Orel 1
- Petya Osenova 1
- Adriana Silvina Pagano 1
- Kadek Hendrawan Palgunadi 1
- Hitesh Laxmichand Patel 1
- Priyaranjan Pattnayak 1
- Bolette Sandford Pedersen 1
- Marija Pendevska 1
- Fred Philippy 1
- Kaung Si Phyo 1
- Thomas Pickard 1
- Kevin Pratama 1
- Muhammad Reza Qorib 1
- Taki Hasan Rafi 1
- Rian Adam Rajagede 1
- María Del Mar Sánchez Ramos 1
- Rozane Rebechi 1
- Laura Rituma 1
- Ieva Rizgeliene 1
- Antoni Brosa Rodríguez 1
- Matthew Theodore Roque 1
- Jostin Jerico Rosal 1
- Manuel Antonio Rufino 1
- Zahra Saaberi 1
- Saptarshi Saha 1
- Anjela Gail D. Santos 1
- Tim Santos 1
- Richardy Lobo Sapan 1
- Josue Alejandro Sauca 1
- Regina E. Semou 1
- Masoumeh Seyyedrezaei 1
- Mehrnoush Shamsfard 1
- Sarvinoz Sharipova 1
- Christian Simon 1
- Ayushman Singh 1
- Inguna Skadina 1
- Yueqi Song 1
- Srdjan Sucur 1
- Supryadi 1
- Muhammad Rizky Sya’ban 1
- Vahide Tajalli 1
- Filbert Aurelian Tjiaranata 1
- Dilara Torunoğlu-Selamet 1
- Samia Touileb 1
- Eleni Triantafyllidi 1
- Can Udomcharoenchaikit 1
- Kingsley O. Ugwuanyi 1
- Baiba Valkovska 1
- Giedre Valunaite Oleskeviciene 1
- Kanyakorn Veerakanjana 1
- Dan John Velasco 1
- Erik Velldal 1
- Aline Villavicencio 1
- Karissa Vincentio 1
- Bin Wang 1
- Chengwei Wei 1
- Robert Wijaya 1
- Rodrigo Wilkens 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Beata Wójtowicz 1
- Zhuohan Xie 1
- Olha Yatsyshyna 1
- Yelda Yeşildal Eraydın 1
- Yanzhi Yu 1
- Eryawan Presma Yulianrifat 1
- Hanif Muhammad Zhafran 1
- Ruochen Zhang 1
- Lilja Øvrelid 1