Tack Hwa Wong
2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapié Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C’horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Martínez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goulão | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pagès | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Benoît Sagot | Thibault Clérice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
2025
Language Surgery in Multilingual Large Language Models
Joanito Agili Lopo | Muhammad Ravi Shulthan Habibi | Tack Hwa Wong | Muhammad Ilham Ghozali | Fajri Koto | Genta Indra Winata | Peerat Limkonchotiwat | Alham Fikri Aji | Samuel Cahyawijaya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Co-authors
- Muhammad Ravi Shulthan Habibi 3
- Genta Indra Winata 3
- Amit Agarwal 2
- Alham Fikri Aji 2
- David Anugraha 2
- Michael Anugraha 2
- Mithil Bangera 2
- Yeshil Bangera 2
- Samuel Cahyawijaya 2
- Vicky Feliren 2
- Ikhlasul Akmal Hanif 2
- Fenal Ashokbhai Ilasariya 2
- Joseph Marvin Imperial 2
- Kun Kerdthaisong 2
- Jun Kevin 2
- Fajri Koto 2
- Peerat Limkonchotiwat 2
- Hitesh Laxmichand Patel 2
- Filbert Aurelian Tjiaranata 2
- Idris Abdulmumin 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Vallerie Alexandra Putra 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- Fadil Risdian Ansori 1
- Catherine Arnett 1
- Phakphum Artkaew 1
- Anab Maulana Barik 1
- Weerayut Buaphet 1
- Laurie Burchell 1
- Carlos Rafael Catalan 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Leshem Choshen 1
- Thibault Clérice 1
- Jan Christian Blaise Cruz 1
- Lanwenn ar C’horr 1
- John Amadeo Daniswara 1
- Rasul Dent 1
- Konstantin Dobler 1
- Karan Dua 1
- Meisyarah Dwiastuti 1
- Evan Evan 1
- Mohammad Rifqi Farhansyah 1
- Tirana Noor Fatyanosa 1
- Teddy Ferdinan 1
- Isaiah Edri W. Flores 1
- Luca Foppiano 1
- Dmitry Gaynullin 1
- Rifo Ahmad Genadi 1
- Muhammad Ilham Ghozali 1
- Manuel Goulão 1
- Tommaso Green 1
- M.Alif Al Hakim 1
- Nadia Ghezaiel Hammouda 1
- Kenneth Chen Ko Han 1
- Rochana Prih Hastuti 1
- Ming Shan Hee 1
- Frederikus Hudi 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Mahardika Krisna Ihsani 1
- Mohamed Fazli Mohamed Imam 1
- Piyalitt Ittichaiwong 1
- Audra Aurora Izzani 1
- Onno P. Kampman 1
- Börje F. Karlsson 1
- Amr Keleg 1
- Ilker Kesen 1
- Aye Hninn Khine 1
- Takdanai Kreangphet 1
- Jauza Akbar Krito 1
- Bruhan Kyomuhendo 1
- Haochen Li 1
- Yiyuan Li 1
- Adrian Xuan Wei Lim 1
- Wan Shen Lim 1
- Joanito Agili Lopo 1
- Holy Lovenia 1
- Sarah K. K. Luger 1
- Jiayun Luo 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Juan Pablo Martínez 1
- Thant Thiri Maung 1
- Lester James Validad Miranda 1
- Patricia Nicole Monderin 1
- Joel Ruben Antony Moniz 1
- Sara Hincapié Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Adisai Na-Thalang 1
- Bahrul Ilmi Nasution 1
- Hamada Nayel 1
- Lynnette Hui Xian Ng 1
- Giang Nguyen 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- William Nixon 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Quentin Pagès 1
- Kadek Hendrawan Palgunadi 1
- Srikant Panda 1
- Priyaranjan Pattnayak 1
- Kaung Si Phyo 1
- Salsabila Zahirah Pranida 1
- Kevin Pratama 1
- Muhammad Reza Qorib 1
- Taki Hasan Rafi 1
- Rian Adam Rajagede 1
- Ingrid Gabriela Franco Ramirez 1
- Benjamin L Rice 1
- Matthew Theodore Roque 1
- Jostin Jerico Rosal 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Manuel Antonio Rufino 1
- Benoît Sagot 1
- Saptarshi Saha 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Anjela Gail D. Santos 1
- Tim Santos 1
- Richardy Lobo Sapan 1
- Jakhongir Saydaliev 1
- Christian Simon 1
- Ayushman Singh 1
- Yueqi Song 1
- Pavel Stepachev 1
- Damian Stewart 1
- Supryadi 1
- Muhammad Rizky Sya’ban 1
- Sotaro Takeshita 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Can Udomcharoenchaikit 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Kanyakorn Veerakanjana 1
- Dan John Velasco 1
- Karissa Vincentio 1
- Ahmad Mustapha Wali 1
- Bin Wang 1
- Azmine Toushik Wasi 1
- Chengwei Wei 1
- Robert Wijaya 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Yanzhi Yu 1
- Eryawan Presma Yulianrifat 1
- Hanif Muhammad Zhafran 1
- Mike Zhang 1
- Ruochen Zhang 1
- Ej Zhou 1