Ej Zhou
Also published as: Yijie Zhou
2026
Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking
Songbo Hu | Yinhong Liu | Ej Zhou | Evgeniia Razumovskaia | Xiaobin Wang | Alexander Fraser | Ivan Vuli\'c | Anna Korhonen
Findings of the Association for Computational Linguistics: ACL 2026
Songbo Hu | Yinhong Liu | Ej Zhou | Evgeniia Razumovskaia | Xiaobin Wang | Alexander Fraser | Ivan Vuli\'c | Anna Korhonen
Findings of the Association for Computational Linguistics: ACL 2026
Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)–based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Jiaqi Weng | Han Zheng | Hanyu Zhang | Ej Zhou | Qinqin He | Jialing Tao | Hui Xue | Zhixuan Chu | Xiting Wang
Findings of the Association for Computational Linguistics: ACL 2026
Jiaqi Weng | Han Zheng | Hanyu Zhang | Ej Zhou | Qinqin He | Jialing Tao | Hui Xue | Zhixuan Chu | Xiting Wang
Findings of the Association for Computational Linguistics: ACL 2026
Sparse autoencoders (SAEs) enable interpretability research by decomposing entangled model activations into monosemantic features. However, under what circumstances SAEs derive most fine-grained latent features for safety—a low-frequency concept domain—remains unexplored. Two key challenges exist: identifying SAEs with the greatest potential for generating safety domain-specific features, and the prohibitively high cost of detailed feature explanation. In this paper, we propose **Safe-SAIL**, a unified framework for interpreting SAE features in safety-critical domains to advance mechanistic understanding of large language models. Safe-SAIL introduces a pre-explanation evaluation metric to efficiently identify SAEs with strong safety domain-specific interpretability, and reduces interpretation cost by 55% through a segment-level simulation strategy. Building on Safe-SAIL, we train a comprehensive suite of SAEs with human-readable explanations and systematic evaluations for 1,758 safety-related features spanning four domains: pornography, politics, violence, and terror. Using this resource, we conduct empirical analyses and provide insights on the effectiveness of Safe-SAIL for risk feature identification and how safety-critical entities and concepts are encoded across model layers. All models, explanations, and tools are publicly released in an open-source toolkit at https://anonymous.4open.science/r/Safe-SAIL/.
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
2025
Extended Abstract for “Linguistic Universals”: Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders
Ej Zhou | Suchir Salhan
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Ej Zhou | Suchir Salhan
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Do independently trained monolingual language models converge on shared linguistic principles? To explore this question, we propose to analyze a suite of models trained separately on single languages but with identical architectures and budgets. We train sparse autoencoders (SAEs) on model activations to obtain interpretable latent features, then align them across languages using activation correlations. We do pairwise analyses to see if feature spaces show non-trivial convergence, and we identify universal features that consistently emerge across diverse models. Positive results will provide evidence that certain high-level regularities in language are rediscovered independently in machine learning systems.
2023
Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation
Yulong Chen | Huajian Zhang | Yijie Zhou | Xuefeng Bai | Yueguan Wang | Ming Zhong | Jianhao Yan | Yafu Li | Judy Li | Xianchao Zhu | Yue Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yulong Chen | Huajian Zhang | Yijie Zhou | Xuefeng Bai | Yueguan Wang | Ming Zhong | Jianhao Yan | Yafu Li | Judy Li | Xianchao Zhu | Yue Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can contain errors from both summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions. We conduct thorough analysis on ConvSumX and 3 widely-used manually annotated CLS corpora and empirically find that ConvSumX is more faithful towards input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both conversation and summary as input to simulate human annotation process. Experimental results show that 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation. Analysis shows that both source input text and summary are crucial for modeling cross-lingual summaries.
Search
Fix author
Co-authors
- Idris Abdulmumin 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Amit Agarwal 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- David Anugraha 1
- Michael Anugraha 1
- Catherine Arnett 1
- Xuefeng Bai (白雪峰) 1
- Mithil Bangera 1
- Yeshil Bangera 1
- Weerayut Buaphet 1
- Laurie Burchell 1
- Lanwenn ar C'horr 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Yulong Chen 1
- Leshem Choshen 1
- Zhixuan Chu 1
- Thibault Cl\'erice 1
- Rasul Dent 1
- Konstantin Dobler 1
- Karan Dua 1
- Vicky Feliren 1
- Luca Foppiano 1
- Alexander Fraser 1
- Dmitry Gaynullin 1
- Manuel Goul\~ao 1
- Tommaso Green 1
- Muhammad Ravi Shulthan Habibi 1
- Nadia Ghezaiel Hammouda 1
- Ikhlasul Akmal Hanif 1
- Qinqin He 1
- Songbo Hu 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Fenal Ashokbhai Ilasariya 1
- Joseph Marvin Imperial 1
- Amr Keleg 1
- Kun Kerdthaisong 1
- Ilker Kesen 1
- Jun Kevin 1
- Anna Korhonen 1
- Bruhan Kyomuhendo 1
- Yiyuan Li 1
- Yafu Li 1
- Judy Li 1
- Yinhong Liu 1
- Sarah K. K. Luger 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Juan Pablo Martínez 1
- Sara Hincapi\'e Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Hamada Nayel 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Quentin Pag\`es 1
- Srikant Panda 1
- Hitesh Laxmichand Patel 1
- Vallerie Alexandra Putra 1
- Ingrid Gabriela Franco Ramirez 1
- Evgeniia Razumovskaia 1
- Benjamin L Rice 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Benoît Sagot 1
- Suchir Salhan 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Jakhongir Saydaliev 1
- Pavel Stepachev 1
- Damian Stewart 1
- Sotaro Takeshita 1
- Jialing Tao 1
- Filbert Aurelian Tjiaranata 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Ivan Vuli\'c 1
- Ahmad Mustapha Wali 1
- Xiaobin Wang 1
- Xiting Wang 1
- Yueguan Wang 1
- Azmine Toushik Wasi 1
- Jiaqi Weng 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Hui Xue 1
- Jianhao Yan 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Hanyu Zhang 1
- Mike Zhang 1
- Huajian Zhang 1
- Yue Zhang 1
- Han Zheng 1
- Ming Zhong 1
- Xianchao Zhu 1