Yiyuan Li
2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
2024
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in LLMs
Yiyuan Li | Shichao Sun | Pengfei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Yiyuan Li | Shichao Sun | Pengfei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
2023
Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models
Yiyuan Li | Rakesh Menon | Sayan Ghosh | Shashank Srivastava
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Yiyuan Li | Rakesh Menon | Sayan Ghosh | Shashank Srivastava
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Generalized quantifiers (e.g., few, most) are used to indicate the proportions predicates satisfy (for example, some apples are red). One way to interpret quantifier semantics is to explicitly bind these satisfactions with percentage scopes (e.g., 30%-40% of apples are red). This approach can be helpful for tasks like logic formalization and surface-form quantitative reasoning (Gordon and Schubert, 2010; Roy et al., 2015). However, it remains unclear if recent foundation models (Bommasani et al., 2021) possess this ability due to the absence of direct training signals. To explore this, we introduce QuRe, a crowd-sourced dataset of human-annotated generalized quantifiers in Wikipedia sentences featuring percentage-equipped predicates. We explore quantifier comprehension using PRESQUE, a framework that combines natural language inference and the Rational Speech Acts framework. Experimental results on the HVD dataset (Herbelot and Vecchi, 2015) and QuRe demonstrate PRESQUE’s superiority over a literal listener baseline, showing a 20% relative improvement in F1 in predicting percentage scopes for quantifiers, even with no additional training.
2022
SPE: Symmetrical Prompt Enhancement for Fact Probing
Yiyuan Li | Tong Che | Yezhen Wang | Zhengbao Jiang | Caiming Xiong | Snigdha Chaturvedi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Yiyuan Li | Tong Che | Yezhen Wang | Zhengbao Jiang | Caiming Xiong | Snigdha Chaturvedi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Pretrained language models (PLMs) have been shown to accumulate factual knowledge during pretraining (Petroni et al. 2019). Recent works probe PLMs for the extent of this knowledge through prompts either in discrete or continuous forms. However, these methods do not consider symmetry of the task: object prediction and subject prediction. In this work, we propose Symmetrical Prompt Enhancement (SPE), a continuous prompt-based method for factual probing in PLMs that leverages the symmetry of the task by constructing symmetrical prompts for subject and object prediction. Our results on a popular factual probing dataset, LAMA, show significant improvement of SPE over previous probing methods.
2021
Adversarial Scrubbing of Demographic Information for Text Classification
Somnath Basu Roy Chowdhury | Sayan Ghosh | Yiyuan Li | Junier Oliva | Shashank Srivastava | Snigdha Chaturvedi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Somnath Basu Roy Chowdhury | Sayan Ghosh | Yiyuan Li | Junier Oliva | Shashank Srivastava | Snigdha Chaturvedi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Contextual representations learned by language models can often encode undesirable attributes, like demographic associations of the users, while being trained for an unrelated target task. We aim to scrub such undesirable attributes and learn fair representations while maintaining performance on the target task. In this paper, we present an adversarial learning framework “Adversarial Scrubber” (AdS), to debias contextual representations. We perform theoretical analysis to show that our framework converges without leaking demographic information under certain conditions. We extend previous evaluation techniques by evaluating debiasing performance using Minimum Description Length (MDL) probing. Experimental evaluations on 8 datasets show that AdS generates representations with minimal information about demographic attributes while being maximally informative about the target task.
2020
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.
Search
Fix author
Co-authors
- Snigdha Chaturvedi 2
- Sayan Ghosh 2
- Shashank Srivastava 2
- Idris Abdulmumin 1
- Steven Abney 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Amit Agarwal 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Antonios Anastasopoulos 1
- Nicholas Andrews 1
- David Anugraha 1
- Michael Anugraha 1
- Catherine Arnett 1
- Mithil Bangera 1
- Yeshil Bangera 1
- Weerayut Buaphet 1
- Laurie Burchell 1
- Lanwenn ar C'horr 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Aditi Chaudhary 1
- Tong Che 1
- Jennette Child 1
- Sara Child 1
- Leshem Choshen 1
- Somnath Basu Roy Chowdhury 1
- Thibault Cl\'erice 1
- Hilaria Cruz 1
- Rasul Dent 1
- Konstantin Dobler 1
- Karan Dua 1
- Vicky Feliren 1
- Luca Foppiano 1
- Dmitry Gaynullin 1
- Luke Gessler 1
- Manuel Goul\~ao 1
- Tommaso Green 1
- Muhammad Ravi Shulthan Habibi 1
- Nadia Ghezaiel Hammouda 1
- Ikhlasul Akmal Hanif 1
- Shirley Anugrah Hayati 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Fenal Ashokbhai Ilasariya 1
- Joseph Marvin Imperial 1
- Zhengbao Jiang 1
- Amr Keleg 1
- Kun Kerdthaisong 1
- Ilker Kesen 1
- Jun Kevin 1
- Rebecca Knowles 1
- Bruhan Kyomuhendo 1
- Matthew Lee 1
- Xinjian Li 1
- Patrick Littell 1
- Pengfei Liu 1
- Sarah K. K. Luger 1
- Jordan MacKenzie 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Juan Pablo Martínez 1
- Rakesh R Menon 1
- Jeffrey Micher 1
- Sarah Moeller 1
- Sara Hincapi\'e Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Hamada Nayel 1
- Graham Neubig 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Junier Oliva 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Quentin Pag\`es 1
- Alexis Palmer 1
- Srikant Panda 1
- Hitesh Laxmichand Patel 1
- Emily Prud’hommeaux 1
- Vallerie Alexandra Putra 1
- Ingrid Gabriela Franco Ramirez 1
- Benjamin L Rice 1
- Shruti Rijhwani 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Benoît Sagot 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Jakhongir Saydaliev 1
- Roshan Sharma 1
- Pavel Stepachev 1
- Damian Stewart 1
- Shichao Sun 1
- Sotaro Takeshita 1
- Filbert Aurelian Tjiaranata 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Ahmad Mustapha Wali 1
- Yezhen Wang 1
- Azmine Toushik Wasi 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Mengzhou Xia 1
- Caiming Xiong 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Olga Zamaraeva 1
- Mike Zhang 1
- Ej Zhou 1
- Sydney Zink 1