Jinyan Su
2026
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
GSM-Noise: Exploring and Enhancing Large Language Models’ Reasoning under Noisy Inputs
Zhengxin Zhang | Chengyu Huang | Xufu Liu | Dan Zhao | Jinyan Su | Claire Cardie
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet they often struggle when dealing with complex, ill-formed, or noisy inputs that frequently occur in interactions with real users. LLMs typically lack the crucial refining capabilities needed to filter out irrelevant details and restructure key points before reasoning over the text and responding, resulting in suboptimal performance and incorrect answers. From an information theory perspective, this behavior is akin to decoding a high-entropy problem without first reducing its entropy. In this work, we first introduce GSM-Noise, a benchmark featuring grade-school math problems systematically perturbed to reflect real-world input variability. We show that the reasoning ability of open-source models (e.g., LLaMA and Qwen series) can be compromised by noise, while closed-source models are more robust. To improve LLM robustness under noisy conditions, we propose that LLMs first refine inputs, thereby reducing their entropy, before engaging in in-depth analysis. We investigate three approaches to instill this refinement capability: prompt engineering (PE), supervised finetuning (SFT), and reinforcement learning (RL). Experimental results show that input refinement leads to consistent performance gains: 2–12% with PE, 4–13% with SFT, and 3–25% with RL. These results highlight the importance of incorporating an explicit refinement phase to enhance the robustness and reliability of LLM reasoning in real-world scenarios.
The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
Yan Wang | Yitao Xu | Nanhan Shen | Jinyan Su | Jimin Huang | Zining Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. Crucially, this inherent bias indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model’s natural optimization path, thereby limiting training efficiency and performance.
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Yuxia Wang | Rui Xing | Jonibek Mansurov | Giovanni Puccetti | Zhuohan Xie | Minh Ngoc Ta | Jiahui Geng | Jinyan Su | Mervat Abassy | Saadeldine Eletter | Kareem Elozeiri | Nurkhan Laiyk | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Alexander Aziz | Ryuto Koike | Masahiro Kaneko | Artem Shelmanov | Ekaterina Artemova | Vladislav Mikhailov | Akim Tsvigun | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written text is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.
2025
Corpus Poisoning via Approximate Greedy Gradient Descent
Jinyan Su | Preslav Nakov | Claire Cardie
Findings of the Association for Computational Linguistics: ACL 2025
Dense retrievers are widely used in information retrieval and have also been successfully extended to other knowledge intensive areas such as language models, e.g., Retrieval-Augmented Generation (RAG) systems. Unfortunately, they have recently been shown to be vulnerable to corpus poisoning attacks in which a malicious user injects a small fraction of adversarial passages into the retrieval corpus to trick the system into returning these passages among the top-ranked results for a broad set of user queries. Further study is needed to understand the extent to which these attacks could limit the deployment of dense retrievers in real-world applications. In this work, we propose Approximate Greedy Gradient Descent (AGGD), a new attack on dense retrieval systems based on the widely used HotFlip method for efficiently generating adversarial passages. We demonstrate that AGGD can select a higher quality set of token-level perturbations than HotFlip by replacing its random token sampling with a more structured search. Experimentally, we show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains. Notably, our method is extremely effective in attacking the ANCE retrieval model, achieving attack success rates that are 15.24% and 17.44% higher on the NQ and MS MARCO datasets, respectively, compared to HotFlip. Additionally, we demonstrate AGGD’s potential to replace HotFlip in other adversarial attacks, such as knowledge poisoning of RAG systems.
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
We present the GenAI Content Detection Task 1 – a shared task on binary machine-generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: during the test phase, 36 teams made official submissions to the Monolingual subtask and 27 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.
2024
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Chenxi Whitehouse | Osama Mohammed Afzal | Tarek Mahmoud | Toru Sasaki | Thomas Arnold | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4
Adapting Fake News Detection to the Era of Large Language Models
Jinyan Su | Claire Cardie | Preslav Nakov
Findings of the Association for Computational Linguistics: NAACL 2024
In the age of large language models (LLMs) and the widespread adoption of AI-driven content creation, the landscape of information dissemination has witnessed a paradigm shift. With the proliferation of both human-written and machine-generated real and fake news, robustly and effectively discerning the veracity of news articles has become an intricate challenge. While substantial research has been dedicated to fake news detection, it has either assumed that all news articles are human-written or has abruptly assumed that all machine-generated news was fake. Thus, a significant gap exists in understanding the interplay between machine-paraphrased real news, machine-generated fake news, human-written fake news, and human-written real news. In this paper, we study this gap by conducting a comprehensive evaluation of fake news detectors trained in various scenarios. Our primary objectives revolve around the following pivotal question: How can we adapt fake news detectors to the era of LLMs? Our experiments reveal an interesting pattern that detectors trained exclusively on human-written articles can indeed perform well at detecting machine-generated fake news, but not vice versa. Moreover, due to the bias of detectors against machine-generated texts (CITATION), they should be trained on datasets with a lower machine-generated news ratio than the test set. Building on our findings, we provide a practical strategy for the development of robust fake news detectors.
M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold | Alham Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs: M4GT-Bench. The benchmark comprises three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.
SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
2023
DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text
Jinyan Su | Terry Zhuo | Di Wang | Preslav Nakov
Findings of the Association for Computational Linguistics: EMNLP 2023
With the rapid progress of large language models (LLMs) and the huge amount of text they generate, it becomes impractical to manually distinguish whether a text is machine-generated. The growing use of LLMs in social media and education prompts us to develop methods to detect machine-generated text, preventing malicious use such as plagiarism, misinformation, and propaganda. In this paper, we introduce two novel zero-shot methods for detecting machine-generated text by leveraging the Log-Rank information. One is called DetectLLM-LRR, which is fast and efficient, and the other is called DetectLLM-NPR, which is more accurate, but slower due to the need for perturbations. Our experiments on three datasets and seven language models show that our proposed methods improve over the state of the art by 3.9 and 1.75 AUROC points absolute. Moreover, DetectLLM-NPR needs fewer perturbations than previous work to achieve the same level of performance, which makes it more practical for real-world use. We also investigate the efficiency-performance trade-off based on users’ preference for these two measures and provide intuition for using them in practice effectively. We release the data and the code of both methods at https://github.com/mbzuai-nlp/DetectLLM.
Co-authors
- Preslav Nakov 8
- Yuxia Wang 6
- Tarek Mahmoud 5
- Jonibek Mansurov 5
- Artem Shelmanov 5
- Akim Tsvigun 5
- Osama Mohammed Afzal 4
- Iryna Gurevych 4
- Nizar Habash 4
- Giovanni Puccetti 4
- Alham Fikri Aji 3
- Thomas Arnold 3
- Claire Cardie 3
- Petar Ivanov 3
- Zhuohan Xie 3
- Rui Xing 3
- Mervat Abassy 2
- Ekaterina Artemova 2
- Jiahui Geng 2
- Maiya Goloburda 2
- Jimin Huang 2
- Masahiro Kaneko 2
- Ryuto Koike 2
- Nurkhan Laiyk 2
- Vladislav Mikhailov 2
- Minh Ngoc Ta 2
- Raj Vardhan Tomar 2
- Alham Aji 1
- Sophia Ananiadou 1
- Alexander Aziz 1
- Debopriyo Banerjee 1
- Tanmoy Chakraborty 1
- Saad El Dine Ahmed El Etter 1
- Rania Elbadry 1
- Saadeldine Eletter 1
- Kareem Elozeiri 1
- Kareem Ashraf Elozeiri 1
- Georgi Nenkov Georgiev 1
- Chengyu Huang 1
- Fajri Koto 1
- Ivan Koychev 1
- Salem Lahlou 1
- Haonan Li 1
- Xufu Liu 1
- Hachem Madmoun 1
- Daniil Orel 1
- Xueqing Peng 1
- Lingfei Qian 1
- Dhruv Sahnan 1
- Toru Sasaki 1
- Nanhan Shen 1
- Aaryamonvikram Singh 1
- Veselin Stoyanov 1
- Rushil Thareja 1
- Yan Wang 1
- Di Wang 1
- Chenxi Whitehouse 1
- Chen Xu 1
- Yitao Xu 1
- Fan Zhang 1
- Zhengxin Zhang 1
- Dan Zhao 1
- Zining Zhu 1
- Terry Zhuo 1