Zhuohan Xie
2026
RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?
Yuyang Dai | Yan Lin | Zhuohan Xie | Yuxia Wang
Findings of the Association for Computational Linguistics: ACL 2026
Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, problems often rely on implicit assumptions that are taken for granted rather than stated explicitly, causing problems to appear solvable while lacking enough information for a definite answer. We introduce RealFin, a bilingual benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions while keeping them linguistically plausible. Based on this, we evaluate models under three formulations that test answering, recognizing missing information, and rejecting unjustified options, and find consistent performance drops when key conditions are absent. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results highlight a critical gap in current evaluations and show that reliable financial models must know when a question should not be answered. The dataset and code are available at https://github.com/insait-institute/RealFin.
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Yuxia Wang | Rui Xing | Jonibek Mansurov | Giovanni Puccetti | Zhuohan Xie | Minh Ngoc Ta | Jiahui Geng | Jinyan Su | Mervat Abassy | Saadeldine Eletter | Kareem Elozeiri | Nurkhan Laiyk | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Alexander Aziz | Ryuto Koike | Masahiro Kaneko | Artem Shelmanov | Ekaterina Artemova | Vladislav Mikhailov | Akim Tsvigun | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written one is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Prompting by explicitly explaining the distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
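To make the idea of a parameterized symbolic template concrete, here is a minimal sketch, assuming a single compound-interest topic. The function names, the parameter ranges, and the tolerance check are illustrative assumptions, not FinChain's actual templates and not the ChainEval metric.

```python
import random

def compound_growth_template(seed):
    """Hypothetical template: sample parameters, then compute every step in code."""
    rng = random.Random(seed)                     # seeding makes each instance reproducible
    principal = rng.randrange(1000, 10001, 500)   # assumed parameter ranges
    rate_pct = rng.randint(1, 10)
    years = rng.randint(1, 5)

    growth = (1 + rate_pct / 100) ** years        # step 1: growth factor
    final_value = principal * growth              # step 2: final balance

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )
    steps = [("growth_factor", growth), ("final_value", final_value)]
    return {"question": question, "steps": steps, "answer": final_value}

def steps_match(instance, predicted, tol=0.01):
    """Simplified check (not ChainEval): every predicted step value must match gold."""
    gold = instance["steps"]
    return len(predicted) == len(gold) and all(
        abs(p - g) <= tol for p, (_, g) in zip(predicted, gold)
    )
```

Because the gold steps are produced by executable code rather than written by hand, a new seed yields a fresh instance whose intermediate values can be checked mechanically.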
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
Muhammad Dehan Al Kautsar | Saeed Almheiri | Momina Ahsan | Bilal Elbouardi | Younes Samih | Sarfraz Ahmad | Amr Keleg | Omar El Herraoui | Kareem Elzeky | Abed Alhakim Freihat | Mohamed Anwar | Zhuohan Xie | Junhong Liang | Mohammad Rustom Al Nasar | Preslav Nakov | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection tasks (MFMD). In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project is available at https://github.com/lzw108/FMD.
FinCARDS: Card-Based Analyst Reranking for Financial Document Question Answering
Yixi Zhou | Fan Zhang | YU Chen | Haipeng Zhang | Preslav Nakov | Zhuohan Xie
Findings of the Association for Computational Linguistics: ACL 2026
Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM-based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FINCARDS, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance-aware schema. FINCARDS represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field-level matching. Evidence is selected via a multi-stage tournament reranking with stability-aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FINCARDS substantially improves early-rank retrieval over both lexical and LLM-based reranking baselines, while reducing ranking variance, without requiring model fine-tuning or unpredictable inference budgets. Our code is available at https://github.com/XanderZhou2022/FINCARDS.
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event–cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event–cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
2025
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration
Jiahui Geng | Qing Li | Zongxiong Chen | Yuxia Wang | Derui Zhu | Zhuohan Xie | Chenyang Lyu | Xiuying Chen | Preslav Nakov | Fakhri Karray
Findings of the Association for Computational Linguistics: ACL 2025
The rapid advancement of vision-language models (VLMs) has brought a lot of attention to their safety alignment. However, existing methods have primarily focused on model undersafety, where the model responds to hazardous queries, while neglecting oversafety, where the model refuses to answer safe queries. In this paper, we introduce the concept of safety calibration, which systematically addresses both undersafety and oversafety. Specifically, we present VSCBench, a novel dataset of 3,600 image-text pairs that are visually or textually similar but differ in terms of safety, which is designed to evaluate safety calibration across image-centric and text-centric scenarios. Based on our benchmark, we evaluate safety calibration across eleven widely used VLMs. Our extensive experiments revealed major issues with both undersafety and oversafety. We further investigated four approaches to improve the model’s safety calibration. We found that even though some methods effectively calibrated the models’ safety problems, these methods also lead to the degradation of models’ utility. This trade-off underscores the urgent need for advanced calibration methods, and our benchmark provides a valuable tool for evaluating future approaches.
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Mukhammed Togmanov | Nurdaulet Mukhituly | Diana Turmakhan | Jonibek Mansurov | Maiya Goloburda | Akhmed Sakip | Zhuohan Xie | Yuxia Wang | Bekassyl Syzdykov | Nurkhan Laiyk | Alham Fikri Aji | Ekaterina Kochmar | Preslav Nakov | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite having a population of twenty million, Kazakhstan’s culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan’s bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings highlight significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs.
Entity Framing and Role Portrayal in the News
Tarek Mahmoud | Zhuohan Xie | Dimitar Iliyanov Dimitrov | Nikolaos Nikolaidis | Purificação Silvano | Roman Yangarber | Shivam Sharma | Elisa Sartori | Nicolas Stefanovitch | Giovanni Da San Martino | Jakub Piskorski | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
We introduce a novel multilingual and hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
SemEval 2025 Task 10: Multilingual Characterization and Extraction of Narratives from Online News
Jakub Piskorski | Tarek Mahmoud | Nikolaos Nikolaidis | Ricardo Campos | Alipio Mario Jorge | Dimitar Dimitrov | Purificação Silvano | Roman Yangarber | Shivam Sharma | Tanmoy Chakraborty | Nuno Guimaraes | Elisa Sartori | Nicolas Stefanovitch | Zhuohan Xie | Preslav Nakov | Giovanni Da San Martino
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
We introduce SemEval-2025 Task 10 on Multilingual Characterization and Extraction of Narratives from Online News, which focuses on the identification and analysis of narratives in online news media. The task is structured into three subtasks: (1) Entity Framing, to identify the roles that relevant entities play within narratives, (2) Narrative Classification, to assign documents fine-grained narratives according to a given, topic-specific taxonomy of narrative labels, and (3) Narrative Extraction, to provide a justification for the dominant narrative of the document. To this end, we analyze news articles across two critical domains, Ukraine-Russia War and Climate Change, in five languages: Bulgarian, English, Hindi, Portuguese, and Russian. This task introduces a novel multilingual and multifaceted framework for studying how online news media construct and disseminate manipulative narratives. By addressing these challenges, our work contributes to the broader effort of detecting, understanding, and mitigating the spread of propaganda and disinformation. The task attracted a lot of interest: 310 teams registered, with 66 submitting official results on the test set.
BERTastic at SemEval-2025 Task 10: State-of-the-Art Accuracy in Coarse-Grained Entity Framing for Hindi News
Tarek Mahmoud | Zhuohan Xie | Preslav Nakov
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
We describe our system for SemEval-2025 Task 10 Subtask 1 on coarse-grained entity framing in Hindi news, exploring two complementary strategies. First, we experiment with LLM prompting using GPT-4o, comparing hierarchical multi-step prompting with native single-step prompting for both main and fine-grained role prediction. Second, we conduct an extensive study on fine-tuning XLM-R, analyzing different context granularities (full article, paragraph, or sentence-level entity mentions), monolingual vs. multilingual settings, and main vs. fine-grained role labels. Our best system, trained on fine-grained role annotations across languages using sentence-level context, achieved 43.99% exact match, 56.56% precision, 47.38% recall, and 51.57% F1-score. Notably, our system set a new state-of-the-art for main role prediction on Hindi news, achieving 78.48% accuracy, outperforming the next best model at 76.90%, as per the official leaderboard. Our findings highlight effective strategies for entity framing in multilingual and low-resource settings.
FIRE: Fact-checking with Iterative Retrieval and Verification
Zhuohan Xie | Rui Xing | Yuxia Wang | Jiahui Geng | Hasan Iqbal | Dhruv Sahnan | Iryna Gurevych | Preslav Nakov
Findings of the Association for Computational Linguistics: NAACL 2025
Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model’s internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations.
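The control flow described in this abstract, where one step either returns a verdict or issues another search query, might be sketched as follows. This is a minimal illustration under stated assumptions: `verify` and `search` stand in for an LLM call and a search API, the toy stubs and the fallback label are hypothetical, and none of it is taken from the FIRE codebase.

```python
def iterative_fact_check(claim, verify, search, max_searches=3):
    """Alternate between verification and retrieval until the verifier commits.

    verify(claim, evidence) returns ("answer", label) or ("search", query).
    search(query) returns one evidence snippet as a string.
    """
    evidence = []
    while True:
        action, value = verify(claim, evidence)
        if action == "answer":                 # verifier is confident: stop early
            return value, len(evidence)
        if len(evidence) >= max_searches:      # search budget exhausted (assumed fallback)
            return "not enough evidence", len(evidence)
        evidence.append(search(value))         # otherwise retrieve once more and retry

# Toy stand-ins so the sketch runs without any model or network access.
def toy_verify(claim, evidence):
    if any("1889" in snippet for snippet in evidence):
        return "answer", "supported"
    return "search", claim

def toy_search(query):
    return "The Eiffel Tower was completed in 1889."

result = iterative_fact_check(
    "The Eiffel Tower was completed in 1889.", toy_verify, toy_search
)
```

With these stubs the verifier commits after a single retrieval, which is the source of the cost saving the abstract reports: searches are issued only while the verifier remains unsure.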
A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs
Artem Shelmanov | Ekaterina Fadeeva | Akim Tsvigun | Ivan Tsvigun | Zhuohan Xie | Igor Kiselev | Nico Daheim | Caiqi Zhang | Artem Vazhentsev | Mrinmaya Sachan | Preslav Nakov | Timothy Baldwin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information, and users generally lack the tools to detect when this happens. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the transformer architecture in their design, in the form of informative features derived from LLM attention maps and logits. Our experiments show that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma. We publicly release both the code and the pre-trained heads.
2024
LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
Mervat Abassy | Kareem Elozeiri | Alexander Aziz | Minh Ngoc Ta | Raj Vardhan Tomar | Bimarsha Adhikari | Saad El Dine Ahmed | Yuxia Wang | Osama Mohammed Afzal | Zhuohan Xie | Jonibek Mansurov | Ekaterina Artemova | Vladislav Mikhailov | Rui Xing | Jiahui Geng | Hasan Iqbal | Zain Muhammad Mujahid | Tarek Mahmoud | Akim Tsvigun | Alham Fikri Aji | Artem Shelmanov | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The ease of access to large language models (LLMs) has enabled the widespread production of machine-generated texts, and now it is often hard to tell whether a piece of text was human-written or machine-generated. This raises concerns about potential misuse, particularly within educational and academic domains. Thus, it is important to develop practical systems that can automate the process. Here, we present one such system, LLM-DetectAIve, designed for fine-grained detection. Unlike most previous work on machine-generated text detection, which focused on binary classification, LLM-DetectAIve supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished. Category (iii) aims to detect attempts to obfuscate the fact that a text was machine-generated, while category (iv) looks for cases where the LLM was used to polish a human-written text, which is typically acceptable in academic writing, but not in education. Our experiments show that LLM-DetectAIve can effectively identify the above four categories, which makes it a potentially useful tool in education, academia, and other domains. LLM-DetectAIve is publicly accessible at https://github.com/mbzuai-nlp/LLM-DetectAIve. The video describing our system is available at https://youtu.be/E8eT_bE7k8c.
2023
DeltaScore: Fine-Grained Story Evaluation with Perturbations
Zhuohan Xie | Miao Li | Trevor Cohn | Jey Lau
Findings of the Association for Computational Linguistics: EMNLP 2023
Numerous evaluation metrics have been developed for natural language generation tasks, but their effectiveness in evaluating stories is limited as they are not specifically tailored to assess intricate aspects of storytelling, such as fluency and interestingness. In this paper, we introduce DeltaScore, a novel methodology that uses perturbation techniques for the evaluation of nuanced story aspects. We posit that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations (e.g., the introduction of typos). Given this, we measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models. We compare DeltaScore with existing metrics on storytelling datasets from two domains in five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. DeltaScore demonstrates strong performance, revealing a surprising finding that one specific perturbation proves highly effective in capturing multiple aspects. Source code is available on our GitHub repository.
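The likelihood-difference idea can be illustrated with a small sketch. The abstract uses pre-trained language models; to stay self-contained, the code below swaps in a toy add-one-smoothed unigram scorer, and the typo perturbation and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def make_unigram_scorer(corpus):
    """Toy stand-in for a pre-trained LM: add-one smoothed unigram log-likelihood."""
    counts = Counter(corpus.split())
    total = sum(counts.values())
    vocab = len(counts) + 1                    # one extra slot for unseen words

    def log_likelihood(text):
        return sum(
            math.log((counts[w] + 1) / (total + vocab)) for w in text.split()
        )
    return log_likelihood

def add_typos(text):
    """Perturbation: swap the first two characters of each word longer than three letters."""
    return " ".join(
        w[1] + w[0] + w[2:] if len(w) > 3 else w for w in text.split()
    )

def delta_score(text, perturb, log_likelihood):
    """Likelihood drop caused by the perturbation; larger means more susceptible."""
    return log_likelihood(text) - log_likelihood(perturb(text))

scorer = make_unigram_scorer("the cat sat on the mat the cat slept")
delta = delta_score("the cat slept", add_typos, scorer)
```

Here the typo turns a known word into an unseen one, so the perturbed text scores lower and `delta` is positive; a text the perturbation leaves unchanged gets a delta of zero.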
The Next Chapter: A Study of Large Language Models in Storytelling
Zhuohan Xie | Trevor Cohn | Jey Han Lau
Proceedings of the 16th International Natural Language Generation Conference
To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensive investigation, utilizing both automatic and human evaluation, to compare the story generation capacity of LLMs with recent models across three datasets with variations in style, register, and length of stories. The results demonstrate that LLMs generate stories of significantly higher quality compared to other story generation models. Moreover, they exhibit a level of performance that competes with human authors, albeit with the preliminary observation that they tend to replicate real stories in situations involving world knowledge, resembling a form of plagiarism.
2021
Exploring Story Generation with Multi-task Objectives in Variational Autoencoders
Zhuohan Xie | Jey Han Lau | Trevor Cohn
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
GPT-2 has been frequently adapted in story generation models as it provides powerful generative capability. However, it still fails to generate consistent stories and lacks diversity. Current story generation models incorporate additional information such as plots or commonsense knowledge into GPT-2 to guide the generation process. These approaches focus on improving the generation quality of stories, while our work looks at both quality and diversity. We explore combining BERT and GPT-2 to build a variational autoencoder (VAE), and extend it by adding additional objectives to learn global features such as story topic and discourse relations. Our evaluations show that our enhanced VAE can provide a better quality and diversity trade-off, generate less repetitive story content, and learn a more informative latent variable.
2019
From Shakespeare to Li-Bai: Adapting a Sonnet Model to Chinese Poetry
Zhuohan Xie | Jey Han Lau | Trevor Cohn
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
In this paper, we adapt Deep-speare, a joint neural network model for English sonnets, to Chinese poetry. We illustrate the characteristics of Chinese quatrains and explain our architecture as well as our training and generation procedures, which differ from those for Shakespearean sonnets in several aspects. We analyse the generated poetry and find that the model works well for Chinese poetry, as it can: (1) generate coherent 4-line quatrains on different topics; and (2) capture rhyme automatically (to a certain extent).
Co-authors
- Preslav Nakov 14
- Yuxia Wang 9
- Tarek Mahmoud 6
- Jiahui Geng 5
- Rui Xing 5
- Alham Fikri Aji 4
- Trevor Cohn 4
- Iryna Gurevych 4
- Jonibek Mansurov 4
- Artem Shelmanov 4
- Akim Tsvigun 4
- Mervat Abassy 3
- Sophia Ananiadou 3
- Ekaterina Artemova 3
- Maiya Goloburda 3
- Nizar Habash 3
- Jimin Huang 3
- Fajri Koto 3
- Nurkhan Laiyk 3
- Jey Han Lau 3
- Vladislav Mikhailov 3
- Xueqing Peng 3
- Jinyan Su 3
- Minh Ngoc Ta 3
- Raj Vardhan Tomar 3
- Osama Mohammed Afzal 2
- Sarfraz Ahmad 2
- Momina Ahsan 2
- Saeed Almheiri 2
- Alexander Aziz 2
- Tanmoy Chakraborty 2
- Giovanni Da San Martino 2
- Rania Elbadry 2
- Kareem Elozeiri 2
- Hasan Iqbal 2
- Masahiro Kaneko 2
- Ryuto Koike 2
- Salem Lahlou 2
- Nikolaos Nikolaidis 2
- Jakub Piskorski 2
- Giovanni Puccetti 2
- Lingfei Qian 2
- Dhruv Sahnan 2
- Elisa Sartori 2
- Shivam Sharma 2
- Purificação Silvano 2
- Nicolas Stefanovitch 2
- Veselin Stoyanov 2
- Chen Xu 2
- Roman Yangarber 2
- Fan Zhang 2
- Bimarsha Adhikari 1
- Saad El Dine Ahmed 1
- Muhammad Dehan Al Kautsar 1
- Mohammad Rustom Al Nasar 1
- Muhra AlMahri 1
- Abdulrazzaq Alnajjar 1
- Mohamed Anwar 1
- Timothy Baldwin 1
- Debopriyo Banerjee 1
- Dani Bouch 1
- Ricardo Campos 1
- Yupeng Cao 1
- Zongxiong Chen 1
- Xiuying Chen 1
- Ming-Bin Chen 1
- YU Chen 1
- Nico Daheim 1
- Yuyang Dai 1
- Dimitar Iliyanov Dimitrov 1
- Dimitar Dimitrov 1
- Saad El Dine Ahmed El Etter 1
- Omar El Herraoui 1
- Bilal Elbouardi 1
- Saadeldine Eletter 1
- Kareem Ashraf Elozeiri 1
- Kareem Elzeky 1
- Ekaterina Fadeeva 1
- Abed Alhakim Freihat 1
- Georgi Nenkov Georgiev 1
- Polydoros Giannouris 1
- Nuno Guimarães 1
- Ahmed Heakl 1
- Yuechen Jiang 1
- Mohsinul Kabir 1
- Fakhri Karray 1
- Amr Keleg 1
- Marwa Elsaid Khalil 1
- Igor Kiselev 1
- Ekaterina Kochmar 1
- Ivan Koychev 1
- Jey Lau 1
- Qing Li 1
- Haonan Li 1
- Miao Li 1
- Junhong Liang 1
- Yan Lin 1
- Zhiwei Liu 1
- Xue Liu 1
- Alejandro Lopez-Lira 1
- Chenyang Lyu 1
- Hachem Madmoun 1
- Alípio Mario Jorge 1
- Zain Muhammad Mujahid 1
- Nurdaulet Mukhituly 1
- Daniil Orel 1
- Triantafillos Papadopoulos 1
- Mrinmaya Sachan 1
- Akhmed Sakip 1
- Younes Samih 1
- Aaryamonvikram Singh 1
- Harry Stuart 1
- Bekassyl Syzdykov 1
- Md. Tariquzzaman 1
- Rushil Thareja 1
- Paul Thompson 1
- Prayag Tiwari 1
- Mukhammed Togmanov 1
- Ivan Tsvigun 1
- Diana Turmakhan 1
- Artem Vazhentsev 1
- Yan Wang 1
- Ziyang Xu 1
- Ye Yuan 1
- Haipeng Zhang 1
- Caiqi Zhang 1
- Yixi Zhou 1
- Derui Zhu 1
- Tianlei Zhu 1