Zhiwei Liu
Other people with similar names: Zhiwei Liu, Zhiwei Liu
Unverified author pages with similar names: Zhiwei Liu
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
Yuechen Jiang | Zhiwei Liu | Yupeng Cao | Yueru He | Ziyang Xu | Chen Xu | Zhiyang Deng | Prayag Tiwari | Xi Chen | Alejandro Lopez-Lira | Jimin Huang | Junichi Tsujii | Sophia Ananiadou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuechen Jiang | Zhiwei Liu | Yupeng Cao | Yueru He | Ziyang Xu | Chen Xu | Zhiyang Deng | Prayag Tiwari | Xi Chen | Alejandro Lopez-Lira | Jimin Huang | Junichi Tsujii | Sophia Ananiadou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce RFC-Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC-Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original–perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC-Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection tasks (MFMD). In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project is available at https://github.com/lzw108/FMD.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
Kailai Yang | Xiao Liu | Lei Ji | Hao Li | Xiao Liang | Zhiwei Liu | Yeyun Gong | Peng Cheng | Mao Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Kailai Yang | Xiao Liu | Lei Ji | Hao Li | Xiao Liang | Zhiwei Liu | Yeyun Gong | Peng Cheng | Mao Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents’ well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.
2025
Selective Preference Optimization via Token-Level Reward Function Estimation
Kailai Yang | Zhiwei Liu | Qianqian Xie | Jimin Huang | Erxue Min | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Kailai Yang | Zhiwei Liu | Qianqian Xie | Jimin Huang | Erxue Min | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advancements in LLM alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection without requiring strong, fine-grained supervision signals. We theoretically prove the feasibility of Direct Preference Optimization (DPO) as token-level reward function estimators, which applies to any existing alignment datasets and enables cost-efficient token selection with small-scale model sizes and training data. We then train an oracle model with DPO on the target data and utilize the estimated reward function to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing on 30% key tokens with up to 60% reduction in GPU training hours. We also explore SePO as a new paradigm for weak-to-strong generalization, showing that weak oracle models effectively supervise strong policy models with up to 16.8 more parameters. SePO also selects useful supervision signals from out-of-distribution data, alleviating the over-optimization problem.
RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning Based on Emotional Information
Zhiwei Liu | Kailai Yang | Qianqian Xie | Christine de Kock | Sophia Ananiadou | Eduard Hovy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhiwei Liu | Kailai Yang | Qianqian Xie | Christine de Kock | Sophia Ananiadou | Eduard Hovy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Misinformation is prevalent in various fields such as education, politics, health, etc., causing significant harm to society. However, current methods for cross-domain misinformation detection rely on effort- and resource-intensive fine-tuning and complex model structures. With the outstanding performance of LLMs, many studies have employed them for misinformation detection. Unfortunately, they focus on in-domain tasks and do not incorporate significant sentiment and emotion features (which we jointly call affect). In this paper, we propose RAEmoLLM, the first retrieval augmented (RAG) LLMs framework to address cross-domain misinformation detection using in-context learning based on affective information. RAEmoLLM includes three modules. (1) In the index construction module, we apply an emotional LLM to obtain affective embeddings from all domains to construct a retrieval database. (2) The retrieval module uses the database to recommend top K examples (text-label pairs) from source domain data for target domain contents. (3) These examples are adopted as few-shot demonstrations for the inference module to process the target domain content. The RAEmoLLM can effectively enhance the general performance of LLMs in cross-domain misinformation detection tasks through affect-based retrieval, without fine-tuning. We evaluate our framework on three misinformation benchmarks. Results show that RAEmoLLM achieves significant improvements compared to the other few-shot methods on three datasets, with the highest increases of 15.64%, 31.18%, and 15.73% respectively. This project is available at https://github.com/lzw108/RAEmoLLM.
Search
Fix author
Co-authors
- Sophia Ananiadou 5
- Jimin Huang 4
- Yupeng Cao 3
- Yuechen Jiang 3
- Alejandro Lopez-Lira 3
- Qianqian Xie 3
- Kailai Yang 3
- Xi Chen 2
- Zhiyang Deng 2
- Polydoros Giannouris 2
- Yueru He 2
- Triantafillos Papadopoulos 2
- Xueqing Peng 2
- Lingfei Qian 2
- Prayag Tiwari 2
- Jun’ichi Tsujii 2
- Yan Wang 2
- Ziyang Xu 2
- Chen Xu 2
- Saeed Almheiri 1
- Abdulrazzaq Alnajjar 1
- Nuo Chen 1
- Ming-Bin Chen 1
- Peng Cheng 1
- Arman Cohan 1
- Yun Feng 1
- Heming Fu 1
- Penglei Gao 1
- Yeyun Gong 1
- Yuqing Guo 1
- Yi Han 1
- Huan He 1
- Eduard Hovy 1
- Jerry Huang 1
- Lei Ji 1
- Mingyang Jiang 1
- Mohsinul Kabir 1
- Haohang Li 1
- Hao Li 1
- Xiao Liang (梁霄) 1
- Shengyuan Lin 1
- Mingquan Lin 1
- Xiao-Yang Liu 1
- Xue Liu 1
- Xiao Liu 1
- Peng Lu 1
- Erxue Min 1
- Jian-Yun Nie 1
- Meikang Qiu 1
- Yang Ren 1
- Kaleb E. Smith 1
- Efstathia Soufleri 1
- Harry Stuart 1
- Md. Tariquzzaman 1
- Paul Thompson 1
- Xiaoyu Wang 1
- Keyi Wang 1
- Suyuchen Wang 1
- Ruoyu Xiang 1
- Zhuohan Xie 1
- Guojun Xiong 1
- Shanshan Yang 1
- Mao Yang 1
- Yangyang Yu 1
- Ye Yuan 1
- Vincent Jim Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1
- Tianlei Zhu 1
- Christine de Kock 1