Yueru He
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
Yuechen Jiang | Zhiwei Liu | Yupeng Cao | Yueru He | Ziyang Xu | Chen Xu | Zhiyang Deng | Prayag Tiwari | Xi Chen | Alejandro Lopez-Lira | Jimin Huang | Junichi Tsujii | Sophia Ananiadou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce RFC-Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC-Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original–perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC-Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosure
Fan Zhang | Mingzi Song | Rania Elbadry | Yankai Chen | Shaobo Wang | Yixi Zhou | Xunwen Zheng | Yueru He | Yuyang Dai | Georgi Nenkov Georgiev | Ayesha Gull | Muhammad Usman Safder | Fan Wu | Liyuan Meng | Fengxian Ji | Junning Zhao | Xueqing Peng | Jimin Huang | YU Chen | Xue Liu | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. To bridge this gap, we present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.
2025
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang | Yifei Zhang | Yan Hu | Yilin Guo | Ruoli Gan | Yueru He | Mingcong Lei | Xiao Zhang | Haining Wang | Qianqian Xie | Jimin Huang | Honghai Yu | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
This paper introduces UCFE, the User-Centric Financial Expertise benchmark, a framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent
Haohang Li | Yupeng Cao | Yangyang Yu | Shashidhar Reddy Javaji | Zhiyang Deng | Yueru He | Yuechen Jiang | Zining Zhu | K.p. Subbalakshmi | Jimin Huang | Lingfei Qian | Xueqing Peng | Jordan W. Suchow | Qianqian Xie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks and cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents’ performance across various scenarios.
2024
FinNLP-AgentScen-2024 Shared Task: Financial Challenges in Large Language Models - FinLLMs
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
Co-authors
- Jimin Huang 6
- Yuechen Jiang 4
- Qianqian Xie 4
- Sophia Ananiadou 3
- Yupeng Cao 3
- Zhiyang Deng 3
- Haohang Li 3
- Alejandro Lopez-Lira 3
- Xueqing Peng 3
- Yangyang Yu 3
- Xi Chen 2
- Xiao-Yang Liu 2
- Zhiwei Liu 2
- Lingfei Qian 2
- Jun’ichi Tsujii 2
- Benyou Wang 2
- Ruoyu Xiang 2
- Nuo Chen 1
- YU Chen (陈昱) 1
- Yankai Chen 1
- Zhengyu Chen 1
- Arman Cohan 1
- Yuyang Dai 1
- Rania Elbadry 1
- Duanyu Feng 1
- Yun Feng 1
- Heming Fu 1
- Ruoli Gan 1
- Penglei Gao 1
- Georgi Nenkov Georgiev 1
- Polydoros Giannouris 1
- Ayesha Gull 1
- Yilin Guo 1
- Yuqing Guo 1
- Yi Han 1
- Huan He 1
- Yan Hu 1
- Jerry Huang 1
- Shashidhar Reddy Javaji 1
- Fengxian Ji 1
- Mingyang Jiang 1
- Yanzhao Lai 1
- Mingcong Lei 1
- Dong Li 1
- Mingquan Lin 1
- Shengyuan Lin 1
- Xue Liu 1
- Zhiwei Liu 1
- Peng Lu 1
- Zheheng Luo 1
- Liyuan Meng 1
- Preslav Nakov 1
- Jian-Yun Nie 1
- Triantafillos Papadopoulos 1
- Min Peng 1
- Meikang Qiu 1
- Yang Ren 1
- Muhammad Usman Safder 1
- Kaleb E. Smith 1
- Vijayasai Somasundaram 1
- Mingzi Song 1
- Efstathia Soufleri 1
- K.p. Subbalakshmi 1
- Jordan Suchow 1
- Jordan W. Suchow 1
- Prayag Tiwari 1
- Haining Wang 1
- Hao Wang 1
- Keyi Wang 1
- Shaobo Wang 1
- Suyuchen Wang 1
- Xiaoyu Wang 1
- Yan Wang 1
- Fan Wu 1
- Mengxi Xiao 1
- Zhuohan Xie 1
- Guojun Xiong 1
- Chen Xu 1
- Ziyang Xu 1
- Kailai Yang 1
- Shanshan Yang 1
- Yuzhe Yang 1
- Honghai Yu 1
- Chenhan Yuan 1
- Fan Zhang 1
- Vincent Jim Zhang 1
- Xiao Zhang 1
- Yifei Zhang 1
- Jeff Zhao 1
- Junning Zhao 1
- Yijia Zhao 1
- Yilun Zhao 1
- Xunwen Zheng 1
- Yixi Zhou 1
- Zining Zhu 1