Jimin Huang
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation.To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap.Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
Yuechen Jiang | Zhiwei Liu | Yupeng Cao | Yueru He | Ziyang Xu | Chen Xu | Zhiyang Deng | Prayag Tiwari | Xi Chen | Alejandro Lopez-Lira | Jimin Huang | Junichi Tsujii | Sophia Ananiadou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuechen Jiang | Zhiwei Liu | Yupeng Cao | Yueru He | Ziyang Xu | Chen Xu | Zhiyang Deng | Prayag Tiwari | Xi Chen | Alejandro Lopez-Lira | Jimin Huang | Junichi Tsujii | Sophia Ananiadou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce RFC-Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC-Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original–perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC-Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
Yan Wang | Yitao Xu | Nanhan Shen | Jinyan Su | Jimin Huang | Zining Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yan Wang | Yitao Xu | Nanhan Shen | Jinyan Su | Jimin Huang | Zining Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. Crucially, this inherent bias indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model’s natural optimization path, thereby limiting training efficiency and performance.
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection tasks (MFMD). In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project is available at https://github.com/lzw108/FMD.
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event–cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event–cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
2025
FLAG-TRADER: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
Guojun Xiong | Zhiyang Deng | Keyi Wang | Yupeng Cao | Haohang Li | Yangyang Yu | Xueqing Peng | Mingquan Lin | Kaleb E Smith | Xiao-Yang Liu | Jimin Huang | Sophia Ananiadou | Qianqian Xie
Findings of the Association for Computational Linguistics: ACL 2025
Guojun Xiong | Zhiyang Deng | Keyi Wang | Yupeng Cao | Haohang Li | Yangyang Yu | Xueqing Peng | Mingquan Lin | Kaleb E Smith | Xiao-Yang Liu | Jimin Huang | Sophia Ananiadou | Qianqian Xie
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose FLAG-Trader, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
Chung-Chi Chen | Antonio Moreno-Sandoval | Jimin Huang | Qianqian Xie | Sophia Ananiadou | Hsin-Hsi Chen
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
Chung-Chi Chen | Antonio Moreno-Sandoval | Jimin Huang | Qianqian Xie | Sophia Ananiadou | Hsin-Hsi Chen
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent
Haohang Li | Yupeng Cao | Yangyang Yu | Shashidhar Reddy Javaji | Zhiyang Deng | Yueru He | Yuechen Jiang | Zining Zhu | K.p. Subbalakshmi | Jimin Huang | Lingfei Qian | Xueqing Peng | Jordan W. Suchow | Qianqian Xie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haohang Li | Yupeng Cao | Yangyang Yu | Shashidhar Reddy Javaji | Zhiyang Deng | Yueru He | Yuechen Jiang | Zining Zhu | K.p. Subbalakshmi | Jimin Huang | Lingfei Qian | Xueqing Peng | Jordan W. Suchow | Qianqian Xie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks and cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents’ performance across various scenarios.
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang | Yifei Zhang | Yan Hu | Yilin Guo | Ruoli Gan | Yueru He | Mingcong Lei | Xiao Zhang | Haining Wang | Qianqian Xie | Jimin Huang | Honghai Yu | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Yuzhe Yang | Yifei Zhang | Yan Hu | Yilin Guo | Ruoli Gan | Yueru He | Mingcong Lei | Xiao Zhang | Haining Wang | Qianqian Xie | Jimin Huang | Honghai Yu | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
This paper introduces the UCFE: User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. Firstly, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Secondly, based on this feedback, we created our dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLMs services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
UCL-Bench: A Chinese User-Centric Legal Benchmark for Large Language Models
Ruoli Gan | Duanyu Feng | Chen Zhang | Zhihang Lin | Haochen Jia | Hao Wang | Zhenyang Cai | Lei Cui | Qianqian Xie | Jimin Huang | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Ruoli Gan | Duanyu Feng | Chen Zhang | Zhihang Lin | Haochen Jia | Hao Wang | Zhenyang Cai | Lei Cui | Qianqian Xie | Jimin Huang | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Existing legal benchmarks focusing on knowledge and logic effectively evaluate LLMs on various tasks in legal domain. However, few have explored the practical application of LLMs by actual users. To further assess whether LLMs meet the specific needs of legal practitioners in real-world scenarios, we introduce UCL-Bench, a Chinese User-Centric Legal Benchmark, comprising 22 tasks across 5 distinct legal scenarios.To build the UCL-Bench, we conduct a user survey targeting legal professionals to understand their needs and challenges. Based on the survey results, we craft tasks, verified by legal professionals, and categorized them according to Bloom’s taxonomy. Each task in UCL-Bench mirrors real-world legal scenarios, and instead of relying on pre-defined answers, legal experts provide detailed answer guidance for each task, incorporating both “information” and “needs” elements to mimic the complexities of legal practice. With the guidance, we use GPT-4 as the user simulator and evaluator, enabling multi-turn dialogues as a answer guidance based evaluation framework. Our findings reveal that many recent open-source general models achieve the highest performance, suggesting that they are well-suited to address the needs of legal practitioners. However, these legal LLMs do not outperform ChatGPT, indicating a need for training strategies aligned with users’ needs. Furthermore, we find that the most effective models are able to address legal issues within fewer dialogue turns, highlighting the importance of concise and accurate responses in achieving high performance. The code and dataset are available at https://github.com/wittenberg11/UCL-bench.
Selective Preference Optimization via Token-Level Reward Function Estimation
Kailai Yang | Zhiwei Liu | Qianqian Xie | Jimin Huang | Erxue Min | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Kailai Yang | Zhiwei Liu | Qianqian Xie | Jimin Huang | Erxue Min | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advancements in LLM alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection without requiring strong, fine-grained supervision signals. We theoretically prove the feasibility of Direct Preference Optimization (DPO) as token-level reward function estimators, which applies to any existing alignment datasets and enables cost-efficient token selection with small-scale model sizes and training data. We then train an oracle model with DPO on the target data and utilize the estimated reward function to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing on 30% key tokens with up to 60% reduction in GPU training hours. We also explore SePO as a new paradigm for weak-to-strong generalization, showing that weak oracle models effectively supervise strong policy models with up to 16.8 more parameters. SePO also selects useful supervision signals from out-of-distribution data, alleviating the over-optimization problem.
LAiW: A Chinese Legal Large Language Models Benchmark
Yongfu Dai | Duanyu Feng | Jimin Huang | Haochen Jia | Qianqian Xie | Yifang Zhang | Weiguang Han | Wei Tian | Hao Wang
Proceedings of the 31st International Conference on Computational Linguistics
Yongfu Dai | Duanyu Feng | Jimin Huang | Haochen Jia | Qianqian Xie | Yifang Zhang | Weiguang Han | Wei Tian | Hao Wang
Proceedings of the 31st International Conference on Computational Linguistics
General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI. However, their current evaluations lack alignment with the fundamental logic of legal reasoning, the legal syllogism. This hinders trust and understanding from legal experts. To bridge this gap, we introduce LAiW, the Chinese legal LLM benchmark structured around the legal syllogism. We evaluate legal LLMs across three levels of capability, each reflecting a progressively more complex stage of legal syllogism: fundamental information retrieval, legal principles inference, and advanced legal applications, and encompassing a wide range of tasks in different legal scenarios. Our automatic evaluation reveals that LLMs, despite their ability to answer complex legal questions, lack the inherent logical processes of the legal syllogism. This limitation poses a barrier to acceptance by legal professionals. Furthermore, manual evaluation with legal experts confirms this issue and highlights the importance of pre-training on legal text to enhance the legal syllogism of LLMs. Future research may prioritize addressing this gap to unlock the full potential of LLMs in legal applications.
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
Xueqing Peng | Triantafillos Papadopoulos | Efstathia Soufleri | Polydoros Giannouris | Ruoyu Xiang | Yan Wang | Lingfei Qian | Jimin Huang | Qianqian Xie | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Xueqing Peng | Triantafillos Papadopoulos | Efstathia Soufleri | Polydoros Giannouris | Ruoyu Xiang | Yan Wang | Lingfei Qian | Jimin Huang | Qianqian Xie | Sophia Ananiadou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Despite Greece’s pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. While multilingual financial NLP has revealed large performance gaps across languages, no benchmarks or LLMs have been tailored for Greek financial tasks until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the first financial LLM fine-tuned on Greek-specific financial data. Plutus-ben addresses six core tasks: numeric/textual named entity recognition, question answering, extractive summarization, abstractive summarization, and topic classification. To support these tasks, we release four new expert-annotated Greek financial datasets and incorporate two existing resources. Our comprehensive evaluation of 24 LLMs reveals persistent challenges in Greek financial NLP, driven by linguistic complexity, domain terminology, and financial reasoning gaps. Experiment results underscore the limitations of cross-lingual transfer and the need for Greek-specific financial modeling. We publicly release Plutus-ben, Plutus-8B, and all associated datasets to promote reproducible research and advance multilingual financial NLP.
ELAINE-medLLM: Lightweight English Japanese Chinese Trilingual Large Language Model for Bio-medical Domain
Ken Yano | Zheheng Luo | Jimin Huang | Qianqian Xie | Masaki Asada | Chenhan Yuan | Kailai Yang | Makoto Miwa | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the 31st International Conference on Computational Linguistics
Ken Yano | Zheheng Luo | Jimin Huang | Qianqian Xie | Masaki Asada | Chenhan Yuan | Kailai Yang | Makoto Miwa | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the 31st International Conference on Computational Linguistics
We propose ELAINE (EngLish-jApanese-chINesE)-medLLM, a trilingual (English, Japanese, Chinese) large language model adapted for the bio-medical domain based on Llama-3-8B. The training dataset was carefully curated in terms of volume and diversity to adapt to the biomedical domain and endow trilingual capability while preserving the knowledge and abilities of the base model. The training follows 2-stage paths: continued pre-training and supervised fine-tuning (SFT). Our results demonstrate that ELAINE-medLLM exhibits superior trilingual capabilities compared to existing bilingual or multilingual medical LLMs without severely sacrificing the base model’s capability.
2024
FinNLP-AgentScen-2024 Shared Task: Financial Challenges in Large Language Models - FinLLMs
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
HealMe: Harnessing Cognitive Reframing in Large Language Models for Psychotherapy
Mengxi Xiao | Qianqian Xie | Ziyan Kuang | Zhicheng Liu | Kailai Yang | Min Peng | Weiguang Han | Jimin Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mengxi Xiao | Qianqian Xie | Ziyan Kuang | Zhicheng Liu | Kailai Yang | Min Peng | Weiguang Han | Jimin Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) can play a vital role in psychotherapy by adeptly handling the crucial task of cognitive reframing and overcoming challenges such as shame, distrust, therapist skill variability, and resource scarcity. Previous LLMs in cognitive reframing mainly converted negative emotions to positive ones, but these approaches have limited efficacy, often not promoting clients’ self-discovery of alternative perspectives. In this paper, we unveil the Helping and Empowering through Adaptive Language in Mental Enhancement (HealMe) model. This novel cognitive reframing therapy method effectively addresses deep-rooted negative thoughts and fosters rational, balanced perspectives. Diverging from traditional LLM methods, HealMe employs empathetic dialogue based on psychotherapeutic frameworks. It systematically guides clients through distinguishing circumstances from feelings, brainstorming alternative viewpoints, and developing empathetic, actionable suggestions. Moreover, we adopt the first comprehensive and expertly crafted psychological evaluation metrics, specifically designed to rigorously assess the performance of cognitive reframing, in both AI-simulated dialogues and real-world therapeutic conversations. Experimental results show that our model outperforms others in terms of empathy, guidance, and logical coherence, demonstrating its effectiveness and potential positive impact on psychotherapy.
AuditWen: An Open-Source Large Language Model for Audit
Jiajia Huang | Haoran Zhu | Chao Xu | Tianming Zhan | Qianqian Xie | Jimin Huang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Jiajia Huang | Haoran Zhu | Chao Xu | Tianming Zhan | Qianqian Xie | Jimin Huang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
“Intelligent auditing represents a crucial advancement in modern audit practices, enhancing boththe quality and efficiency of audits within the realm of artificial intelligence. With the rise oflarge language model (LLM), there is enormous potential for intelligent models to contribute toaudit domain. However, general LLMs applied in audit domain face the challenges of lackingspecialized knowledge and the presence of data biases. To overcome these challenges, this studyintroduces AuditWen, an open-source audit LLM by fine-tuning Qwen with constructing instruc-tion data from audit domain. We first outline the application scenarios for LLMs in the audit andextract requirements that shape the development of LLMs tailored for audit purposes. We thenpropose an audit LLM, called AuditWen, by fine-tuning Qwen with constructing 30k instructiondataset from 15 audit tasks and 3 layers. In evaluation stage, we proposed a benchmark with 5kinstructions that covers a set of critical audit tasks derived from the application scenarios. Withthe benchmark, we compare AuditWen with other existing LLMs from information extraction,question answering and document generation. The experimental results demonstrate superiorperformance of AuditWen both in question understanding and answer generation, making it animmediately valuable tool for audit.Keyword AuditWen, LLM, instruction dataset, fine-tuning, benchmarkIntroduction”
2022
GRETEL: Graph Contrastive Topic Enhanced Language Model for Long Document Extractive Summarization
Qianqian Xie | Jimin Huang | Tulika Saha | Sophia Ananiadou
Proceedings of the 29th International Conference on Computational Linguistics
Qianqian Xie | Jimin Huang | Tulika Saha | Sophia Ananiadou
Proceedings of the 29th International Conference on Computational Linguistics
Recently, neural topic models (NTMs) have been incorporated into pre-trained language models (PLMs), to capture the global semantic information for text summarization. However, in these methods, there remain limitations in the way they capture and integrate the global semantic information. In this paper, we propose a novel model, the graph contrastive topic enhanced language model (GRETEL), that incorporates the graph contrastive topic model with the pre-trained language model, to fully leverage both the global and local contextual semantics for long document extractive summarization. To better capture and incorporate the global semantic information into PLMs, the graph contrastive topic model integrates the hierarchical transformer encoder and the graph contrastive learning to fuse the semantic information from the global document context and the gold summary. To this end, GRETEL encourages the model to efficiently extract salient sentences that are topically related to the gold summary, rather than redundant sentences that cover sub-optimal topics. Experimental results on both general domain and biomedical datasets demonstrate that our proposed method outperforms SOTA methods.
2021
Graph Relational Topic Model with Higher-order Graph Attention Auto-encoders
Qianqian Xie | Jimin Huang | Pan Du | Min Peng
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Qianqian Xie | Jimin Huang | Pan Du | Min Peng
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Inductive Topic Variational Graph Auto-Encoder for Text Classification
Qianqian Xie | Jimin Huang | Pan Du | Min Peng | Jian-Yun Nie
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Qianqian Xie | Jimin Huang | Pan Du | Min Peng | Jian-Yun Nie
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Graph convolutional networks (GCNs) have been applied recently to text classification and produced an excellent performance. However, existing GCN-based methods do not assume an explicit latent semantic structure of documents, making learned representations less effective and difficult to interpret. They are also transductive in nature, thus cannot handle out-of-graph documents. To address these issues, we propose a novel model named inductive Topic Variational Graph Auto-Encoder (T-VGAE), which incorporates a topic model into variational graph-auto-encoder (VGAE) to capture the hidden semantic information between documents and words. T-VGAE inherits the interpretability of the topic model and the efficient information propagation mechanism of VGAE. It learns probabilistic representations of words and documents by jointly encoding and reconstructing the global word-level graph and bipartite graphs of documents, where each document is considered individually and decoupled from the global correlation graph so as to enable inductive learning. Our experiments on several benchmark datasets show that our method outperforms the existing competitive models on supervised and semi-supervised text classification, as well as unsupervised text representation learning. In addition, it has higher interpretability and is able to deal with unseen documents.
2018
Neural Sparse Topical Coding
Min Peng | Qianqian Xie | Yanchun Zhang | Hua Wang | Xiuzhen Zhang | Jimin Huang | Gang Tian
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Min Peng | Qianqian Xie | Yanchun Zhang | Hua Wang | Xiuzhen Zhang | Jimin Huang | Gang Tian
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Topic models with sparsity enhancement have been proven to be effective at learning discriminative and coherent latent topics of short texts, which is critical to many scientific and engineering applications. However, the extensions of these models require carefully tailored graphical models and re-deduced inference algorithms, limiting their variations and applications. We propose a novel sparsity-enhanced topic model, Neural Sparse Topical Coding (NSTC) base on a sparsity-enhanced topic model called Sparse Topical Coding (STC). It focuses on replacing the complex inference process with the back propagation, which makes the model easy to explore extensions. Moreover, the external semantic information of words in word embeddings is incorporated to improve the representation of short texts. To illustrate the flexibility offered by the neural network based framework, we present three extensions base on NSTC without re-deduced inference algorithms. Experiments on Web Snippet and 20Newsgroups datasets demonstrate that our models outperform existing methods.
Search
Fix author
Co-authors
- Qianqian Xie 17
- Sophia Ananiadou 12
- Xueqing Peng 7
- Yupeng Cao 5
- Yueru He 5
- Yuechen Jiang 5
- Min Peng 5
- Lingfei Qian 5
- Zhiyang Deng 4
- Haohang Li 4
- Zhiwei Liu 4
- Alejandro Lopez-Lira 4
- Kailai Yang 4
- Yangyang Yu 4
- Duanyu Feng 3
- Polydoros Giannouris 3
- Xiao-Yang Liu 3
- Triantafillos Papadopoulos 3
- Jun’ichi Tsujii 3
- Yan Wang 3
- Benyou Wang 3
- Hao Wang 3
- Ruoyu Xiang 3
- Zhuohan Xie 3
- Chen Xu 3
- Xi Chen 2
- Pan Du 2
- Rania Elbadry 2
- Ruoli Gan 2
- Weiguang Han 2
- Haochen Jia 2
- Salem Lahlou 2
- Mingquan Lin 2
- Zheheng Luo 2
- Preslav Nakov 2
- Jian-Yun Nie 2
- Kaleb E. Smith 2
- Efstathia Soufleri 2
- Veselin Stoyanov 2
- Jinyan Su 2
- Prayag Tiwari 2
- Keyi Wang 2
- Yuxia Wang 2
- Mengxi Xiao 2
- Guojun Xiong 2
- Ziyang Xu 2
- Chenhan Yuan 2
- Zining Zhu 2
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhra AlMahri 1
- Saeed Almheiri 1
- Abdulrazzaq Alnajjar 1
- Masaki Asada 1
- Debopriyo Banerjee 1
- Dani Bouch 1
- Zhenyang Cai 1
- Tanmoy Chakraborty 1
- Nuo Chen 1
- Zhengyu Chen 1
- Ming-Bin Chen 1
- Chung-Chi Chen 1
- Hsin-Hsi Chen 1
- Arman Cohan 1
- Lei Cui 1
- Yongfu Dai 1
- Yun Feng 1
- Heming Fu 1
- Penglei Gao 1
- Georgi Nenkov Georgiev 1
- Yuqing Guo 1
- Yilin Guo 1
- Yi Han 1
- Huan He 1
- Ahmed Heakl 1
- Yan Hu 1
- Jerry Huang 1
- Jiajia Huang 1
- Shashidhar Reddy Javaji 1
- Mingyang Jiang 1
- Mohsinul Kabir 1
- Marwa Elsaid Khalil 1
- Fajri Koto 1
- Ivan Koychev 1
- Ziyan Kuang 1
- Yanzhao Lai 1
- Mingcong Lei 1
- Dong Li 1
- Haonan Li 1
- Shengyuan Lin 1
- Zhihang Lin 1
- Zhiwei Liu 1
- Zhicheng Liu 1
- Xue Liu 1
- Peng Lu 1
- Hachem Madmoun 1
- Erxue Min 1
- Makoto Miwa 1
- Antonio Moreno-Sandoval 1
- Daniil Orel 1
- Meikang Qiu 1
- Yang Ren 1
- Tulika Saha 1
- Dhruv Sahnan 1
- Nanhan Shen 1
- Aaryamonvikram Singh 1
- Vijayasai Somasundaram 1
- Harry Stuart 1
- K.p. Subbalakshmi 1
- Jordan Suchow 1
- Jordan W. Suchow 1
- Md. Tariquzzaman 1
- Rushil Thareja 1
- Paul Thompson 1
- Wei Tian 1
- Gang Tian 1
- Xiaoyu Wang 1
- Suyuchen Wang 1
- Haining Wang 1
- Hua Wang 1
- Yan Wang 1
- Rui Xing 1
- Yitao Xu 1
- Chao Xu 1
- Shanshan Yang 1
- Yuzhe Yang 1
- Ken Yano 1
- Honghai Yu 1
- Ye Yuan 1
- Tianming Zhan 1
- Vincent Jim Zhang 1
- Fan Zhang 1
- Yifei Zhang 1
- Xiao Zhang 1
- Chen Zhang 1
- Yifang Zhang 1
- Yanchun Zhang 1
- Xiuzhen (Jenny) Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1
- Tianlei Zhu 1
- Haoran Zhu 1