Kaleb E. Smith
Also published as: Kaleb E Smith
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
2025
Does Biomedical Training Lead to Better Medical Performance?
Amin Dada | Osman Alperen Koraş | Marie Bauer | Jean-Philippe Corbeil | Amanda Butler Contreras | Constantin Marc Seibold | Kaleb E Smith | Julian Friedrich | Jens Kleesiek
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Large Language Models (LLMs) hold significant potential for improving healthcare applications, with biomedically adapted models promising enhanced performance on medical tasks. However, the effectiveness of biomedical domain adaptation for clinical tasks remains uncertain. In this study, we conduct a direct comparison of 12 biomedically adapted models and their general-domain base counterparts across six clinical tasks. Our results reveal that 11 out of 12 biomedical models exhibit performance declines, challenging prior findings that reported positive effects of biomedical adaptation. Notably, previous positive results primarily relied on multiple-choice evaluations, which may not reflect performance in real-world clinical applications. To promote reproducibility and further research, we open-source our evaluation pipeline, providing a resource for the development of models with practical benefits in healthcare settings.
FLAG-TRADER: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
Guojun Xiong | Zhiyang Deng | Keyi Wang | Yupeng Cao | Haohang Li | Yangyang Yu | Xueqing Peng | Mingquan Lin | Kaleb E Smith | Xiao-Yang Liu | Jimin Huang | Sophia Ananiadou | Qianqian Xie
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose FLAG-Trader, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
2024
Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding
Ahmad Idrissi-Yaghir | Amin Dada | Henning Schäfer | Kamyar Arzideh | Giulia Baldini | Jan Trienes | Max Hasin | Jeanette Bewersdorff | Cynthia S. Schmidt | Marie Bauer | Kaleb E. Smith | Jiang Bian | Yonghui Wu | Jörg Schlötterer | Torsten Zesch | Peter A. Horn | Christin Seifert | Felix Nensa | Jens Kleesiek | Christoph M. Friedrich
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are common. This paper explores strategies for adapting these models to domain-specific requirements, primarily through continuous pre-training on domain-specific data. We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data. The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering. Our results suggest that models augmented by clinical and translation-based pre-training typically outperform general domain models in medical contexts. We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch. Furthermore, pre-training on clinical data or leveraging translated texts have proven to be reliable methods for domain adaptation in medical NLP tasks.
Co-authors
- Sophia Ananiadou 2
- Marie Bauer 2
- Yupeng Cao 2
- Amin Dada 2
- Zhiyang Deng 2
- Jimin Huang 2
- Jens Kleesiek 2
- Haohang Li 2
- Mingquan Lin 2
- Xiao-Yang Liu 2
- Xueqing Peng 2
- Keyi Wang 2
- Qianqian Xie 2
- Guojun Xiong 2
- Yangyang Yu 2
- Kamyar Arzideh 1
- Giulia Baldini 1
- Jeanette Bewersdorff 1
- Jiang Bian 1
- Nuo Chen 1
- Xi Chen 1
- Arman Cohan 1
- Amanda Butler Contreras 1
- Jean-Philippe Corbeil 1
- Yun Feng 1
- Julian Friedrich 1
- Christoph M. Friedrich 1
- Heming Fu 1
- Penglei Gao 1
- Polydoros Giannouris 1
- Yuqing Guo 1
- Yi Han 1
- Max Hasin 1
- Yueru He 1
- Huan He 1
- Peter A. Horn 1
- Jerry Huang 1
- Ahmad Idrissi-Yaghir 1
- Mingyang Jiang 1
- Yuechen Jiang 1
- Osman Alperen Koraş 1
- Shengyuan Lin 1
- Zhiwei Liu 1
- Alejandro Lopez-Lira 1
- Peng Lu 1
- Felix Nensa 1
- Jian-Yun Nie 1
- Triantafillos Papadopoulos 1
- Lingfei Qian 1
- Meikang Qiu 1
- Yang Ren 1
- Jörg Schlötterer 1
- Cynthia S. Schmidt 1
- Henning Schäfer 1
- Constantin Marc Seibold 1
- Christin Seifert 1
- Efstathia Soufleri 1
- Jan Trienes 1
- Jun’ichi Tsujii 1
- Yan Wang 1
- Xiaoyu Wang 1
- Suyuchen Wang 1
- Yonghui Wu 1
- Ruoyu Xiang 1
- Shanshan Yang 1
- Torsten Zesch 1
- Vincent Jim Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1