Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
Xiaojun Wu | Junxi Liu | Huan-Yi Su | Zhouchi Lin | Yiyan Qi | Chengjin Xu | Jiajun Su | Jiajie Zhong | Fuwei Wang | Saizhuo Wang | Fengrui Hua | Jia Li | Jian Guo
Findings of the Association for Computational Linguistics: EMNLP 2025
As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides their future development and optimization. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.