Mikio Takeuchi


2025

Evaluating Large Language Models with Enterprise Benchmarks
Bing Zhang | Mikio Takeuchi | Ryo Kawahara | Shubhi Asthana | Md. Maruf Hossain | Guang-Jie Ren | Kate Soule | Yifan Mai | Yada Zhu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

The advancement of large language models (LLMs) has made rigorous and systematic evaluation of complex tasks increasingly challenging, especially in enterprise applications. LLMs therefore need to be benchmarked with enterprise datasets across a variety of NLP tasks. This work explores benchmarking strategies for LLM evaluation, with a specific emphasis on both English and Japanese. The proposed evaluation framework encompasses 25 publicly available domain-specific English benchmarks from diverse enterprise domains such as financial services, legal, climate, and cyber security, along with 2 public Japanese finance benchmarks. The varied performance of 8 models across these enterprise tasks highlights the importance of selecting the right model for the specific requirements of each task. Code and prompts are available on GitHub.