LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
Harsh Saini, Md Tahmid Rahman Laskar, Cheng Chen, Elham Mohammadi, David Rossouw
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks in recent years. This has inspired researchers and practitioners in industry to build useful products by leveraging LLMs. However, before deploying LLM-based solutions for real-world usage, it is crucial to evaluate LLMs extensively, in terms of accuracy, memory management, and inference latency, while ensuring the reproducibility of the results. In addition, when evaluating LLMs on internal customer data, an on-premise evaluation system is necessary to protect customer privacy, rather than sending customer data to third-party APIs. In this paper, we demonstrate how we built an on-premise LLM evaluation system that addresses these challenges in real-world industrial settings, and we describe the complexities of consolidating various datasets, models, and inference-related artifacts in complex LLM inference pipelines. To this end, we also present a case study from a real-world industrial setting. This demonstration of LLM evaluation tool development should help researchers and practitioners build on-premise systems for LLM evaluation that ensure privacy, reliability, robustness, and reproducibility.
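The abstract reports measuring accuracy, memory usage, and inference latency on-premise, with reproducible results and no data sent to third-party APIs. A minimal sketch of such an evaluation loop might look like the following; this is purely illustrative, not the paper's actual tool, and the local `MODEL_PATH`, the toy `eval_samples` set, and the substring-match accuracy metric are all assumptions introduced here:

```python
# Illustrative sketch of an on-premise LLM evaluation loop; MODEL_PATH,
# eval_samples, and the substring-match metric are assumptions, not the
# paper's actual API.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/local-checkpoint"  # local weights; no third-party API calls
eval_samples = [  # toy stand-in for an internal customer dataset
    {"prompt": "Q: What is 2 + 2?\nA:", "reference": "4"},
]

torch.manual_seed(0)  # pin the seed so reruns are reproducible
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

correct, latencies = 0, []
for sample in eval_samples:
    inputs = tokenizer(sample["prompt"], return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    latencies.append(time.perf_counter() - start)  # per-sample inference latency
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    correct += int(sample["reference"] in completion)  # crude accuracy proxy

print(f"accuracy={correct / len(eval_samples):.2f}")
print(f"mean latency={sum(latencies) / len(latencies):.3f}s")
if torch.cuda.is_available():  # GPU memory high-water mark; CPU runs would need e.g. psutil
    print(f"peak GPU memory={torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```

Per the abstract, the actual system presumably generalizes loops like this into configurable pipelines spanning many datasets, models, and inference artifacts.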
- Anthology ID: 2025.coling-industry.24
- Volume: Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
- Month: January
- Year: 2025
- Address: Abu Dhabi, UAE
- Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 286–294
- URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-industry.24/
- Cite (ACL): Harsh Saini, Md Tahmid Rahman Laskar, Cheng Chen, Elham Mohammadi, and David Rossouw. 2025. LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 286–294, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal): LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models (Saini et al., COLING 2025)
- PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-industry.24.pdf