LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
Harsh Saini, Md Tahmid Rahman Laskar, Cheng Chen, Elham Mohammadi, David Rossouw
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks in recent years. This has inspired researchers and practitioners in industry to build useful products by leveraging LLMs. However, before deploying LLM-based solutions for real-world usage, it is crucial to evaluate LLMs extensively, in terms of accuracy, memory management, and inference latency, while ensuring the reproducibility of the results. In addition, when evaluating LLMs on internal customer data, an on-premise evaluation system is necessary to protect customer privacy, rather than sending customer data to third-party APIs. In this paper, we demonstrate how we built an on-premise LLM evaluation system that addresses these challenges in real-world industrial settings, and we describe the complexities of consolidating various datasets, models, and inference-related artifacts in complex LLM inference pipelines. To this end, we also present a case study from a real-world industrial setting. This demonstration of LLM evaluation tool development should help researchers and practitioners build on-premise systems for LLM evaluation that ensure privacy, reliability, robustness, and reproducibility.
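The abstract reports measuring accuracy, memory usage, and inference latency on-premise, with reproducible results and no data sent to third-party APIs. A minimal sketch of such an evaluation loop might look like the following; this is purely illustrative, not the paper's actual tool, and the local `MODEL_PATH`, the toy `eval_samples` set, and the substring-match accuracy metric are all assumptions introduced here:

```python
# Illustrative sketch of an on-premise LLM evaluation loop; MODEL_PATH,
# eval_samples, and the substring-match metric are assumptions, not the
# paper's actual API.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/local-checkpoint"  # local weights; no third-party API calls
eval_samples = [  # toy stand-in for an internal customer dataset
    {"prompt": "Q: What is 2 + 2?\nA:", "reference": "4"},
]

torch.manual_seed(0)  # pin the seed so reruns are reproducible
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

correct, latencies = 0, []
for sample in eval_samples:
    inputs = tokenizer(sample["prompt"], return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    latencies.append(time.perf_counter() - start)  # per-sample inference latency
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    correct += int(sample["reference"] in completion)  # crude accuracy proxy

print(f"accuracy={correct / len(eval_samples):.2f}")
print(f"mean latency={sum(latencies) / len(latencies):.3f}s")
if torch.cuda.is_available():  # GPU memory high-water mark; CPU runs would need e.g. psutil
    print(f"peak GPU memory={torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```

Per the abstract, the actual system presumably generalizes loops like this into configurable pipelines spanning many datasets, models, and inference artifacts.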
- Anthology ID: 2025.coling-industry.24
- Volume: Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
- Month: January
- Year: 2025
- Address: Abu Dhabi, UAE
- Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 286–294
- URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-industry.24/
- Cite (ACL): Harsh Saini, Md Tahmid Rahman Laskar, Cheng Chen, Elham Mohammadi, and David Rossouw. 2025. LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 286–294, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal): LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models (Saini et al., COLING 2025)
- PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-industry.24.pdf