Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications
David Cecchini, Arshaan Nazir, Kalyan Chakravarthy, Veysel Kocaman
Abstract
Large Language Models (LLMs) are widely used in real-world applications. However, as LLMs evolve and new datasets are released, it becomes crucial to build processes to evaluate and control the models' performance. In this paper, we describe how to add Robustness, Accuracy, and Toxicity scores to model comparison tables, or leaderboards. We discuss the evaluation metrics and the approaches considered, and present the results of the first evaluation round for model Robustness, Accuracy, and Toxicity scores. Our results show that GPT-4 achieves top performance on the robustness and accuracy tests, while Llama 2 achieves top performance on the toxicity test. We note that newer open-source models such as OpenChat 3.5 and Neural Chat 7B can also perform well across these three test categories. Finally, we plan to add domain-specific tests and models to the leaderboard to allow for a more detailed evaluation of models in specific areas such as healthcare, legal, and finance.
- Anthology ID:
- 2024.trustnlp-1.11
- Volume: Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
- Month: June
- Year: 2024
- Address: Mexico City, Mexico
- Editors: Anaelia Ovalle, Kai-Wei Chang, Yang Trista Cao, Ninareh Mehrabi, Jieyu Zhao, Aram Galstyan, Jwala Dhamala, Anoop Kumar, Rahul Gupta
- Venues: TrustNLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 109–117
- URL: https://aclanthology.org/2024.trustnlp-1.11
- DOI: 10.18653/v1/2024.trustnlp-1.11
- Cite (ACL): David Cecchini, Arshaan Nazir, Kalyan Chakravarthy, and Veysel Kocaman. 2024. Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pages 109–117, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal): Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications (Cecchini et al., TrustNLP-WS 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.trustnlp-1.11.pdf
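The robustness scoring the abstract describes can be illustrated with a minimal sketch: apply simple text perturbations to each input and report the pass rate, i.e. the fraction of cases where the model's prediction is unchanged. The perturbations, the toy model, and the function names below are illustrative assumptions, not the paper's actual implementation or metric definitions.

```python
# Hedged sketch of a robustness score as a pass rate under perturbations.
# swap_case / add_trailing_punct are assumed example perturbations, not
# the paper's test suite.

def swap_case(text: str) -> str:
    # Perturbation 1: flip upper/lower case of every character.
    return text.swapcase()

def add_trailing_punct(text: str) -> str:
    # Perturbation 2: append extra punctuation.
    return text + "!!"

def robustness_score(model, inputs, perturbations):
    """Share of (input, perturbation) pairs whose output matches the
    model's output on the unperturbed input."""
    total = passed = 0
    for x in inputs:
        baseline = model(x)
        for perturb in perturbations:
            total += 1
            if model(perturb(x)) == baseline:
                passed += 1
    return passed / total if total else 0.0

# Toy "model": predicts positive iff the word "good" appears (case-insensitive).
toy_model = lambda s: "good" in s.lower()

score = robustness_score(
    toy_model,
    ["a good movie", "a bad movie"],
    [swap_case, add_trailing_punct],
)
print(score)  # 1.0: this toy model is invariant to both perturbations
```

Accuracy and toxicity scores would follow the same leaderboard pattern, replacing the pass criterion with agreement against gold labels or a toxicity classifier's verdict, respectively.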