What Factors Influence LLMs’ Judgments? A Case Study on Question Answering
Lei Chen, Bobo Li, Li Zheng, Haining Wang, Zixiang Meng, Runfeng Shi, Hao Fei, Jun Zhou, Fei Li, Chong Teng, Donghong Ji
Abstract
Large Language Models (LLMs) are increasingly used as efficient judges to evaluate the quality of answers generated by candidate models. However, their judgments may be influenced by complex scenarios and inherent biases, raising concerns about their reliability. This study addresses this gap by examining the performance of LLMs as judges along four previously unexplored factors: answer quantity, inducing statements, judging strategy, and judging style. Additionally, we introduce question difficulty as a new dimension to provide a more comprehensive understanding of LLMs' judgments across questions of varying intricacy. We employ ChatGPT, GPT-4, Gemini, and Claude-2 as judges and conduct experiments on the Vicuna Benchmark and MT-bench. Our study reveals that LLMs' judging abilities are susceptible to the influence of all four factors, and that analysis along the newly proposed dimension of question difficulty is highly necessary. We also provide valuable insights into optimizing LLMs' performance as judges, enhancing their reliability and adaptability across diverse evaluation scenarios.
- Anthology ID:
- 2024.lrec-main.1519
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 17473–17485
- URL:
- https://aclanthology.org/2024.lrec-main.1519
- Cite (ACL):
- Lei Chen, Bobo Li, Li Zheng, Haining Wang, Zixiang Meng, Runfeng Shi, Hao Fei, Jun Zhou, Fei Li, Chong Teng, and Donghong Ji. 2024. What Factors Influence LLMs’ Judgments? A Case Study on Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17473–17485, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- What Factors Influence LLMs’ Judgments? A Case Study on Question Answering (Chen et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/2024.lrec-main.1519.pdf