Revisiting the Self-Consistency Challenges in Multi-Choice Question Formats for Large Language Model Evaluation
Wenjie Zhou, Qiang Wang, Mingzhou Xu, Ming Chen, Xiangyu Duan
Abstract
Multi-choice questions (MCQ) are a common method for assessing the world knowledge of large language models (LLMs), as exemplified by benchmarks such as MMLU and C-Eval. However, recent findings indicate that even top-tier LLMs, such as ChatGPT and GPT-4, may display inconsistencies when faced with slightly varied inputs, raising concerns about the credibility of MCQ-based evaluations. To address this issue, we introduce three knowledge-equivalent question variants: option position shuffling, option label replacement, and conversion to a True/False format. We rigorously test a range of LLMs varying in model size (from 6B to 70B) and type: pretrained language model (PLM), supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). Our findings on MMLU and C-Eval reveal that accuracy on individual questions lacks robustness, particularly for smaller models (<30B) and PLMs. Consequently, we advocate consistent accuracy as a more reliable metric for evaluating and ranking LLMs.
- Anthology ID:
- 2024.lrec-main.1229
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 14103–14110
- URL:
- https://aclanthology.org/2024.lrec-main.1229
- Cite (ACL):
- Wenjie Zhou, Qiang Wang, Mingzhou Xu, Ming Chen, and Xiangyu Duan. 2024. Revisiting the Self-Consistency Challenges in Multi-Choice Question Formats for Large Language Model Evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14103–14110, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Revisiting the Self-Consistency Challenges in Multi-Choice Question Formats for Large Language Model Evaluation (Zhou et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.1229.pdf
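The three knowledge-equivalent variants and the consistent-accuracy metric described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation: all function names are hypothetical, and predictions are assumed to be mapped back to the original option indices before scoring.

```python
import random

def shuffle_options(options, answer_idx, seed=0):
    """Variant 1: permute the option positions, tracking the gold answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)

def relabel_options(options, labels=("P", "Q", "R", "S")):
    """Variant 2: replace the usual A/B/C/D labels with alternative symbols."""
    return [f"{lab}. {opt}" for lab, opt in zip(labels, options)]

def to_true_false(question, options, answer_idx):
    """Variant 3: turn the MCQ into one True/False judgment per option."""
    return [(f"{question} Proposed answer: {opt}. True or False?", i == answer_idx)
            for i, opt in enumerate(options)]

def consistent_accuracy(predictions_per_variant, gold):
    """A question counts as correct only if *every* variant is answered
    correctly (predictions already mapped to original option indices)."""
    n = len(gold)
    consistent = sum(
        all(preds[q] == gold[q] for preds in predictions_per_variant)
        for q in range(n)
    )
    return consistent / n
```

For example, a model that answers question 0 correctly under both variants but question 1 under only one of them scores a consistent accuracy of 0.5, even though its per-variant accuracy is higher.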