The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

Kyle Moore; Jesse Roberts; Thao Pham; Oseremhen Ewaleifoh; Douglas Fisher

doi:10.18653/v1/2024.findings-emnlp.126

Abstract

Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance, i.e., guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
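
As a rough illustration of the "guess A if uncertain" idea in the abstract, the toy sketch below shows how a small base-rate preference for one answer token can change measured accuracy without any task knowledge. It is not the authors' method or code: the function names (`base_rates`, `accuracy`) and all probabilities are hypothetical, hand-made values chosen only to make the effect visible.

```python
LABELS = ["A", "B", "C", "D"]

def base_rates(distributions):
    """Mean probability assigned to each answer label across all questions."""
    n = len(distributions)
    return {lab: sum(d[lab] for d in distributions) / n for lab in LABELS}

def accuracy(distributions, gold):
    """Fraction of questions whose highest-probability label matches the gold label."""
    preds = [max(LABELS, key=lambda lab: d[lab]) for d in distributions]
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Hypothetical model output: nearly uniform (no task knowledge),
# with a slight built-in preference for the token "A".
uncertain = {"A": 0.31, "B": 0.23, "C": 0.23, "D": 0.23}
dists = [uncertain] * 4

gold_all_a = ["A"] * 4                # gold answers happen to sit at the favored label
gold_balanced = ["A", "B", "C", "D"]  # gold answers spread evenly over the labels

rates = base_rates(dists)
acc_all_a = accuracy(dists, gold_all_a)
acc_balanced = accuracy(dists, gold_balanced)

print(rates)
print(acc_all_a, acc_balanced)
```

In this toy setup the same uninformed distribution scores perfectly when the gold answers coincide with the favored label and only at chance level when they are balanced, which is the kind of conflation between test-taking behavior and task performance that the abstract describes.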

Anthology ID: 2024.findings-emnlp.126
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2283–2288
URL: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.126/
DOI: 10.18653/v1/2024.findings-emnlp.126
Cite (ACL): Kyle Moore, Jesse Roberts, Thao Pham, Oseremhen Ewaleifoh, and Douglas Fisher. 2024. The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2283–2288, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance (Moore et al., Findings 2024)
PDF: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.126.pdf
Data: 2024.findings-emnlp.126.data.zip
