2025
Metric assessment protocol in the context of answer fluctuation on MCQ tasks
Ekaterina Goliakova | Xavier Renard | Marie-Jeanne Lesot | Thibault Laugel | Christophe Marsala | Marcin Detyniecki
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Using multiple-choice questions (MCQs) has become a standard way to assess LLM capabilities efficiently. A variety of metrics can be employed for this task, yet previous research has not assessed them thoroughly. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different answers given slight changes in prompts. We propose a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as with original performance. Our results show a strong link between existing metrics and answer fluctuation, even when the metrics are computed without any additional prompt variants. The highest association under the protocol is demonstrated by a novel metric, worst accuracy.
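The abstract does not define the metrics formally; the sketch below shows one plausible reading, in which the fluctuation rate is the fraction of questions whose answer changes across prompt variants, and worst accuracy counts a question as correct only if every variant answers it correctly. The function names and data layout are assumptions for illustration, not the paper's implementation.

```python
def fluctuation_rate(answers_per_variant):
    """Assumed definition: fraction of questions whose predicted answer
    differs across at least two prompt variants."""
    n = len(answers_per_variant[0])
    changed = sum(
        1 for i in range(n)
        if len({variant[i] for variant in answers_per_variant}) > 1
    )
    return changed / n

def worst_accuracy(answers_per_variant, gold):
    """Assumed definition: a question counts as correct only when all
    prompt variants produce the gold answer (a lower bound on accuracy)."""
    n = len(gold)
    correct = sum(
        1 for i in range(n)
        if all(variant[i] == gold[i] for variant in answers_per_variant)
    )
    return correct / n

# Toy usage: two prompt variants over three questions.
variants = [["A", "B", "C"],   # answers under variant 1
            ["A", "B", "D"]]   # answers under variant 2
gold = ["A", "B", "C"]
```

Under this toy example, only the third question fluctuates, so the fluctuation rate is 1/3 and the worst accuracy is 2/3.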