Christophe Marsala


2025

Metric assessment protocol in the context of answer fluctuation on MCQ tasks
Ekaterina Goliakova | Xavier Renard | Marie-Jeanne Lesot | Thibault Laugel | Christophe Marsala | Marcin Detyniecki
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task, yet previous research has not assessed them thoroughly. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We propose a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as with original performance. Our results show a strong link between existing metrics and answer changes, even when the metrics are computed without any additional prompt variants. The highest association under the protocol is demonstrated by a novel metric, worst accuracy.
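The abstract does not define the quantities it names, but the two central ones can be sketched under plausible assumptions: a *fluctuation rate* as the fraction of questions whose answer changes across prompt variants, and *worst accuracy* as credit only for questions answered correctly under every variant. Both the function names and these exact definitions are illustrative assumptions, not the paper's stated formulas:

```python
from typing import Dict, List

def fluctuation_rate(answers: Dict[str, List[str]]) -> float:
    """Fraction of questions whose answer differs across prompt variants
    (one plausible reading of 'answer fluctuation')."""
    fluctuating = sum(1 for variants in answers.values() if len(set(variants)) > 1)
    return fluctuating / len(answers)

def worst_accuracy(answers: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of questions answered correctly under *every* prompt variant
    (an assumed definition of the 'worst accuracy' metric)."""
    correct_everywhere = sum(
        1 for q, variants in answers.items() if all(a == gold[q] for a in variants)
    )
    return correct_everywhere / len(answers)

# Toy, entirely hypothetical data: 3 questions, 2 prompt variants each.
answers = {"q1": ["A", "A"], "q2": ["B", "C"], "q3": ["D", "D"]}
gold = {"q1": "A", "q2": "B", "q3": "C"}

print(fluctuation_rate(answers))      # only q2 fluctuates
print(worst_accuracy(answers, gold))  # only q1 is correct under all variants
```

Under these definitions worst accuracy is strictly harsher than standard accuracy (here q3 is consistent but wrong, and q2 is right in only one variant), which is consistent with the abstract's claim that it tracks fluctuation closely.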