Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints

Alec Bunn, Sarah Wiegreffe, Ben Bogin


Abstract
Evaluation of intermediate language model checkpoints during training is critical for effective model development and selection. However, reliable evaluation using the popular multiple-choice question (MCQ) format is challenging, as small and non-instruction-tuned models often lack the symbolic reasoning required for the task. This is despite the fact that MCQ evaluation is often used and needed to distinguish between the performance of different training runs. In particular, when prompted with a question and a set of labeled answer choices (e.g., “A. ..., B. ..., C. ...”), many models struggle to emit the correct label (e.g., “C”), even when they can select the correct answer string. We propose an alternative evaluation method: fine-tuning the model on an auxiliary MCQ dataset prior to outputting labels. We validate this approach empirically by showing that training on auxiliary data improves MCQ ability on all but one of our test datasets. This approach provides a more accurate signal of model capability at intermediate checkpoints, as it disentangles the evaluation of core knowledge from the model’s emerging ability to follow formatting instructions.
Anthology ID:
2025.gem-1.46
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
511–521
URL:
https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.46/
Cite (ACL):
Alec Bunn, Sarah Wiegreffe, and Ben Bogin. 2025. Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 511–521, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints (Bunn et al., GEM 2025)
PDF:
https://preview.aclanthology.org/transition-to-people-yaml/2025.gem-1.46.pdf