The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese

Júlia da Rocha Junqueira, Viviane P. Moreira


Abstract
Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of correctly generating answers based on a context. To assess the success of the task, the answers are typically evaluated using traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to the evaluation of QA. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches for QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited in evaluating QA. We also observe that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
Anthology ID:
2026.propor-1.54
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
551–561
URL:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.54/
Cite (ACL):
Júlia da Rocha Junqueira and Viviane P. Moreira. 2026. The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 551–561, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese (Junqueira & Moreira, PROPOR 2026)
PDF:
https://preview.aclanthology.org/ingest-dnd/2026.propor-1.54.pdf