Beyond BLEU: Ethical Risks of Misleading Evaluation in Domain-Specific QA with LLMs

Ayoub Nainia; Régine Vignes-Lebbe; Hajar Mousannif; Jihad Zahir

Abstract

Large Language Models (LLMs) are increasingly used in scientific question answering (QA), including high-stakes fields such as biodiversity informatics. However, standard evaluation metrics such as BLEU, ROUGE, Exact Match (EM), and BERTScore remain poorly aligned with the factual and domain-specific requirements of these tasks. In this work, we investigate the gap between automatic metrics and expert judgment in botanical QA by comparing metric scores with human ratings across five dimensions: accuracy, completeness, relevance, fluency, and terminology usage. Our results show that standard metrics often misrepresent response quality, particularly in the presence of paraphrasing, omission, or domain-specific language. Through both quantitative analysis and qualitative examples, we show that high-scoring responses may still exhibit critical factual errors or omissions. These findings highlight the need for domain-aware evaluation frameworks that incorporate expert feedback and raise important ethical concerns about the deployment of LLMs in scientific contexts.
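The gap described above between surface-overlap metrics and factual correctness can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: `exact_match` and `token_f1` are simplified stand-ins for EM and n-gram metrics such as BLEU or ROUGE, and the three sentences are invented toy examples, not data from the paper.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the lowercased token sequences are identical, else 0."""
    return int(prediction.lower().split() == reference.lower().split())


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1, a simple stand-in for n-gram overlap metrics."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented toy sentences (assumption: not taken from the paper's dataset).
reference = "the leaves are opposite and the flowers are yellow"
wrong_but_similar = "the leaves are alternate and the flowers are yellow"
correct_paraphrase = "yellow blooms with foliage arranged in opposing pairs"

f1_wrong = token_f1(wrong_but_similar, reference)
f1_paraphrase = token_f1(correct_paraphrase, reference)

print(f"factually wrong answer: EM={exact_match(wrong_but_similar, reference)}, F1={f1_wrong:.2f}")
print(f"correct paraphrase:     EM={exact_match(correct_paraphrase, reference)}, F1={f1_paraphrase:.2f}")
```

In this toy case the answer with the wrong leaf arrangement scores about 0.89 on overlap F1, while the factually correct paraphrase scores about 0.12, and Exact Match gives both a 0. A domain expert would rank the two answers the other way round, which is the kind of mismatch the abstract refers to.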

Anthology ID: 2025.r2lm-1.9
Volume: Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Month: September
Year: 2025
Address: Varna, Bulgaria
Editors: Alicia Picazo-Izquierdo, Ernesto Luis Estevanell-Valladares, Ruslan Mitkov, Rafael Muñoz Guillena, Raúl García Cerdá
Venues: R2LM | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 77–86
URL: https://preview.aclanthology.org/corrections-2026-01/2025.r2lm-1.9/
Cite (ACL): Ayoub Nainia, Régine Vignes-Lebbe, Hajar Mousannif, and Jihad Zahir. 2025. Beyond BLEU: Ethical Risks of Misleading Evaluation in Domain-Specific QA with LLMs. In Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models, pages 77–86, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Beyond BLEU: Ethical Risks of Misleading Evaluation in Domain-Specific QA with LLMs (Nainia et al., R2LM 2025)
PDF: https://preview.aclanthology.org/corrections-2026-01/2025.r2lm-1.9.pdf
