Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation

Hashem Hijazi; Diego Molla; Vincent Nguyen; Sarvnaz Karimi


Abstract

Biomedical question-answering systems remain popular among biomedical experts who consult the literature to answer medical questions. However, these systems are difficult to evaluate without costly human experts, so automatic evaluation metrics are often used instead. Traditional automatic metrics such as ROUGE or BLEU, which rely on token overlap, correlate poorly with human judgements. We present a study that uses large language models (LLMs) to automatically evaluate systems from BioASQ, an international challenge on biomedical semantic indexing and question answering. We measure the agreement of LLM-produced scores with human judgements. With basic prompting techniques, LLMs correlate with human judgements about as well as lexical methods do. However, by aggregating evaluators with LLMs or by fine-tuning, our methods outperform the baselines by a large margin, achieving Spearman correlations of 0.501 and 0.511, respectively.
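The agreement measure named in the abstract, Spearman correlation, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the score lists are hypothetical, chosen only to show how a rank correlation between automatic scores and human judgements is computed, assuming ties receive the average of their ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five summaries (illustrative only, not BioASQ data).
llm_scores = [3.0, 4.5, 2.0, 5.0, 1.0]
human_scores = [2, 4, 3, 5, 1]
print(f"Spearman correlation: {spearman(llm_scores, human_scores):.3f}")
```

Because the measure depends only on rank order, an evaluator can score on any scale and still be compared with human ratings, as long as it orders the summaries in a similar way.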

Anthology ID: 2024.bionlp-1.18
Volume: Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, Junichi Tsujii
Venues: BioNLP | WS
SIG: SIGBIOMED
Publisher: Association for Computational Linguistics
Pages: 236–242
URL: https://aclanthology.org/2024.bionlp-1.18
Cite (ACL): Hashem Hijazi, Diego Molla, Vincent Nguyen, and Sarvnaz Karimi. 2024. Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 236–242, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation (Hijazi et al., BioNLP-WS 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.bionlp-1.18.pdf
