Abstract
Biomedical question-answering systems remain popular among biomedical experts who interact with the literature to answer their medical questions. However, these systems are difficult to evaluate without costly human experts, so automatic evaluation metrics are often used instead. Traditional automatic metrics such as ROUGE and BLEU, which rely on token overlap, have shown low correlation with human judgements. We present a study that uses large language models (LLMs) to automatically evaluate systems from BioASQ, an international challenge on biomedical semantic indexing and question answering. We measure the agreement of LLM-produced scores with human judgements. With basic prompting techniques, LLMs correlate with human judgements about as well as lexical methods do. However, by aggregating evaluators with LLMs or by fine-tuning, our methods outperform the baselines by a large margin, achieving Spearman correlations of 0.501 and 0.511, respectively.
- Anthology ID:
- 2024.bionlp-1.18
- Volume:
- Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, Junichi Tsujii
- Venues:
- BioNLP | WS
- SIG:
- SIGBIOMED
- Publisher:
- Association for Computational Linguistics
- Pages:
- 236–242
- URL:
- https://aclanthology.org/2024.bionlp-1.18
- Cite (ACL):
- Hashem Hijazi, Diego Molla, Vincent Nguyen, and Sarvnaz Karimi. 2024. Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 236–242, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation (Hijazi et al., BioNLP-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.bionlp-1.18.pdf