D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Duygu Altinok


Abstract
Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance across a variety of tasks. However, they are not without their own set of challenges, including hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs across diverse reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, where they were effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and applied in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.
Anthology ID:
2024.semeval-1.91
Volume:
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
613–627
URL:
https://aclanthology.org/2024.semeval-1.91
Cite (ACL):
Duygu Altinok. 2024. D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 613–627, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models (Altinok, SemEval 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.semeval-1.91.pdf
Supplementary material:
2024.semeval-1.91.SupplementaryMaterial.txt