Assessing the Difficulty of Inference Types in Natural Language Inference for Clinical Trials

Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi


Abstract
Large Language Models (LLMs) achieve competitive results on Natural Language Inference when applied to clinical trials; however, it is not yet clear which type of inference LLMs perform well or poorly on. We address this by proposing new supplementary annotations for the existing NLI4CT dataset on the types of inferences observed in clinical trials. Our dataset supplements NLI4CT with a total of 1,949 new annotations using our carefully crafted guidelines for 17 types of inferences. To investigate how inference types affect the performance of LLMs, we prompt Flan-T5, Llama, Mistral, and Qwen and evaluate their performance using our newly annotated dataset. We found that logical inferences negatively affect the overall performance of Qwen3-4B, Qwen2.5-7B, and Qwen2.5-14B, whereas numerical inferences negatively affect the performance of Flan-T5-XL and Mixtral. Further analysis shows that MMed-Llama-3 struggles to understand the structure of clinical trial reports. Other parameters, such as the number of inference types involved or the section type in the premise, also influence the performance of the models. Our code and dataset are publicly available.
Anthology ID:
2026.lrec-main.413
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
5290–5300
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.413/
Cite (ACL):
Mathilde Aguiar, Pierre Zweigenbaum, and Nona Naderi. 2026. Assessing the Difficulty of Inference Types in Natural Language Inference for Clinical Trials. International Conference on Language Resources and Evaluation, main:5290–5300.
Cite (Informal):
Assessing the Difficulty of Inference Types in Natural Language Inference for Clinical Trials (Aguiar et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.413.pdf
Optional supplementary material:
2026.lrec-main.413.OptionalSupplementaryMaterial.txt