PharmaQA.IT: an Italian dataset for Q A in the pharmaceutical domain

Kamyar Zeinalipour, Andrea Zugarini, Asya Zanollo, Leonardo Rigutini


Abstract
The growing use of Large Language Models (LLMs) for medical Question Answering (QA) requires reliable, evidence-grounded benchmarks beyond English. In Italy, Riassunti delle Caratteristiche del Prodotto (RCP) issued by the Italian Medicines Agency (AIFA) are the main regulatory source on medicines, yet no QA dataset exists on these documents, limiting the development and evaluation of trustworthy Italian QA systems.We introduce PharmaQA.IT, an Italian extractive QA dataset built from RCPs in PharmaER.IT. Using a semi-automatic pipeline, we (i) select informative pages from 1,077 leaflets, (ii) prompt a multimodal LLM on page images with professional personas to generate candidate question–answer pairs, and (iii) validate and normalise them with expert revision. The final dataset contains 861 high-quality question–answer pairs on indications, contraindications, dosage, warnings, interactions, and pharmacological properties.We frame PharmaQA.IT as an extractive QA benchmark with structured JSON outputs and evaluate a range of open and proprietary LLMs. Results show that open models approach closed-source performance under a chunking-and-retrieval setup. PharmaQA.IT, together with all code, prompts, and evaluation scripts, will be publicly released to support research on trustworthy Italian biomedical QA.PharmaQA.IT, together with all code, prompts, and evaluation scripts, is publicly available on Hugging Face to support research on trustworthy Italian biomedical QA.
Anthology ID:
2026.eacl-industry.70
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Yevgen Matusevych, Gülşen Eryiğit, Nikolaos Aletras
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
937–947
Language:
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry.70/
DOI:
Bibkey:
Cite (ACL):
Kamyar Zeinalipour, Andrea Zugarini, Asya Zanollo, and Leonardo Rigutini. 2026. PharmaQA.IT: an Italian dataset for Q A in the pharmaceutical domain. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pages 937–947, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
PharmaQA.IT: an Italian dataset for Q A in the pharmaceutical domain (Zeinalipour et al., EACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry.70.pdf