PharmaQA.IT: an Italian dataset for Q&A in the pharmaceutical domain
Kamyar Zeinalipour, Andrea Zugarini, Asya Zanollo, Leonardo Rigutini
Abstract
The growing use of Large Language Models (LLMs) for medical Question Answering (QA) requires reliable, evidence-grounded benchmarks beyond English. In Italy, the Riassunti delle Caratteristiche del Prodotto (RCP, Summaries of Product Characteristics) issued by the Italian Medicines Agency (AIFA) are the main regulatory source on medicines, yet no QA dataset exists for these documents, which limits the development and evaluation of trustworthy Italian QA systems.
We introduce PharmaQA.IT, an Italian extractive QA dataset built from RCPs in PharmaER.IT. Using a semi-automatic pipeline, we (i) select informative pages from 1,077 leaflets, (ii) prompt a multimodal LLM on page images with professional personas to generate candidate question–answer pairs, and (iii) validate and normalise them with expert revision. The final dataset contains 861 high-quality question–answer pairs on indications, contraindications, dosage, warnings, interactions, and pharmacological properties.
We frame PharmaQA.IT as an extractive QA benchmark with structured JSON outputs and evaluate a range of open and proprietary LLMs. Results show that open models approach closed-source performance under a chunking-and-retrieval setup. PharmaQA.IT, together with all code, prompts, and evaluation scripts, is publicly available on Hugging Face to support research on trustworthy Italian biomedical QA.
- Anthology ID:
- 2026.eacl-industry.70
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Yevgen Matusevych, Gülşen Eryiğit, Nikolaos Aletras
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 937–947
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry.70/
- Cite (ACL):
- Kamyar Zeinalipour, Andrea Zugarini, Asya Zanollo, and Leonardo Rigutini. 2026. PharmaQA.IT: an Italian dataset for Q&A in the pharmaceutical domain. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pages 937–947, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- PharmaQA.IT: an Italian dataset for Q&A in the pharmaceutical domain (Zeinalipour et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-industry.70.pdf