RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models

Dragos-Dumitru Ghinea, Adela-Nicoleta Corbeanu, Marius-Adrian Dumitran


Abstract
In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset for multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a comprehensive resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. Additionally, we perform comprehensive experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering valuable insights for future research and development.
Anthology ID:
2025.mrl-main.37
Volume:
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
David Ifeoluwa Adelani, Catherine Arnett, Duygu Ataman, Tyler A. Chang, Hila Gonen, Rahul Raja, Fabian Schmidt, David Stap, Jiayi Wang
Venues:
MRL | WS
Publisher:
Association for Computational Linguistics
Pages:
551–567
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.mrl-main.37/
Cite (ACL):
Dragos-Dumitru Ghinea, Adela-Nicoleta Corbeanu, and Marius-Adrian Dumitran. 2025. RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 551–567, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models (Ghinea et al., MRL 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.mrl-main.37.pdf