IfGPT: A Dataset in Bulgarian for Large Language Models

Svetla Peneva Koeva, Ivelina Stoyanova, Jordan Konstantinov Kralev


Abstract
The paper presents IfGPT, a large dataset that brings together available corpora and datasets for Bulgarian, and describes methods for continuously expanding it with deduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning of large language models (LLMs) or for Retrieval-Augmented Generation (RAG). The paper focuses on the description of the extended metadata of the IfGPT dataset and its management in a graph database.
Anthology ID:
2025.lowresnlp-1.7
Volume:
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Ernesto Luis Estevanell-Valladares, Alicia Picazo-Izquierdo, Tharindu Ranasinghe, Besik Mikaberidze, Simon Ostermann, Daniil Gurgurov, Philipp Mueller, Claudia Borg, Marián Šimko
Venues:
LowResNLP | WS
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
65–75
URL:
https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.7/
Cite (ACL):
Svetla Peneva Koeva, Ivelina Stoyanova, and Jordan Konstantinov Kralev. 2025. IfGPT: A Dataset in Bulgarian for Large Language Models. In Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages, pages 65–75, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
IfGPT: A Dataset in Bulgarian for Large Language Models (Koeva et al., LowResNLP 2025)
PDF:
https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.7.pdf