IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva, Ivelina Stoyanova, Jordan Konstantinov Kralev
Abstract
The paper presents the large dataset IfGPT, which contains available corpora and datasets for Bulgarian, and describes methods to continuously expand it with unduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning or Retrieval Augmented Generation (RAG) of large language models (LLMs). The paper focuses on the description of the extended metadata of the IfGPT dataset and its management in a graph database.
- Anthology ID:
- 2025.lowresnlp-1.7
- Volume:
- Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
- Month:
- September
- Year:
- 2025
- Address:
- Varna, Bulgaria
- Editors:
- Ernesto Luis Estevanell-Valladares, Alicia Picazo-Izquierdo, Tharindu Ranasinghe, Besik Mikaberidze, Simon Ostermann, Daniil Gurgurov, Philipp Mueller, Claudia Borg, Marián Šimko
- Venues:
- LowResNLP | WS
- Publisher:
- INCOMA Ltd., Shoumen, Bulgaria
- Pages:
- 65–75
- URL:
- https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.7/
- Cite (ACL):
- Svetla Peneva Koeva, Ivelina Stoyanova, and Jordan Konstantinov Kralev. 2025. IfGPT: A Dataset in Bulgarian for Large Language Models. In Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages, pages 65–75, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
- Cite (Informal):
- IfGPT: A Dataset in Bulgarian for Large Language Models (Koeva et al., LowResNLP 2025)
- PDF:
- https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.7.pdf