Svetla Peneva Koeva

2026

Assessing the broad general knowledge of Large Language Models (LLMs) across multiple domains in Bulgarian remains challenging due to the limited availability of Bulgarian evaluation benchmarks. To address this gap, we introduce the Bulgarian Massive Multitask Language Understanding benchmark (MMLU-BG), designed to evaluate whether LLMs possess generalised knowledge capabilities beyond simple text prediction in Bulgarian. This paper presents the structure, the development protocol, and the size of the MMLU-BG benchmark. It is tested in comparison with the original MMLU for English across seven LLMs selected according to specific criteria. The experiments demonstrate that the MMLU-BG benchmark assesses multi-domain versatility and highlights the models’ strengths and weaknesses across different subject areas.
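As a rough illustration of how an MMLU-style benchmark can expose strengths and weaknesses across subject areas, the sketch below computes accuracy per subject from multiple-choice predictions. It is not the authors' evaluation code: the function name `accuracy_by_subject` and the `(subject, gold, predicted)` record layout are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Compute per-subject accuracy for multiple-choice answers.

    `records` is an iterable of (subject, gold, predicted) tuples.
    This layout is hypothetical and not taken from MMLU-BG itself.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, predicted in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    # Every subject in `total` has at least one record, so no division by zero.
    return {subject: correct[subject] / total[subject] for subject in total}


if __name__ == "__main__":
    # Toy records for illustration only; these are not benchmark results.
    toy_records = [
        ("history", "A", "A"),
        ("history", "B", "C"),
        ("physics", "D", "D"),
    ]
    for subject, accuracy in sorted(accuracy_by_subject(toy_records).items()):
        print(f"{subject}: {accuracy:.2f}")
```

Running the same function on a model's Bulgarian and English predictions would give two per-subject tables that can be compared side by side.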


A Large Dataset Representing Bulgarian, with the Bulgarian National Corpus as Its Core
Svetla Peneva Koeva | Ivelina Stoyanova
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

Many thanks to all reviewers for the detailed comments and suggestions. Misspellings, formatting errors, and other minor issues have all been corrected and are not listed below.
1. (Reviewer 1) Comment: For example, the relation of some described corpora to the Bulgarian National Corpus (like MIC21, Bulgarian MARCELL, General News in Bulgarian, ...) is not clear to me. Are they part of BulNC or the other way around? It’s stated that the BulNC is part of CURLICAT, so I would assume the relation with the other corpora is similar?
   Response: The conclusion now clarifies the relations between BulNC and IfGPT and its subsets. The paper was also restructured to clarify these issues.
2. (Reviewer 1) Comment: Some sections on the paper reference related work, where again the relation is not clear to me, like on page 3 "Another direction, which still presents significant challenges, is towards multilingual data. O’Keeffe et al. (2024) describe a pipeline for capturing professional video-call interactions, including screen recordings, speaker tracking, and facial expression data. Macaire et al. (2024) present a speech text pictogram corpus for French (230 hours), targeted at augmentative and alternative communication research. Lai and Pustejovsky (2024) develop an annotation scheme for iconic, deictic, and beat gestures anchored in Abstract Meaning Representation." As these publications follow the MIC21 corpus, it’s not clear to me, how they relate.
   Response: These references are reduced; a citation for the MIC corpus is provided where more details are available on the related works of MIC.
3. (Reviewer 1) Comment: I would be very interested in the benefits of a Graph database for the Metadata and I think it would be very valuable to describe, why the Graph database is more appropriate for Metadata than commonly used alternatives. The described relations are not totally convincing to me.
   Response: A paragraph justifying the use of a graph database has been added to the Metadata management section.
4. (Reviewer 1) Comment: The publicly accessible web interface should be linked to. It’s also not clear to me, if it provides a fulltext search or only a keyword search in the Metadata (i.e. in the Graph database).
   Response: Links are provided to both the BulNC search interface (full-text search, mainly for linguistic research) and the IfGPT metadata search interface (allowing selection of subdatasets for NLP tasks and LLM fine-tuning).
5. (Reviewer 2) Comment: My first doubt is why all the data is presented as part or at least related to the Bulgarian National Corpus - to me, national corpora are reference language corpora with all the characteristics this entails, like a carefully balanced corpus representative of contemporary standard Bulgarian. I would not expect to see in this context mentioned artificially produced language for LLMs, multilingual or foreign language corpora or image corpora. To me it would make a lot more sense to re-cast the paper as presenting (newly) available language resources for Bulgarian (maybe in the context of the Bulgarian CLARIN / CLARIAH, which is not even mentioned!) rather than shoe-horning them to the BulNC.
   Response: The ties between the BulNC and the large dataset IfGPT have been clarified in the Introduction, in the text, and in the Conclusion.
6. (Reviewer 2) Comment: Second, it would help the reader that, rather than just listing all the resources, a table with their key characteristics would be provided first, and then each introduced. Similarly, the endless repetitions of "so-and-so many JSON files" for each domain in Sec. 2.5 are not really helpful, as, first, texts or words are a better metric than files, and, second, this information would also be better presented in a table.
   Response: A table has now been provided listing the resources and their key properties, including size.
7. (Reviewer 2) Comment: "The BulNC-based dataset is publicly accessible through a dedicated web interface" I don’t see that these datasets are in any way BulNC-based.
   Response: Clarifications are made on this issue in the Introduction and the Conclusion.
8. (Reviewer 3) Comment: I am missing (at least a sketchy) description of tools used for tagging and/or parsing the corpus data.
   Response: A reference to the Bulgarian Language Processing Tool Set is now provided.
9. (Reviewer 3) Comment: It is also not clear why the corpus is still using the web interface developed before 2014, and not any newer tool (with a FLOSS license, such as CQPweb or NoSketch Engine).
   Response: This has been clarified in Section 2.1.
10. (Reviewer 3) Comment: I would also expect the text to include summary statistics (preferably in a single table), so the sizes of the respective resources can be compared.
   Response: Table 1 is now provided for this purpose.


Recent Developments of the Bulgarian National Corpus
Svetla Peneva Koeva | Ivelina Stoyanova
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

1. (Reviewer 1) Comment: Some terminology could do with more explanation, such as MARCELL and CURLICAT, however, the general point is very clear.
   Response: We have added some more details on the international projects that involved the creation of large datasets included in BulNC.
2. (Reviewer 1) Comment: I am especially interested in the enrichment of the corpus with multimodal data. I think this is something that many large corpus managers would be interested in exploring.
   Response: We added some more details on the multimodal dataset, the organisation of the multimodal data, the ontology description, and the applications.
3. (Reviewer 2) Comment: It does not, however, give the reader a clear idea of current priorities and future directions, and it largely fails to put the work in context of other research.
   Response: A clarification has been made in the conclusion that we aim at extending the large dataset with more data and extensive metadata description in order to facilitate development of language technologies and fine-tuning of LLMs.
4. (Reviewer 2) Comment: - "Like many other large reference corpora" - please give reference so that readers can know in what context you see your own work
   Response: We expanded the Introduction with a paragraph citing other related work on large reference corpora: "There are two main approaches to providing search interfaces for large reference corpora. ..." Also, in the conclusion we included more details on large datasets used for LLMs, which provide context for our future work.
5. (Reviewer 2) Comment: - "linguistic and corpus research" - do you see these as two different (sub)disciplines? Explain or use a different wording
   Response: Thank you for the remark; it is well founded, and we reformulated it as ‘linguistic and NLP research’.
6. (Reviewer 2) Comment: - JSONL and CSV - explain what that is and how you are using it for linguistic data – the formats themselves are just very general specifications for textual data. Are you using any particular linguistic standards?
   Response: More details are provided in the text with respect to the BulNC processing pipelines and the handling of different file formats. Due to the limited length of the paper, we have not provided details on the linguistic annotation of the corpus.
7. (Reviewer 2) Comment: - "now called the IfGPT dataset" – is that relevant? What does the acronym stand for?
   Response: In the fourth paragraph of Section 1, we explain: "These efforts led to the development of the large BulNC-based dataset within the project IfGPT: Infrastructure for Fine-tuning Pre-trained Large Language Models (thus, also called the IfGPT dataset), with a special focus on the efficient management of large text data."
8. (Reviewer 2) Comment: - Table 1 - please explain what the different corpus components actually are
   Response: We clarified in the following way: "Further extensions of the dataset include newly collected and processed texts from various time periods. Older texts, such as news articles, periodicals, and books published before 1990, are also collected and processed using OCR."
9. (Reviewer 2) Comment: - "25 languages" - which ones? How were they chosen?
   Response: We clarified in the following way: "The selection of languages was based on the availability of wordnets in various languages in the Extended Open Multilingual Wordnet."
10. (Reviewer 2) Comment: - "therefore has a complex graph-based structure" – I don’t see why this follows from the fact that BulNC is designed to support corpus and language research. Can you explain? How does your graph-based structure relate to other approaches to metadata?
   Response: Justification for the use of the Neo4J database is provided: "The metadata are managed using a graph database, Neo4J, that is designed to handle large volumes of interconnected data efficiently and maintains performance under complex queries using the Cypher query language."
11. (Reviewer 2) Comment: - "Corpus Query Tool specifically developed for the BulNC" – reference? URL?
   Response: References are provided to both the BulNC search interface and the IfGPT metadata web search.
12. (Reviewer 2) Comment: - References – these are exclusively self-references. Do you not want to put your work into the context of other CMLC contributions?
   Response: More references are provided. See 4.
13. (Reviewer 2) Comment: - General remark: You’re leaving implicit what BulNC is *not* doing. Can you devote at least one sentence to your approach to spoken language? What about CMC, learner language, etc.?
   Response: Currently, we have not extended our work towards including spoken language data or other specialised datasets (e.g., learner data, etc.).
14. (Reviewer 3) Comment: The difference between "BulNC", "BulNC-based dataset" and "IfGPT dataset" is never really made clear and needs to be inferred by the reader. The presentation would benefit greatly from one or two sentences early on that explicitly spell out the relationship between BulNC, the BulNC-based dataset, and the IfGPT dataset.
   Response: We thank the reviewers for highlighting the need to clarify the relationship between "BulNC", the "BulNC-based dataset", and the "IfGPT dataset". The answer is the same as for item 7 above.

2025


IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva | Ivelina Stoyanova | Jordan Konstantinov Kralev
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

The paper presents the large dataset IfGPT, which contains available corpora and datasets for Bulgarian, and describes methods to continuously expand it with deduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning large language models (LLMs) or for Retrieval-Augmented Generation (RAG). The paper focuses on the description of the extended metadata of the IfGPT dataset and its management in a graph database.
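To make the idea of metadata-driven extraction concrete, here is a minimal sketch of selecting a domain-oriented subset from annotated samples while skipping exact duplicate texts. It uses a plain in-memory Python list in place of the graph database the paper describes. The `Sample` class, its fields (`domain`, `source`, `year`), and `select_samples` are hypothetical names for illustration, not the IfGPT schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All field names are assumptions made for this sketch,
    # not the actual IfGPT metadata schema.
    sample_id: str
    text: str
    domain: str
    source: str
    year: int


def select_samples(samples, *, domain=None, min_year=None):
    """Return samples matching every given filter, skipping exact duplicate texts."""
    selected = []
    seen_texts = set()
    for sample in samples:
        if domain is not None and sample.domain != domain:
            continue
        if min_year is not None and sample.year < min_year:
            continue
        if sample.text in seen_texts:
            continue  # keep only the first occurrence of an identical text
        seen_texts.add(sample.text)
        selected.append(sample)
    return selected


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = [
        Sample("s1", "Текст едно", "news", "portal", 2020),
        Sample("s2", "Текст едно", "news", "portal", 2021),
        Sample("s3", "Текст две", "legal", "gazette", 2019),
    ]
    for sample in select_samples(toy, domain="news", min_year=2015):
        print(sample.sample_id)
```

In a graph database, the same selection would be a query over metadata nodes and relations. The list-based version only shows the filtering logic.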