Tsvetelina Stefanova
2026
UD-CHILDES-BG: a dependency treebank of Bulgarian child and child-directed speech
Mila Marcheva-Nash | Yasena Chantova | Tsvetina Kirilova | Ivelina Pavlova | Tsvetelina Stefanova | Yoana Vasileva | Weiwei Sun
Proceedings of the 20th Linguistic Annotation Workshop (LAW XX)
This paper presents (i) UD-CHILDES-BG, a manually corrected Universal Dependencies treebank of Bulgarian child and child-directed speech, (ii) a quantitative and phenomenon-based evaluation of inter-annotator agreement on developmental data, and (iii) a systematic analysis of parser errors in this underrepresented domain. We manually correct 4,338 dependency parses (10% of the CHILDES-BG corpus), of which 14% are double-annotated. Inter-annotator agreement on UAS/LAS is 91.71/86.12 for child-directed speech (CDS) and 88.14/81.40 for child speech (CS). Parser performance on the manually corrected portion is 92.70/85.54 for CDS and 90.97/81.52 for CS, compared to a reported 93.37/90.21 on the test set of adult written language. Our analyses reveal that CDS and CS pose challenges for dependency annotation and parsing, particularly in discourse-related structures, which are less common in adult written language.
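The UAS/LAS pairs quoted in this abstract follow the standard attachment-score definitions: UAS is the percentage of tokens whose head matches the reference, and LAS additionally requires the dependency label to match. A minimal sketch of that computation is given below. It assumes each token is represented as a (head index, label) pair; the function name `attachment_scores` and the toy sentence are illustrative only and are not taken from the paper or its evaluation code.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
# This representation is an assumption made for illustration.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens whose head index matches the reference.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head index AND label both match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * head_hits / n, 100 * labeled_hits / n


# Toy four-token sentence (hypothetical data, not from the treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

The same function can score one annotator against another (inter-annotator agreement) or a parser against the corrected trees, since both comparisons reduce to matching heads and labels token by token.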
2025
Automatic Detection of the Bulgarian Evidential Renarrative
Irina Temnikova | Ruslana Margova | Stefan Minkov | Tsvetelina Stefanova | Nevena Grigorova | Silvia Gargova | Venelin Kovatchev
Journal Computational Linguistics in Bulgaria
Manual and automatic verification of the trustworthiness of information is an important task. Knowing whether the author of a statement was an eyewitness to the reported event(s) is a useful clue. In linguistics, such information is expressed through “evidentiality”. Evidentials are especially important in Bulgarian, as Bulgarian journalists often use a specific type of evidential (“renarrative”) to report events that they neither directly observed nor verified. Unfortunately, there are no automatic tools to detect the Bulgarian renarrative. This article presents the first two automatic solutions for this task: a fine-tuned BERT classifier (the renarrative BERT detector, BGRenBERT), which achieves 0.98 accuracy on the test split, and a rule-based renarrative detector (BGRenRules), built from regular expressions that match a parser’s output. Both solutions detect Bulgarian texts containing the most frequently encountered forms of the renarrative. Additionally, we compare the results of the two detectors with the manual annotation of subsets of two Bulgarian fake text datasets. BGRenRules obtains substantially higher results than BGRenBERT. The error analysis shows that the errors from BGRenRules most frequently correspond to cases in which humans also have doubts. The training dataset (BgRenData), the annotated dataset subsets, and the two detectors are made publicly accessible on Zenodo, GitHub, and HuggingFace. We expect that these new resources will be of invaluable assistance to 1) Bulgarian-language researchers and 2) researchers of other languages with similar phenomena, especially those working on verifying information.
2024
SM-FEEL-BG - the First Bulgarian Datasets and Classifiers for Detecting Feelings, Emotions, and Sentiments of Bulgarian Social Media Text
Irina Temnikova | Iva Marinova | Silvia Gargova | Ruslana Margova | Alexander Komarov | Tsvetelina Stefanova | Veneta Kireva | Dimana Vyatrova | Nevena Grigorova | Yordan Mandevski | Stefan Minkov
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This article introduces SM-FEEL-BG – the first Bulgarian-language package containing 6 datasets of Social Media (SM) texts with emotion, feeling, and sentiment labels, together with 4 classifiers trained on them. All but one of these datasets are freely accessible for research purposes. The largest dataset contains 6000 Twitter, Telegram, and Facebook texts, manually annotated with 21 fine-grained emotion/feeling categories. The fine-grained labels are automatically merged into three coarse-grained sentiment categories, producing a dataset with two parallel sets of labels. Several classification experiments are run on different subsets of the fine-grained categories and their respective sentiment labels with a Bulgarian fine-tuned BERT. The highest accuracy (Acc.) reached was 0.61 for 16 emotions and 0.70 for 11 emotions (incl. 310 ChatGPT 4-generated texts). The sentiment Acc. on the 11-emotion dataset was also the highest (0.79). As Facebook posts cannot be shared, we ran experiments on the Twitter and Telegram subset of the 11-emotion dataset, obtaining 0.73 Acc. for emotions and 0.80 for sentiments. The article describes the annotation procedures, guidelines, experiments, and results. We believe that this package will be of significant benefit to researchers working on emotion detection and sentiment analysis in Bulgarian.