Anna Glazkova


2025

BERT-like Models for Slavic Morpheme Segmentation
Dmitry Morozov | Lizaveta Astapenka | Anna Glazkova | Timur Garipov | Olga Lyashevskaya
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic morpheme segmentation algorithms are useful in a variety of applications, such as building tokenizers and supporting language education. For Slavic languages, the development of such algorithms is complicated by their rich derivational capabilities. Previous research has shown that, on average, these algorithms have already reached expert-level quality. However, a key unresolved issue is the significant decline in performance when segmenting words containing roots not present in the training data. This problem can be partially addressed by using pre-trained language models to better account for word semantics. In this work, we explored the possibility of fine-tuning BERT-like models for morpheme segmentation using data from Belarusian, Czech, and Russian. We found that for Czech and Russian, our models outperform all previously proposed approaches, achieving word-level accuracy of 92.5-95.1%. For Belarusian, this task was addressed for the first time. The best-performing approach for Belarusian was an ensemble of convolutional neural networks with a word-level accuracy of 90.45%.
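
As a rough illustration of this kind of formulation (not the paper's exact setup), morpheme segmentation can be cast as character-level tagging with a BERT-like encoder. The checkpoint and the B/I label scheme below are assumptions, and the classification head here is untrained:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["B", "I"]  # hypothetical scheme: "B" marks the first character of a morph

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS)
)

word = "переписать"  # Russian "to rewrite": пере-пис-а-ть
# Treat each character as a separate "word" so every character receives a label.
enc = tokenizer(list(word), is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, num_labels); head is untrained here

pred = logits.argmax(-1)[0]
word_ids = enc.word_ids()  # maps each subtoken back to its character index
# Characters whose predicted label is "B" start a new morph.
starts = [wid for i, wid in enumerate(word_ids)
          if wid is not None and LABELS[pred[i].item()] == "B"]
print(starts)  # character indices predicted as morph beginnings
```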

Rubic2: Ensemble Model for Russian Lemmatization
Ilia Afanasev | Anna Glazkova | Olga Lyashevskaya | Dmitry Morozov | Ivan Smal | Natalia Vlasova
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

Pre-trained language models have significantly advanced natural language processing (NLP), particularly the analysis of languages with complex morphological structures. This study addresses lemmatization for Russian, where errors can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with existing solutions yields performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines a generative BART-base model, fine-tuned on a manually annotated dataset of 2.1 million tokens, with the neural model Rubic, which is currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian, offering superior results across various text domains and contributing to advancements in NLP applications.
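
A minimal sketch of the generative lemmatization framing: a seq2seq model maps an inflected form, plus optional context, to its lemma. The public facebook/bart-base checkpoint stands in for the fine-tuned Russian model, and the input format is an assumption, so the untrained model below will not produce real lemmas:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def lemmatize(token: str, context: str) -> str:
    # One possible input format: the target token followed by its sentence context.
    text = f"{token} | {context}"
    inputs = tokenizer(text, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# "книги" (books, pl.) in "Books were lying on the table"; expected lemma "книга".
print(lemmatize("книги", "На столе лежали книги ."))
```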

From Data to Grassroots Initiatives: Leveraging Transformer-Based Models for Detecting Green Practices in Social Media
Anna Glazkova | Olga Zakharova
Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)

Green practices are everyday activities that support a sustainable relationship between people and the environment. Detecting these practices in social media helps track their prevalence and develop recommendations to promote eco-friendly actions. This study compares machine learning methods for identifying mentions of green waste practices, framed as a multi-label text classification task. We focus on transformer-based models, which currently achieve state-of-the-art performance across various text classification tasks. Along with encoder-only models, we evaluate encoder-decoder and decoder-only architectures, including instruction-based large language models. Experiments on the GreenRu dataset, which consists of Russian social media texts, show that the mBART encoder-decoder model performs best. The findings of this study contribute to the advancement of natural language processing tools for ecological and environmental research, as well as the broader development of multi-label text classification methods in other domains.
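
A minimal sketch of the multi-label setup: one sigmoid output per practice, thresholded independently. The label subset and the generic multilingual encoder below are illustrative, not GreenRu's actual label set or the models evaluated in the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

PRACTICES = ["waste sorting", "reuse", "exchange"]  # hypothetical label subset

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(PRACTICES),
    problem_type="multi_label_classification",  # BCE loss, one logit per label
)

# "We handed in waste paper and exchanged books" (Russian social media style).
enc = tok("Сдали макулатуру и обменялись книгами", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)[0]  # head is untrained here

predicted = [p for p, prob in zip(PRACTICES, probs) if prob > 0.5]
print(predicted)  # labels whose probability clears the 0.5 threshold
```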

2023

Data Augmentation for Fake News Detection by Combining Seq2seq and NLI
Anna Glazkova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

State-of-the-art data augmentation methods help improve the generalization of deep learning models. However, these methods often generate examples that contradict the original texts, and thus fail to preserve the class labels. Label preservation is crucial for some natural language processing tasks, such as fake news detection. In this work, we combine sequence-to-sequence and natural language inference models for data augmentation in the fake news detection domain, using short news texts such as tweets and news titles. This approach allows us to generate new training examples that do not contradict facts from the original texts. We use the non-entailment probability for the pair of the original and generated texts as a loss function for a transformer-based sequence-to-sequence model. The proposed approach demonstrated its effectiveness on three fake news detection classification benchmarks in terms of macro F1-score and ROC AUC. Moreover, we showed that our approach retains the class label of the original text more accurately than other transformer-based methods.
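
As a hedged sketch of the core signal, an off-the-shelf NLI model can score how strongly a generated text is entailed by the original; 1 - P(entailment) is then the non-entailment probability. Using this quantity as a differentiable loss for the seq2seq model, as described above, requires additional machinery not shown here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def non_entailment(original: str, generated: str) -> float:
    enc = nli_tok(original, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**enc).logits, dim=-1)[0]
    # roberta-large-mnli label order: contradiction, neutral, entailment.
    return 1.0 - probs[2].item()

print(non_entailment(
    "The mayor announced a new budget on Monday.",
    "A new budget was announced by the mayor on Monday.",
))  # low value: the paraphrase is entailed, so it is safe to keep
```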

tmn at SemEval-2023 Task 9: Multilingual Tweet Intimacy Detection Using XLM-T, Google Translate, and Ensemble Learning
Anna Glazkova
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The paper describes a transformer-based system designed for SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. The purpose of the task was to predict the intimacy of tweets on a scale from 1 (not intimate at all) to 5 (very intimate). The official training set for the competition consisted of tweets in six languages (English, Spanish, Italian, Portuguese, French, and Chinese). The test set included these six languages as well as external data in four languages not present in the training set (Hindi, Arabic, Dutch, and Korean). We present a solution based on an ensemble of XLM-T, a multilingual RoBERTa model adapted to the Twitter domain. To improve performance on unseen languages, each tweet was supplemented with its English translation. We explored the effectiveness of translated data for the languages seen during fine-tuning compared to unseen languages, and evaluated strategies for using translated data in transformer-based models. Our solution ranked 4th on the leaderboard, achieving an overall Pearson’s r of 0.5989 on the test set. The proposed system improves Pearson’s r by up to 0.088 over the score averaged across all 45 submissions.
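
A minimal sketch of the single-output regression setup, with each tweet paired with its English translation as described above; the pair-encoding format and the public XLM-T checkpoint are assumptions about implementation details:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cardiffnlp/twitter-xlm-roberta-base"  # public XLM-T base checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1  # a single logit; with float labels, transformers trains it with MSE
)

tweet = "Te extraño muchísimo"          # Spanish: "I miss you so much"
translation = "I miss you so much"      # e.g. obtained via Google Translate
enc = tok(tweet, translation, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**enc).logits.squeeze().item()  # head is untrained here
print(score)  # after fine-tuning, this would approximate intimacy on the 1-5 scale
```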

2022

Detecting generated scientific papers using an ensemble of transformer models
Anna Glazkova | Maksim Glazkov
Proceedings of the Third Workshop on Scholarly Document Processing

The paper describes neural models developed for the DAGPap22 shared task hosted at the Third Workshop on Scholarly Document Processing. This shared task targets the automatic detection of generated scientific papers. Our work focuses on comparing different transformer-based models, as well as on using additional datasets and techniques to deal with imbalanced classes. As our final submission, we utilized an ensemble of SciBERT, RoBERTa, and DeBERTa fine-tuned with a random oversampling technique. Our model achieved an F1-score of 99.24%. The official evaluation results placed our system third.
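
A hedged sketch of the two ingredients named above: random oversampling of the minority class before fine-tuning, and probability averaging across several encoders. The public base checkpoints stand in for the actual fine-tuned systems, so the heads below are untrained:

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def oversample(texts, labels, minority=1):
    # Duplicate minority-class examples until both classes are the same size.
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    deficit = (len(labels) - len(minority_idx)) - len(minority_idx)
    idx = list(range(len(labels))) + random.choices(minority_idx, k=max(deficit, 0))
    return [texts[i] for i in idx], [labels[i] for i in idx]

texts, labels = oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
print(labels)  # balanced: three 0s and three 1s

NAMES = ["allenai/scibert_scivocab_uncased", "roberta-base", "microsoft/deberta-base"]

def ensemble_prob(text: str) -> float:
    # Average P(generated) over the three encoders; illustrative only when untrained.
    probs = []
    for name in NAMES:
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        enc = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs.append(torch.softmax(model(**enc).logits, dim=-1)[0, 1])
    return torch.stack(probs).mean().item()

print(ensemble_prob("We propose a novel method for knowledge distillation."))
```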

2021

MIPT-NSU-UTMN at SemEval-2021 Task 5: Ensembling Learning with Pre-trained Language Models for Toxic Spans Detection
Mikhail Kotyushev | Anna Glazkova | Dmitry Morozov
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our system for SemEval-2021 Task 5 on Toxic Spans Detection. We developed ensemble models using BERT-based neural architectures and post-processing to combine tokens into spans. We evaluated several pre-trained language models using various ensemble techniques for toxic span identification and achieved sizable improvements over our baseline fine-tuned BERT models. Finally, our system obtained an F1-score of 67.55% on the test data.
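
A minimal sketch of the kind of post-processing step described above: token-level toxicity decisions are merged into contiguous character spans, bridging small gaps between adjacent toxic tokens. The gap-bridging heuristic is an assumption, not the paper's exact rule:

```python
def tokens_to_spans(offsets, toxic_flags, max_gap=1):
    """offsets: (start, end) character offsets per token; toxic_flags: bools."""
    spans = []
    for (start, end), toxic in zip(offsets, toxic_flags):
        if not toxic:
            continue
        if spans and start - spans[-1][1] <= max_gap:
            spans[-1] = (spans[-1][0], end)  # extend the previous span
        else:
            spans.append((start, end))
    return spans

# Toy example: tokens "you", "absolute", "idiot" with the last two marked toxic.
print(tokens_to_spans([(0, 3), (4, 12), (13, 18)], [False, True, True]))
# -> [(4, 18)]
```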

2020

UTMN at SemEval-2020 Task 11: A Kitchen Solution to Automatic Propaganda Detection
Elena Mikhalkova | Nadezhda Ganzherli | Anna Glazkova | Yuliya Bidulya
Proceedings of the Fourteenth Workshop on Semantic Evaluation

The article describes a fast solution to propaganda detection at SemEval-2020 Task 11, based on feature adjustment. We use per-token vectorization of features and a simple Logistic Regression classifier to quickly test different hypotheses about our data. We come up with what seems to us the best solution; however, we are unable to align it with the result of the metric suggested by the task organizers. We test how our system handles class and feature imbalance by varying the number of samples of the two classes (Propaganda and None) in the training set, the size of the context window in which a token is vectorized, and the combination of vectorization methods. The result of our system at SemEval-2020 Task 11 is an F-score of 0.37.
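
A minimal sketch of the per-token setup, assuming bag-of-words features over a context window; the paper's exact feature set may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tokens = "our glorious leader has saved the nation again".split()
labels = [0, 1, 1, 0, 0, 0, 0, 0]  # toy per-token None/Propaganda labels

def window(i, size=2):
    # Join the tokens inside a +-size context window around position i.
    return " ".join(tokens[max(0, i - size): i + size + 1])

vec = CountVectorizer()
X = vec.fit_transform([window(i) for i in range(len(tokens))])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))  # per-token predictions on the toy training data
```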