Dina Pisarevskaya

2022

WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia
Dina Pisarevskaya | Tatiana Shavrina
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

The general QA field has developed its methodology with the Stanford Question Answering Dataset (SQuAD) as the main reference benchmark. Compiling datasets of factual questions requires manual annotation, which limits the potential size of the training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generation and filtration pipeline. To ensure high quality of the generated QA pairs, diverse manual and automated evaluation techniques were applied. The WikiOmnia pipeline is available open-source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
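
The abstract mentions SQuAD-formatted QA pairs with their paragraphs. The sketch below shows what one such record could look like in SQuAD v1.1-style nesting. The helper name `to_squad_record`, the sample text, and the "answer must occur verbatim in the context" check are illustrative assumptions, not the published WikiOmnia filtration rules.

```python
import json


def to_squad_record(title, context, question, answer, qa_id):
    """Wrap one generated QA pair in SQuAD v1.1-style nesting.

    Returns None when the answer is not found verbatim in the context.
    This check is an assumed, simplified filter for illustration only.
    """
    start = context.find(answer)
    if start == -1:
        return None
    return {
        "title": title,
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": qa_id,
                "question": question,
                "answers": [{"text": answer, "answer_start": start}],
            }],
        }],
    }


# Hypothetical sample data, not taken from the dataset.
context = "Moscow is the capital of Russia."
record = to_squad_record(
    "Moscow", context, "What is the capital of Russia?", "Moscow", "q1"
)
dataset = {"version": "1.1", "data": [record]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Storing the character offset in `answer_start` lets an extractive QA model recover the answer span directly from the paragraph text.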

Team dina at SemEval-2022 Task 8: Pre-trained Language Models as Baselines for Semantic Similarity
Dina Pisarevskaya | Arkaitz Zubiaga
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes the participation of the team “dina” in the Multilingual News Similarity task at SemEval 2022. To build our system for the task, we experimented with several multilingual language models which were originally pre-trained for semantic similarity but were not further fine-tuned. We use these models in combination with state-of-the-art packages for machine translation and named entity recognition, with the expectation of providing valuable input to the model. Our work assesses the applicability of such “pure” models to the multilingual semantic similarity task in the case of news articles. Our best model achieved a score of 0.511, which shows that there is room for improvement.

2020

Fake news detection for the Russian language
Gleb Kuzmin | Daniil Larionov | Dina Pisarevskaya | Ivan Smirnov
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

In this paper, we trained and compared different models for fake news detection in Russian. For this task, we used language features such as bag-of-n-grams, bag of Rhetorical Structure Theory features, and BERT embeddings. We also compared the scores of our models with the human score on this task and showed that our models detect fake news better. We investigated the nature of fake news by dividing it into two non-overlapping classes: satire and fake news. As a result, we obtained a set of models for fake news detection; the best of these models achieved a 0.889 F1-score on the test set for the 2-class task and a 0.9076 F1-score on the 3-class task.
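
As a rough illustration of the bag-of-n-grams features named in the abstract, the sketch below counts word unigrams and bigrams in a text. Whitespace tokenization, lowercasing, the `n_max` parameter, and the English sample sentence are simplifying assumptions; the paper's actual preprocessing for Russian is not described here.

```python
from collections import Counter


def bag_of_ngrams(text, n_max=2):
    """Count all word n-grams of length 1..n_max in a text.

    Tokenization is a plain lowercase whitespace split, an assumption
    made to keep the sketch self-contained.
    """
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


# Hypothetical sample sentence.
features = bag_of_ngrams("the news is fake news")
print(features.most_common(3))
```

The resulting counts can be turned into a fixed-length vector by fixing a vocabulary of n-grams observed in the training set.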

2019

Towards the Data-driven System for Rhetorical Parsing of Russian Texts
Artem Shelmanov | Dina Pisarevskaya | Elena Chistova | Svetlana Toldova | Maria Kobozeva | Ivan Smirnov
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

Results of the first experimental evaluation of machine learning models trained on Ru-RSTreebank, the first Russian corpus annotated within the RST framework, are presented. Various lexical, quantitative, morphological, and semantic features were used. In rhetorical relation classification, an ensemble of a CatBoost model with selected features and a linear SVM model provides the best score (macro F1 = 54.67 ± 0.38). We find that most of the important features for rhetorical relation classification are related to discourse connectives derived from the connectives lexicon for Russian and from other sources.

2017

Deception Detection in News Reports in the Russian Language: Lexics and Discourse
Dina Pisarevskaya
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

News verification and automated fact checking are increasingly important issues. This research is preliminary. We collected a corpus for Russian (174 news reports, both truthful and fake). We ran two experiments; in both, we applied SVMs (linear/rbf kernel) and Random Forest to classify the news reports into 2 classes: truthful/deceptive. In the first experiment, we used 18 lexical-level markers, mostly frequencies of POS tags in texts. In the second experiment, at the discourse level, we used frequencies of rhetorical relation types in texts. The classification task in the first experiment is solved better by SVMs with the rbf kernel (f-measure 0.65). The model based on RST features shows its best results with the Random Forest classifier (f-measure 0.54) and should be modified. In future research, different deception detection markers for the Russian language should be combined to build a better predictive model.
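
The discourse-level features in the second experiment are frequencies of rhetorical relation types per text. A minimal sketch of turning a list of relation labels into such a feature vector follows. The label inventory, the sample labels, and the normalization by the total number of relations are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter


def relation_frequencies(relations, inventory):
    """Map a text's rhetorical relation labels to relative frequencies.

    The output order follows `inventory`, so every text gets a vector
    of the same length. Dividing by the total count is an assumption.
    """
    counts = Counter(relations)
    total = len(relations)
    if total == 0:
        return [0.0] * len(inventory)
    return [counts[label] / total for label in inventory]


# Hypothetical inventory and per-text annotation.
INVENTORY = ["elaboration", "contrast", "cause"]
vec = relation_frequencies(
    ["elaboration", "cause", "elaboration", "contrast"], INVENTORY
)
print(vec)
```

Vectors of this shape, one per news report, are the kind of input a classifier such as an SVM or Random Forest can be trained on.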

Deception Detection for the Russian Language: Lexical and Syntactic Parameters
Dina Pisarevskaya | Tatiana Litvinova | Olga Litvinova
Proceedings of the 1st Workshop on Natural Language Processing and Information Retrieval associated with RANLP 2017

The field of automated deception detection in written texts is methodologically challenging. Different linguistic levels (lexics, syntax and semantics) are typically used for different types of English texts to reveal whether they are truthful or deceptive. Parameters such as POS tags and POS tag n-grams, punctuation marks, sentiment polarity of words, psycholinguistic features, and fragments of syntactic structures are taken into consideration. The importance of different types of parameters has not been compared for the Russian language before and should be investigated before moving to complex models and higher levels of linguistic processing. Using the Russian Deception Bank Corpus as an example, we estimate the impact of three groups of features (POS features including bigrams, sentiment and psycholinguistic features, syntax and readability features) on successful deception detection and find that POS features can be used for binary text classification, but the results should be double-checked and, if possible, improved.