2020
Fake news detection for the Russian language
Gleb Kuzmin | Daniil Larionov | Dina Pisarevskaya | Ivan Smirnov
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)
In this paper, we trained and compared different models for fake news detection in Russian. For this task, we used language features such as bag-of-n-grams, a bag of Rhetorical Structure Theory features, and BERT embeddings. We also compared the scores of our models with the human score on this task and showed that our models detect fake news better. We investigated the nature of fake news by dividing it into two non-overlapping classes: satire and fake news. As a result, we obtained a set of models for fake news detection; the best of these models achieved an F1-score of 0.889 on the test set for the 2-class task and 0.9076 for the 3-class task.
2019
Towards the Data-driven System for Rhetorical Parsing of Russian Texts
Artem Shelmanov | Dina Pisarevskaya | Elena Chistova | Svetlana Toldova | Maria Kobozeva | Ivan Smirnov
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
Results of the first experimental evaluation of machine learning models trained on Ru-RSTreebank, the first Russian corpus annotated within the RST framework, are presented. Various lexical, quantitative, morphological, and semantic features were used. In rhetorical relation classification, an ensemble of a CatBoost model with selected features and a linear SVM model provides the best score (macro F1 = 54.67 ± 0.38). We find that most of the important features for rhetorical relation classification are related to discourse connectives derived from the connectives lexicon for Russian and from other sources.
2017
Rhetorical relations markers in Russian RST Treebank
Svetlana Toldova | Dina Pisarevskaya | Margarita Ananyeva | Maria Kobozeva | Alexander Nasedkin | Sofia Nikiforova | Irina Pavlova | Alexey Shelepov
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms
Deception Detection in News Reports in the Russian Language: Lexics and Discourse
Dina Pisarevskaya
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism
News verification and automated fact checking are important issues today. This research is preliminary. We collected a corpus for Russian (174 news reports, both truthful and fake). We ran two experiments; in both, we applied SVMs (linear and RBF kernels) and Random Forest to classify the news reports into 2 classes: truthful or deceptive. In the first experiment, we used 18 lexical-level markers, mostly frequencies of POS tags in the texts. In the second experiment, at the discourse level, we used frequencies of rhetorical relation types in the texts. The classification task in the first experiment is solved better by SVMs with an RBF kernel (F-measure 0.65). The model based on RST features shows its best results with the Random Forest classifier (F-measure 0.54) and should be modified. In future research, different deception detection markers for the Russian language should be combined in order to build a better predictive model.
Deception Detection for the Russian Language: Lexical and Syntactic Parameters
Dina Pisarevskaya | Tatiana Litvinova | Olga Litvinova
Proceedings of the 1st Workshop on Natural Language Processing and Information Retrieval associated with RANLP 2017
The field of automated deception detection in written texts is methodologically challenging. Different linguistic levels (lexics, syntax, and semantics) are mainly used for different types of English texts to reveal whether they are truthful or deceptive. Parameters such as POS tags and POS tag n-grams, punctuation marks, sentiment polarity of words, psycholinguistic features, and fragments of syntactic structures are taken into consideration. The importance of different types of parameters has not previously been compared for the Russian language and should be investigated before moving to complex models and higher levels of linguistic processing. Using the Russian Deception Bank Corpus as an example, we estimate the impact of three groups of features (POS features including bigrams, sentiment and psycholinguistic features, syntax and readability features) on successful deception detection and find that POS features can be used for binary text classification, but the results should be double-checked and, if possible, improved.