Victoria Bobicev

2018

Thumbs Up and Down: Sentiment Analysis of Medical Online Forums
Victoria Bobicev | Marina Sokolova
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

In the current study, we apply multi-class and multi-label sentence classification to sentiment analysis of online medical forums. We aim to identify the major health issues discussed in online social media and the types of sentiment those issues evoke. We use an ontology of personal health information for Information Extraction and apply Machine Learning methods to the automated recognition of the expressed sentiments.

Using PPM for Health Related Text Detection
Victoria Bobicev | Victoria Lazu | Daniela Istrati
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper describes the participation of the LILU team in the SMM4H challenge on mining social media for descriptions of health-related events, such as drug intake or vaccination.

2017

Inter-Annotator Agreement in Sentiment Analysis: Machine Learning Perspective
Victoria Bobicev | Marina Sokolova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Manual text annotation is an essential part of Big Text analytics. Although annotators work with limited parts of data sets, their results are extrapolated by automated text classification and affect the final classification results. Reliability of annotations and adequacy of assigned labels are especially important in the case of sentiment annotations. In the current study we examine inter-annotator agreement in multi-class, multi-label sentiment annotation of messages. We used several annotation agreement measures, as well as statistical analysis and Machine Learning to assess the resulting annotations.
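
The abstract above does not name the agreement measures that were used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two annotators; the function name `cohen_kappa` and the single-label-per-message setting are assumptions made here, not details taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Illustrative assumption: each item carries exactly one label per
    annotator (the paper itself deals with multi-label annotation).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled at random,
    # each following their own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement, and a value near 0 means agreement no better than chance.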

Good News vs. Bad News: What are they talking about?
Olga Kanishcheva | Victoria Bobicev
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Today’s massive news streams demand automated analysis, which is provided by various online news explorers. However, most of them do not provide sentiment analysis. The main problem in sentiment analysis of news is the difference between the writers’ and the readers’ attitudes to the news text. News can be good or bad, but it has to be delivered in neutral words as pure facts. Although there are applications for sentiment analysis of news, news analysis remains a highly relevant problem because the latest news affects people’s lives daily. In this paper, we explore the problem of sentiment analysis for Ukrainian and Russian news. We developed a corpus of Ukrainian and Russian news and annotated each text with one of three categories: positive, negative, or neutral. Each text was marked by at least three independent annotators via a web interface; the inter-annotator agreement was analyzed and the final label for each text was computed. These texts were used in the machine learning experiments. Further, we investigated which kinds of named entities, such as Locations, Organizations, and Persons, are perceived as good or bad by readers and which of them caused ambiguity in text annotation.
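
The abstract above does not say how the final label was computed from the individual annotations. One plausible scheme is a majority vote, sketched below under that assumption; the function name `final_label` and the choice to fall back to "neutral" on ties are hypothetical.

```python
from collections import Counter


def final_label(annotations, tie_label="neutral"):
    """Majority vote over one text's annotations.

    Assumption for illustration: when the two most frequent labels
    are tied, the text receives `tie_label`.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    ranked = Counter(annotations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

Texts that end in a tie are natural candidates for the ambiguity analysis the abstract mentions.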

Syntactic Semantic Correspondence in Dependency Grammar
Cătălina Mărănduc | Cătălin Mititelu | Victoria Bobicev
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language
Victoria Bobicev | Cătălina Mărănduc | Cenel Augusto Perez
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those that exist or are in progress deal only with the contemporary Romanian standard. However, the need to study the dynamics of natural languages has given rise to balanced corpora that contain non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a large balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus of Old Romanian. We also intend to include the South-Danube dialects, which are remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, together with the mutual understanding between speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.

2016

Automatic Detection of Arabicized Berber and Arabic Varieties
Wafia Adouane | Nasredine Semmar | Richard Johansson | Victoria Bobicev
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the necessary first step for any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, many languages are not yet processed automatically, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of pronunciation-based written forms for these spoken languages. Such languages are not well represented on the Web, are commonly referred to as under-resourced languages, and are not properly recognized by the currently available ALI tools. In this paper, we revisit the problem of ALI with a focus on short texts in Arabicized Berber and dialectal Arabic. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and for distinguishing between them, giving a macro-average F-score of 92.94%.
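
To make the character n-gram idea mentioned in the abstract above concrete, here is a minimal sketch of a trigram language identifier with add-one smoothing. It illustrates the general baseline technique only and is not the paper's system, which combines machine learning models with lexicons; all names (`char_ngrams`, `train`, `identify`) are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language.

    `samples` maps a language name to a list of training strings.
    """
    return {
        lang: Counter(g for text in texts for g in char_ngrams(text, n))
        for lang, texts in samples.items()
    }


def identify(text, models, n=3):
    """Return the language whose model gives `text` the highest log-likelihood."""
    # Shared vocabulary size, plus one slot for n-grams never seen in training.
    vocab = len(set().union(*models.values())) + 1
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[g] + 1) / (total + vocab))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

Models of this kind need substantial training text per language, which is exactly what under-resourced varieties lack; that gap is the motivation the abstract gives for adding lexicons.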

The paper describes a method for estimating the phonosemantics of words. We treat phonosemantics as the subconscious emotional perception of a word's sound, independent of the word's meaning. The method is based on data about the emotional perception of sounds, collected from a number of respondents. A program estimates the emotional characteristics of words using these data. The program's output was compared with human judgment. The experiments showed that, in most cases, the computer's description of a word based on phonosemantic calculations is similar to our own impressions of how the word sounds. On the other hand, word meaning dominates the emotional perception of a word, and the phonosemantic component comes to the fore only for words whose meaning is unknown.
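
The abstract above does not specify how per-sound ratings are combined into a word-level estimate. The sketch below assumes the simplest possible combination, an unweighted average over the characters of the word; the function name `word_score`, the use of letters as a stand-in for sounds, and the rating table passed in are all hypothetical, and no real respondent data is shown.

```python
def word_score(word, sound_scores):
    """Average the respondent-derived ratings of the sounds in `word`.

    Assumptions for illustration: each character stands for one sound,
    `sound_scores` maps a character to its rating on one emotional
    scale, and characters without a rating are skipped.
    Returns None when no character of the word has a rating.
    """
    known = [sound_scores[ch] for ch in word.lower() if ch in sound_scores]
    if not known:
        return None
    return sum(known) / len(known)
```

A fuller model would likely weight sounds by position or frequency, but nothing in the abstract confirms such details.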