Rafal Rzepka


2022

Creation of Polish Online News Corpus for Political Polarization Studies
Joanna Szwoch | Mateusz Staszkow | Rafal Rzepka | Kenji Araki
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

In this paper we describe a Polish news corpus as an attempt to create a filtered, organized and representative set of texts coming from contemporary online press articles from two major Polish TV news providers: commercial TVN24 and state-owned TVP Info. The process consists of web scraping, data cleaning and formatting. A random sample was selected from the prepared data to perform a classification task. The random forest achieved the best prediction results out of all considered models. We believe that this dataset is a valuable contribution to existing Polish language corpora, as online news is considered to be formal and relatively mistake-free and is therefore a reliable source of correct written language, unlike other online platforms such as blogs or social media. Furthermore, to our knowledge, such a corpus from this period of time has not been created before. In the future we would like to expand this dataset with articles coming from other online news providers and repeat the classification task on a larger scale, utilizing other algorithms. Our data analysis outcomes might be a relevant basis for improving research on political polarization and propaganda techniques in media.

2021

Tell Me What You Read: Automatic Expertise-Based Annotator Assignment for Text Annotation in Expert Domains
Hiyori Yoshikawa | Tomoya Iwakura | Kimi Kaneko | Hiroaki Yoshida | Yasutaka Kumano | Kazutaka Shimada | Rafal Rzepka | Patrycja Swieczkowska
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper investigates the effectiveness of automatic annotator assignment for text annotation in expert domains. In the task of creating high-quality annotated corpora, expert domains often cover multiple sub-domains (e.g. organic and inorganic chemistry in the chemistry domain) either explicitly or implicitly. Therefore, it is crucial to assign annotators to documents relevant to their fine-grained domain expertise. However, most existing methods for crowdsourcing estimate the reliability of each annotator or annotated instance only after the annotation process. To address the issue, we propose a method to estimate the domain expertise of each annotator before the annotation process using information easily available from the annotators beforehand. We propose two measures to estimate the annotator expertise: an explicit measure using the predefined categories of sub-domains, and an implicit measure using distributed representations of the documents. The experimental results on chemical name annotation tasks show that the annotation accuracy improves when both explicit and implicit measures for annotator assignment are combined.

2020

Can Existing Methods Debias Languages Other than English? First Attempt to Analyze and Mitigate Japanese Word Embeddings
Masashi Takeshita | Yuki Katsumata | Rafal Rzepka | Kenji Araki
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

It is known that word embeddings exhibit biases inherited from the corpus, and those biases reflect social stereotypes. Recently, many studies have been conducted to analyze and mitigate biases in word embeddings. Unsupervised Bias Enumeration (UBE) (Swinger et al., 2019) is one approach to analyzing biases for English, and Hard Debias (Bolukbasi et al., 2016) is the common technique to mitigate gender bias. These methods focused on English, or, to a lesser extent, on Indo-European languages. However, it is not clear whether these methods can be generalized to other languages. In this paper, we apply these analysis and mitigation methods, UBE and Hard Debias, to Japanese word embeddings. Additionally, we examine whether these methods can be used for Japanese. We experimentally show that UBE and Hard Debias cannot be sufficiently adapted to Japanese embeddings.

2016

Automatic Evaluation of Commonsense Knowledge for Refining Japanese ConceptNet
Seiya Shudo | Rafal Rzepka | Kenji Araki
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

In this paper we present two methods for automatic common sense knowledge evaluation for Japanese entries in the ConceptNet ontology. Our proposed methods utilize a text-mining approach: one with relation clue words and WordNet synonyms, and one without. Both methods were tested with a blog corpus. The system based on our proposed methods reached a relatively high precision score for three relations (MadeOf, UsedFor, AtLocation), which is comparable with previous research using commercial search engines and simpler input. We analyze errors, discuss problems of common sense evaluation, both manual and automatic, and propose ideas for further improvements.

2014

Emotive or Non-emotive: That is The Question
Michal Ptaszynski | Fumito Masui | Rafal Rzepka | Kenji Araki
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2013

Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization
Taisei Nitta | Fumito Masui | Michal Ptaszynski | Yasutomo Kimura | Rafal Rzepka | Kenji Araki
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Automatically Annotating A Five-Billion-Word Corpus of Japanese Blogs for Affect and Sentiment Analysis
Michal Ptaszynski | Rafal Rzepka | Kenji Araki | Yoshio Momouchi
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

2008

A Casual Conversation System Using Modality and Word Associations Retrieved from the Web
Shinsuke Higuchi | Rafal Rzepka | Kenji Araki
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing