Salar Mohtaj

2022

MuLVE, A Multi-Language Vocabulary Evaluation Data Set
Anik Jacobsen | Salar Mohtaj | Sebastian Möller
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Vocabulary learning is vital to foreign language learning. Correct and adequate feedback is essential to successful and satisfying vocabulary training. However, many vocabulary and language evaluation systems operate on simple rules and do not account for real-life user learning data. This work introduces the Multi-Language Vocabulary Evaluation Data Set (MuLVE), a data set consisting of vocabulary cards and real-life user answers, each labeled to indicate whether the user answer is correct or incorrect. The data source is user learning data from the Phase6 vocabulary trainer. The data set contains vocabulary questions in German, with English, Spanish, and French as target languages, and is available in four variations that differ in pre-processing and deduplication. We experiment with fine-tuning pre-trained BERT language models on the downstream task of vocabulary evaluation using the proposed MuLVE data set. The fine-tuned models reach accuracy and F2-score above 95.5. The data set is available on the European Language Grid.
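
The accuracy and F2-score mentioned in the abstract can be computed from confusion counts. The following is a minimal sketch, not the authors' evaluation code; the function names and the counts are hypothetical and chosen only for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion counts.

    With beta = 2 (the F2-score), recall is weighted more heavily
    than precision.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all answers that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts, NOT taken from the paper.
tp, fp, fn, tn = 90, 5, 10, 95
print("F2-score:", round(f_beta(tp, fp, fn), 4))
print("Accuracy:", round(accuracy(tp, fp, fn, tn), 4))
```

Setting `beta=1.0` in this sketch gives the familiar F1-score, so the same helper covers both metrics.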

PerPaDa: A Persian Paraphrase Dataset based on Implicit Crowdsourcing Data Collection
Salar Mohtaj | Fatemeh Tavakkoli | Habibollah Asghari
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we introduce PerPaDa, a Persian paraphrase dataset collected from users’ input in a plagiarism detection system. As an implicit crowdsourcing experience, we have gathered a large collection of original and paraphrased sentences from Hamtajoo, a Persian plagiarism detection system in which users try to conceal cases of text re-use in their documents by paraphrasing and re-submitting manuscripts for analysis. The compiled dataset contains 2446 instances of paraphrasing. To improve the overall quality of the collected data, heuristics have been used to exclude sentences that don’t meet the proposed criteria. The introduced corpus is much larger than the available datasets for the task of paraphrase identification in Persian. Moreover, there is less bias in the data compared to similar datasets, since the users did not follow fixed predefined rules to generate texts similar to their original inputs.

TUB at WANLP22 Shared Task: Using Semantic Similarity for Propaganda Detection in Arabic
Salar Mohtaj | Sebastian Möller
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Propaganda and the spreading of fake news through social media have become a serious problem in recent years. In this paper we present our approach to the shared task on propaganda detection in Arabic, in which the goal is to identify propaganda techniques in Arabic social media text. We propose a semantic similarity detection model that compares each text in the test set with the sentences in the train set to find the most similar instances. The label of the target text is obtained from the most similar texts in the train set. The proposed model obtained a micro F1 score of 0.494 on the test data set.
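
The idea of labeling a text from its most similar training instances can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: it substitutes bag-of-words cosine similarity for whatever semantic similarity model the paper uses, and the function names, example sentences, and label names are all hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a semantic representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_labels(text: str, train: list, k: int = 1) -> set:
    """Union of the label sets of the k most similar training texts."""
    query = bow(text)
    ranked = sorted(train, key=lambda ex: cosine(query, bow(ex[0])), reverse=True)
    labels = set()
    for _, example_labels in ranked[:k]:
        labels |= set(example_labels)
    return labels


# Hypothetical training data: (text, set of technique labels).
train = [
    ("they are all traitors and liars", {"name_calling"}),
    ("act now or lose everything forever", {"appeal_to_fear"}),
    ("the weather is mild today", set()),
]
print(predict_labels("those liars are traitors", train))
```

Taking the union over the top `k` neighbors is one simple way to handle texts with several techniques; a voting threshold would be an equally plausible design.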

Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text
Sebastian Möller | Salar Mohtaj | Babak Naderi
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

Overview of the GermEval 2022 Shared Task on Text Complexity Assessment of German Text
Salar Mohtaj | Babak Naderi | Sebastian Möller
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

In this paper we present the GermEval 2022 shared task on Text Complexity Assessment of German text. Text forms an integral part of exchanging information and interacting with the world, correlating with quality and experience of life. Text complexity is one of the factors that affect a reader’s understanding of a text. The mapping of a body of text to a mathematical unit quantifying the degree of readability is the basis of complexity assessment. As readability might be influenced by representation, we only target the text complexity for readers in this task. We designed the task as text regression, in which participants developed models to predict the complexity of pieces of text for a German learner on a scale from 1 to 7. The shared task was organized in two phases: a development phase and a test phase. Of the 24 participants who registered for the shared task, ten teams submitted their results on the test data.

2020

Claim extraction from text using transfer learning.
Acharya Ashish Prabhakar | Salar Mohtaj | Sebastian Möller
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Building an end-to-end fake news detection system consists of detecting claims in text and later verifying their authenticity. Although most recent work has focused on political claims, fake news can also be propagated in the form of religious intolerance, conspiracy theories, etc. Since there is a lack of training data specific to all these scenarios, we compiled a homogeneous and balanced dataset by combining some of the currently available data. Moreover, the paper shows how recent advancements in transfer learning can be leveraged to detect claims in general. The obtained results show that recently developed transformers can shift the focus of research from claim detection to the problem of check-worthiness of claims in domains of interest.