Question answering (QA) is one of the most common NLP tasks and is closely related to named entity recognition, fact extraction, semantic search, and other fields. In industry, it is highly valued in chatbots and corporate information systems. It is also a challenging task that attracted the attention of a broad audience through the quiz show Jeopardy! In this article, we describe a Jeopardy!-like Russian QA data set collected from the official Russian quiz database Ch-g-k. The data set includes 379,284 quiz-like questions, 29,375 of which come from the Russian analogue of Jeopardy! (Own Game). We examine its linguistic features and the related QA task, and we conclude with the prospects for a QA challenge based on the collected data set.
The article describes a fast solution to propaganda detection at SemEval-2020 Task 11, based on feature adjustment. We use per-token vectorization of features and a simple Logistic Regression classifier to quickly test different hypotheses about our data. We arrive at what seems to us the best solution; however, we are unable to align it with the result of the metric suggested by the organizers of the task. We test how our system handles class and feature imbalance by varying the number of samples of the two classes (Propaganda and None) in the training set, the size of the context window in which a token is vectorized, and the combination of vectorization methods. The result of our system at SemEval-2020 Task 11 is F-score=0.37.
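The two knobs named in this abstract, the context window size and the number of samples per class, can be illustrated with a minimal sketch. It is not the system's actual code: the function names `window_features` and `undersample`, the feature naming scheme, and the toy sentence with its labels are all assumptions made for illustration. The resulting feature dictionaries are the kind of input a Logistic Regression classifier could be trained on after vectorization.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, int], str]


def window_features(tokens: List[str], index: int, window: int) -> Dict[str, int]:
    """Describe the token at `index` by the lowercased words around it.

    Feature names encode the relative position, e.g. "w[-1]=the".
    Positions that fall outside the sentence are skipped.
    """
    features: Dict[str, int] = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            features[f"w[{offset}]={tokens[position].lower()}"] = 1
    return features


def undersample(
    samples: List[Sample], majority_label: str, keep: int, seed: int = 0
) -> List[Sample]:
    """Keep at most `keep` samples of the majority class and all other samples."""
    rng = random.Random(seed)
    majority = [s for s in samples if s[1] == majority_label]
    others = [s for s in samples if s[1] != majority_label]
    if len(majority) > keep:
        majority = rng.sample(majority, keep)
    return others + majority


# Toy sentence; the labels are invented for illustration only.
tokens = ["They", "are", "the", "enemies", "of", "freedom"]
labels = ["None", "None", "None", "Propaganda", "Propaganda", "Propaganda"]

samples = [
    (window_features(tokens, i, window=1), labels[i]) for i in range(len(tokens))
]
balanced = undersample(samples, majority_label="None", keep=2)
```

Changing `window` widens or narrows the context each token sees, and changing `keep` shifts the ratio between the two classes, which is one simple way to run the kind of comparison the abstract describes.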
Text-processing algorithms that annotate the main components of a storyline are presently in great need of corpora and widely agreed annotation schemes. The Text World Theory of cognitive linguistics offers a model that generalizes narrative structure in the form of world-building elements (characters, time, and space), the text worlds themselves, and the switches between them. We conducted a survey of how text worlds and their elements are annotated in different projects and proposed our own annotation scheme and instructions. We first tested them on the science fiction story “We Can Remember It for You Wholesale” by Philip K. Dick. We then corrected the guidelines, added automatic annotation of verb forms in order to achieve higher inter-rater agreement, and tested them again on the short story “The Gift of the Magi” by O. Henry. As a result, the agreement among the three raters increased. With due revision and testing, our annotation scheme and guidelines can be used for annotating narratives in corpora of literary texts, criminal evidence, teaching materials, quests, etc.
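The abstract does not say which agreement measure was used. As one possible illustration, the sketch below computes Fleiss' kappa, a common choice when a fixed number of raters (here three) assign one label to each item. The label names and the toy ratings are invented for illustration and are not taken from the annotation study.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that were each labelled by the same number of raters.

    `ratings[i]` holds the labels given to item i, one per rater (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Per-item agreement: share of rater pairs that chose the same label.
    per_item = [
        (sum(c * c for c in count.values()) - n_raters) / (n_raters * (n_raters - 1))
        for count in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement derived from the overall label proportions.
    total = n_items * n_raters
    expected = sum(
        (sum(count[cat] for count in counts) / total) ** 2 for cat in categories
    )
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivial
    return (observed - expected) / (1 - expected)


# Toy ratings: three raters decide whether a clause switches the text world.
toy_ratings = [
    ["switch", "switch", "switch"],
    ["same", "same", "same"],
    ["switch", "same", "switch"],
    ["same", "same", "same"],
]
kappa = fleiss_kappa(toy_ratings)
```

Computing such a score once per annotation round gives a single number with which to compare the agreement before and after the guidelines are revised.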
The paper describes our search for a universal algorithm for detecting intentional lexical ambiguity in different forms of creative language. At SemEval-2018 Task 3, we used PunFields, the system for automatic analysis of English puns that we introduced at SemEval-2017, to detect irony in tweets. Preliminary tests showed that it could reach a score of F1=0.596. However, its result at the competition was F1=0.549.
The article describes a model for the automatic interpretation of English puns, based on Roget’s Thesaurus, and its implementation, PunFields. In a pun, the algorithm discovers two groups of words that belong to two main semantic fields. The fields are converted into a semantic vector, on which an SVM classifier is trained to recognize puns. A rule-based model is then applied to recognize intentionally ambiguous (target) words and their definitions. In SemEval Task 7, PunFields shows a fairly good result in pun classification but requires improvement in finding the target word and its definition.
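The idea of grouping the words of a pun by semantic field and turning the groups into a vector can be sketched as follows. This is a simplified illustration, not the PunFields implementation: the mini-thesaurus `TOY_THESAURUS`, its field names, and the helper functions are hypothetical stand-ins for Roget's Thesaurus categories, and the SVM training step is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical mini-thesaurus: word -> semantic fields it may belong to.
# The field names are invented and are not Roget's actual categories.
TOY_THESAURUS: Dict[str, Set[str]] = {
    "banker": {"finance"},
    "interest": {"finance", "attention"},
    "lost": {"attention"},
    "river": {"water"},
}


def group_by_field(
    words: List[str], thesaurus: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Collect, for every semantic field, the words of the sentence found in it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        for field in sorted(thesaurus.get(word.lower(), set())):
            groups[field].append(word.lower())
    return dict(groups)


def top_two_fields(groups: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Return the two fields with the most words (ties broken alphabetically)."""
    ranked = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return ranked[:2]


def field_vector(groups: Dict[str, List[str]], all_fields: List[str]) -> List[int]:
    """Count words per field in a fixed field order, as input for a classifier."""
    return [len(groups.get(field, [])) for field in all_fields]


sentence = "I used to be a banker but I lost interest".split()
all_fields = sorted({f for fields in TOY_THESAURUS.values() for f in fields})

groups = group_by_field(sentence, TOY_THESAURUS)
main_fields = top_two_fields(groups)
vector = field_vector(groups, all_fields)
```

In this toy sentence, "interest" falls into both of the top fields, which is the kind of overlap that a later rule-based step could use when looking for the target word.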