Tatiana Batura


2022

pdf
Named Entity Inclusion in Abstractive Text Summarization
Sergey Berezin | Tatiana Batura
Proceedings of the Third Workshop on Scholarly Document Processing

We address the named entity omission - the drawback of many current abstractive text summarizers. We suggest a custom pretraining objective to enhance the model’s attention on the named entities in a text. At first, the named entity recognition model RoBERTa is trained to determine named entities in the text. After that this model is used to mask named entities in the text and the BART model is trained to reconstruct them. Next, BART model is fine-tuned on the summarization task. Our experiments showed that this pretraining approach drastically improves named entity inclusion precision and recall metrics.

pdf
Entity Linking over Nested Named Entities for Russian
Natalia Loukachevitch | Pavel Braslavski | Vladimir Ivanov | Tatiana Batura | Suresh Manandhar | Artem Shelmanov | Elena Tutubalina
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we describe entity linking annotation over nested named entities in the recently released Russian NEREL dataset for information extraction. The NEREL collection is currently the largest Russian dataset annotated with entities and relations. It includes 933 news texts with annotation of 29 entity types and 49 relation types. The paper describes the main design principles behind NEREL’s entity linking annotation, provides its statistics, and reports evaluation results for several entity linking baselines. To date, 38,152 entity mentions in 933 documents are linked to Wikidata. The NEREL dataset is publicly available.

pdf
NamedEntityRangers at SemEval-2022 Task 11: Transformer-based Approaches for Multilingual Complex Named Entity Recognition
Amina Miftahova | Alexander Pugachev | Artem Skiba | Katya Artemova | Tatiana Batura | Pavel Braslavski | Vladimir Ivanov
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the two submissions of NamedEntityRangers Team to the MultiCoNER Shared Task, hosted at SemEval-2022. We evaluate two state-of-the-art approaches, of which both utilize pre-trained multi-lingual language models differently. The first approach follows the token classification schema, in which each token is assigned with a tag. The second approach follows a recent template-free paradigm, in which an encoder-decoder model translates the input sequence of words to a special output, encoding named entities with predefined labels. We utilize RemBERT and mT5 as backbone models for these two approaches, respectively. Our results show that the oldie but goodie token classification outperforms the template-free method by a wide margin. Our code is available at: https://github.com/Abiks/MultiCoNER.

pdf
TERMinator: A System for Scientific Texts Processing
Elena Bruches | Olga Tikhobaeva | Yana Dementyeva | Tatiana Batura
Proceedings of the 29th International Conference on Computational Linguistics

This paper is devoted to the extraction of entities and semantic relations between them from scientific texts, where we consider scientific terms as entities. In this paper, we present a dataset that includes annotations for two tasks and develop a system called TERMinator for the study of the influence of language models on term recognition and comparison of different approaches for relation extraction. Experiments show that language models pre-trained on the target language are not always show the best performance. Also adding some heuristic approaches may improve the overall quality of the particular task. The developed tool and the annotated corpus are publicly available at https://github.com/iis-research-team/terminator and may be useful for other researchers.

2021

pdf
NEREL: A Russian Dataset with Nested Named Entities, Relations and Events
Natalia Loukachevitch | Ekaterina Artemova | Tatiana Batura | Pavel Braslavski | Ilia Denisov | Vladimir Ivanov | Suresh Manandhar | Alexander Pugachev | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse level. NEREL can facilitate development of novel models that can extract relations between nested named entities, as well as relations on both sentence and document levels. NEREL also contains the annotation of events involving named entities and their roles in the events. The NEREL collection is available via https://github.com/nerel-ds/NEREL.