Julia Otmakhova

Also published as: Yulia Otmakhova

2023

pdf abs
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Lucy Lu Wang | Yulia Otmakhova | Jay DeYoung | Thinh Hung Truong | Bailey Kuehl | Erin Bransom | Byron Wallace
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, to other automated metrics including several we propose in this work, and to aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.

2022

pdf abs
Cross-linguistic Comparison of Linguistic Feature Encoding in BERT Models for Typologically Different Languages
Yulia Otmakhova | Karin Verspoor | Jey Han Lau
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Though recently there have been an increased interest in how pre-trained language models encode different linguistic features, there is still a lack of systematic comparison between languages with different morphology and syntax. In this paper, using BERT as an example of a pre-trained model, we compare how three typologically different languages (English, Korean, and Russian) encode morphology and syntax features across different layers. In particular, we contrast languages which differ in a particular aspect, such as flexibility of word order, head directionality, morphological type, presence of grammatical gender, and morphological richness, across four different tasks.

pdf abs
LED down the rabbit hole: exploring the potential of global attention for biomedical multi-document summarisation
Yulia Otmakhova | Thinh Hung Truong | Timothy Baldwin | Trevor Cohn | Karin Verspoor | Jey Han Lau
Proceedings of the Third Workshop on Scholarly Document Processing

In this paper we report the experiments performed for the submission to the Multidocument summarisation for Literature Review (MSLR) Shared Task. In particular, we adopt Primera model to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of 23 resulting models and report some patterns related to the presence of additional global attention, number of training steps and the inputs configuration.

pdf abs
M3: Multi-level dataset for Multi-document summarisation of Medical studies
Yulia Otmakhova | Karin Verspoor | Timothy Baldwin | Antonio Jimeno Yepes | Jey Han Lau
Findings of the Association for Computational Linguistics: EMNLP 2022

We present M3 (Multi-level dataset for Multi-document summarisation of Medical studies), a benchmark dataset for evaluating the quality of summarisation systems in the biomedical domain. The dataset contains sets of multiple input documents and target summaries of three levels of complexity: documents, sentences, and propositions. The dataset also includes several levels of annotation, including biomedical entities, direction, and strength of relations between them, and the discourse relationships between the input documents (“contradiction” or “agreement”). We showcase usage scenarios of the dataset by testing 10 generic and domain-specific summarisation models in a zero-shot setting, and introduce a probing task based on counterfactuals to test if models are aware of the direction and strength of the conclusions generated from input studies.

pdf abs
The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature
Yulia Otmakhova | Karin Verspoor | Timothy Baldwin | Jey Han Lau
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although multi-document summarisation (MDS) of the biomedical literature is a highly valuable task that has recently attracted substantial interest, evaluation of the quality of biomedical summaries lacks consistency and transparency. In this paper, we examine the summaries generated by two current models in order to understand the deficiencies of existing evaluation approaches in the context of the challenges that arise in the MDS task. Based on this analysis, we propose a new approach to human evaluation and identify several challenges that must be overcome to develop effective biomedical MDS systems.

pdf abs
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Thinh Hung Truong | Yulia Otmakhova | Timothy Baldwin | Trevor Cohn | Jey Han Lau | Karin Verspoor
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise–hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is constructed by making minimal modifications to the premise in order to reflect different possible interpretations. Aside from adopting standard NLI labels, our test suite is systematically constructed under a rigorous linguistic framework. It includes annotation of negation types and constructions grounded in linguistic theory, as well as the operations used to construct hypotheses. This facilitates fine-grained analysis of model performance. We conduct experiments using pre-trained language models to demonstrate that our test suite is more challenging than existing benchmarks focused on negation, and show how our annotation supports a deeper understanding of the current NLI capabilities in terms of negation and quantification.

2020

pdf abs
Improved Topic Representations of Medical Documents to Assist COVID-19 Literature Exploration
Yulia Otmakhova | Karin Verspoor | Timothy Baldwin | Simon Šuster
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Efficient discovery and exploration of biomedical literature has grown in importance in the context of the COVID-19 pandemic, and topic-based methods such as latent Dirichlet allocation (LDA) are a useful tool for this purpose. In this study we compare traditional topic models based on word tokens with topic models based on medical concepts, and propose several ways to improve topic coherence and specificity.