Marco Basaldella


Self-Alignment Pretraining for Biomedical Entity Representations
Fangyu Liu | Ehsan Shareghi | Zaiqiao Meng | Marco Basaldella | Nigel Collier
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERTand and PubMedBERT, our pretraining scheme proves to be both effective and robust.

pdf bib
Adversarial Training for News Stance Detection: Leveraging Signals from a Multi-Genre Corpus.
Costanza Conforti | Jakob Berndt | Marco Basaldella | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Cross-target generalization constitutes an important issue for news Stance Detection (SD). In this short paper, we investigate adversarial cross-genre SD, where knowledge from annotated user-generated data is leveraged to improve news SD on targets unseen during training. We implement a BERT-based adversarial network and show experimental performance improvements over a set of strong baselines. Given the abundance of user-generated data, which are considerably less expensive to retrieve and annotate than news articles, this constitutes a promising research direction.


COMETA: A Corpus for Medical Entity Linking in the Social Media
Marco Basaldella | Fangyu Liu | Ehsan Shareghi | Nigel Collier
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman’s language. Meanwhile, there is a growing need for applications that can understand the public’s voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.

Natural Language Processing for Achieving Sustainable Development: the Case of Neural Labelling to Enhance Community Profiling
Costanza Conforti | Stephanie Hirmer | Dai Morgan | Marco Basaldella | Yau Ben Or
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In recent years, there has been an increasing interest in the application of Artificial Intelligence – and especially Machine Learning – to the field of Sustainable Development (SD). However, until now, NLP has not been systematically applied in this context. In this paper, we show the high potential of NLP to enhance project sustainability. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. Here, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new extreme multi-class multi-label Automatic UserPerceived Value classification task. We release Stories2Insights, an expert-annotated dataset of interviews carried out in Uganda, we provide a detailed corpus analysis, and we implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging, and leaves considerable room for future research at the intersection of NLP and SD.


BioReddit: Word Embeddings for User-Generated Biomedical NLP
Marco Basaldella | Nigel Collier
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Word embeddings, in their different shapes and iterations, have changed the natural language processing research landscape in the last years. The biomedical text processing field is no stranger to this revolution; however, scholars in the field largely trained their embeddings on scientific documents only, even when working on user-generated data. In this paper we show how training embeddings from a corpus collected from user-generated text from medical forums heavily influences the performance on downstream tasks, outperforming embeddings trained both on general purpose data or on scientific papers when applied on user-generated content.


Exploiting and Evaluating a Supervised, Multilanguage Keyphrase Extraction pipeline for under-resourced languages
Marco Basaldella | Muhammad Helmy | Elisa Antolli | Mihai Horia Popescu | Giuseppe Serra | Carlo Tasso
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This paper evaluates different techniques for building a supervised, multilanguage keyphrase extraction pipeline for languages which lack a gold standard. Starting from an unsupervised English keyphrase extraction pipeline, we implement pipelines for Arabic, Italian, Portuguese, and Romanian, and we build test collections for languages which lack one. Then, we add a Machine Learning module trained on a well-known English language corpus and we evaluate the performance not only over English but on the other languages as well. Finally, we repeat the same evaluation after training the pipeline over an Arabic language corpus to check whether using a language-specific corpus brings a further improvement in performance. On the five languages we analyzed, results show an improvement in performance when using a machine learning algorithm, even if such algorithm is not trained and tested on the same language.


Evaluating anaphora and coreference resolution to improve automatic keyphrase extraction
Marco Basaldella | Giorgia Chiaradia | Carlo Tasso
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we analyze the effectiveness of using linguistic knowledge from coreference and anaphora resolution for improving the performance for supervised keyphrase extraction. In order to verify the impact of these features, we define a baseline keyphrase extraction system and evaluate its performance on a standard dataset using different machine learning algorithms. Then, we consider new sets of features by adding combinations of the linguistic features we propose and we evaluate the new performance of the system. We also use anaphora and coreference resolution to transform the documents, trying to simulate the cohesion process performed by the human mind. We found that our approach has a slightly positive impact on the performance of automatic keyphrase extraction, in particular when considering the ranking of the results.