Diego Antognini


2022

pdf bib
Unsupervised Term Extraction for Highly Technical Domains
Francesco Fusco | Peter Staar | Diego Antognini
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence-encoders. The annotator is used to implement a weakly-supervised setup, where transformer-models are fine-tuned (or pre-trained) over the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve the predictive performance while decreasing the inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.

2021

pdf
Rationalization through Concepts
Diego Antognini | Boi Faltings
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf
HotelRec: a Novel Very Large-Scale Hotel Recommendation Dataset
Diego Antognini | Boi Faltings
Proceedings of the Twelfth Language Resources and Evaluation Conference

Today, recommender systems are an inevitable part of everyone’s daily digital routine and are present on most internet platforms. State-of-the-art deep learning-based models require a large number of data to achieve their best performance. Many datasets fulfilling this criterion have been proposed for multiple domains, such as Amazon products, restaurants, or beers. However, works and datasets in the hotel domain are limited: the largest hotel review dataset is below the million samples. Additionally, the hotel domain suffers from a higher data sparsity than traditional recommendation datasets and therefore, traditional collaborative-filtering approaches cannot be applied to such data. In this paper, we propose HotelRec, a very large-scale hotel recommendation dataset, based on TripAdvisor, containing 50 million reviews. To the best of our knowledge, HotelRec is the largest publicly available dataset in the hotel domain (50M versus 0.9M) and additionally, the largest recommendation dataset in a single domain and with textual reviews (50M versus 22M). We release HotelRec for further research: https://github.com/Diego999/HotelRec.

pdf
GameWikiSum: a Novel Large Multi-Document Summarization Dataset
Diego Antognini | Boi Faltings
Proceedings of the Twelfth Language Resources and Evaluation Conference

Today’s research progress in the field of multi-document summarization is obstructed by the small number of available datasets. Since the acquisition of reference summaries is costly, existing datasets contain only hundreds of samples at most, resulting in heavy reliance on hand-crafted features or necessitating additional, manually annotated data. The lack of large corpora therefore hinders the development of sophisticated models. Additionally, most publicly available multi-document summarization corpora are in the news domain, and no analogous dataset exists in the video game domain. In this paper, we propose GameWikiSum, a new domain-specific dataset for multi-document summarization, which is one hundred times larger than commonly used datasets, and in another domain than news. Input documents consist of long professional video game reviews as well as references of their gameplay sections in Wikipedia pages. We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it. We release GameWikiSum for further research: https://github.com/Diego999/GameWikiSum.

2019

pdf
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization
Diego Antognini | Boi Faltings
Proceedings of the 2nd Workshop on New Frontiers in Summarization

Linking facts across documents is a challenging task, as the language used to express the same information in a sentence can vary significantly, which complicates the task of multi-document summarization. Consequently, existing approaches heavily rely on hand-crafted features, which are domain-dependent and hard to craft, or additional annotated data, which is costly to gather. To overcome these limitations, we present a novel method, which makes use of two types of sentence embeddings: universal embeddings, which are trained on a large unrelated corpus, and domain-specific embeddings, which are learned during training. To this end, we develop SemSentSum, a fully data-driven model able to leverage both types of sentence embeddings by building a sentence semantic relation graph. SemSentSum achieves competitive results on two types of summary, consisting of 665 bytes and 100 words. Unlike other state-of-the-art models, neither hand-crafted features nor additional annotated data are necessary, and the method is easily adaptable for other tasks. To our knowledge, we are the first to use multiple sentence embeddings for the task of multi-document summarization.