Roberto Navigli

2021

pdf bib abs
Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration
Simone Conia | Roberto Navigli
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recent studies treat Word Sense Disambiguation (WSD) as a single-label classification problem in which one is asked to choose only the best-fitting sense for a target word, given its context. However, gold data labelled by expert annotators suggest that maximizing the probability of a single sense may not be the most suitable training objective for WSD, especially if the sense inventory of choice is fine-grained. In this paper, we approach WSD as a multi-label classification problem in which multiple senses can be assigned to each target word. Not only does our simple method bear a closer resemblance to how human annotators disambiguate text, but it can also be seamlessly extended to exploit structured knowledge from semantic networks to achieve state-of-the-art results in English all-words WSD.

pdf bib
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib abs
UniteD-SRL: A Unified Dataset for Span- and Dependency-Based Multilingual and Cross-Lingual Semantic Role Labeling
Rocco Tripodi | Simone Conia | Roberto Navigli
Findings of the Association for Computational Linguistics: EMNLP 2021

Multilingual and cross-lingual Semantic Role Labeling (SRL) have recently garnered increasing attention as multilingual text representation techniques have become more effective and widely available. While recent work has attained growing success, results on gold multilingual benchmarks are still not easily comparable across languages, making it difficult to grasp where we stand. For example, in CoNLL-2009, the standard benchmark for multilingual SRL, language-to-language comparisons are affected by the fact that each language has its own dataset which differs from the others in size, domains, sets of labels and annotation guidelines. In this paper, we address this issue and propose UniteD-SRL, a new benchmark for multilingual and cross-lingual, span- and dependency-based SRL. UniteD-SRL provides expert-curated parallel annotations using a common predicate-argument structure inventory, allowing direct comparisons across languages and encouraging studies on cross-lingual transfer in SRL. We release UniteD-SRL v1.0 at https://github.com/SapienzaNLP/united-srl.

pdf bib abs
REBEL: Relation Extraction By End-to-end Language generation
Pere-Lluís Huguet Cabot | Roberto Navigli
Findings of the Association for Computational Linguistics: EMNLP 2021

Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model’s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.

pdf bib abs
WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER
Simone Tedeschi | Valentino Maiorca | Niccolò Campolungo | Francesco Cecconi | Roberto Navigli
Findings of the Association for Computational Linguistics: EMNLP 2021

Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.

pdf bib abs
Named Entity Recognition for Entity Linking: What Works and What’s Next
Simone Tedeschi | Simone Conia | Francesco Cecconi | Roberto Navigli
Findings of the Association for Computational Linguistics: EMNLP 2021

Entity Linking (EL) systems have achieved impressive results on standard benchmarks mainly thanks to the contextualized representations provided by recent pretrained language models. However, such systems still require massive amounts of data – millions of labeled examples – to perform at their best, with training times that often exceed several days, especially when limited computational resources are available. In this paper, we look at how Named Entity Recognition (NER) can be exploited to narrow the gap between EL systems trained on high and low amounts of labeled data. More specifically, we show how and to what extent an EL system can benefit from NER to enhance its entity representations, improve candidate selection, select more effective negative samples and enforce hard and soft constraints on its output entities. We release our software – code and model checkpoints – at https://github.com/Babelscape/ner4el.

pdf bib abs
SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation
Luigi Procopio | Rocco Tripodi | Roberto Navigli
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Graph-based semantic parsing aims to represent textual meaning through directed graphs. As one of the most promising general-purpose meaning representations, these structures and their parsing have gained a significant interest momentum during recent years, with several diverse formalisms being proposed. Yet, owing to this very heterogeneity, most of the research effort has focused mainly on solutions specific to a given formalism. In this work, instead, we reframe semantic parsing towards multiple formalisms as Multilingual Neural Machine Translation (MNMT), and propose SGL, a many-to-many seq2seq architecture trained with an MNMT objective. Backed by several experiments, we show that this framework is indeed effective once the learning procedure is enhanced with large parallel corpora coming from Machine Translation: we report competitive performances on AMR and UCCA parsing, especially once paired with pre-trained architectures. Furthermore, we find that models trained under this configuration scale remarkably well to tasks such as cross-lingual AMR parsing: SGL outperforms all its competitors by a large margin without even explicitly seeing non-English to AMR examples at training time and, once these examples are included as well, sets an unprecedented state of the art in this task. We release our code and our models for research purposes at https://github.com/SapienzaNLP/sgl.

pdf bib abs
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
Simone Conia | Andrea Bacciu | Roberto Navigli
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

While cross-lingual techniques are finding increasing success in a wide range of Natural Language Processing tasks, their application to Semantic Role Labeling (SRL) has been strongly limited by the fact that each language adopts its own linguistic formalism, from PropBank for English to AnCora for Spanish and PDT-Vallex for Czech, inter alia. In this work, we address this issue and present a unified model to perform cross-lingual SRL over heterogeneous linguistic resources. Our model implicitly learns a high-quality mapping for different formalisms across diverse languages without resorting to word alignment and/or translation techniques. We find that, not only is our cross-lingual system competitive with the current state of the art but that it is also robust to low-data scenarios. Most interestingly, our unified model is able to annotate a sentence in a single forward pass with all the inventories it was trained with, providing a tool for the analysis and comparison of linguistic theories across different languages. We release our code and model at https://github.com/SapienzaNLP/unify-srl.

pdf bib abs
ESC: Redesigning WSD with Extractive Sense Comprehension
Edoardo Barba | Tommaso Pasini | Roberto Navigli
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Word Sense Disambiguation (WSD) is a historical NLP task aimed at linking words in contexts to discrete sense inventories and it is usually cast as a multi-label classification task. Recently, several neural approaches have employed sense definitions to better represent word meanings. Yet, these approaches do not observe the input sentence and the sense definition candidates all at once, thus potentially reducing the model performance and generalization power. We cope with this issue by reframing WSD as a span extraction problem — which we called Extractive Sense Comprehension (ESC) — and propose ESCHER, a transformer-based neural architecture for this new formulation. By means of an extensive array of experiments, we show that ESC unleashes the full potential of our model, leading it to outdo all of its competitors and to set a new state of the art on the English WSD task. In the few-shot scenario, ESCHER proves to exploit training data efficiently, attaining the same performance as its closest competitor while relying on almost three times fewer annotations. Furthermore, ESCHER can nimbly combine data annotated with senses from different lexical resources, achieving performances that were previously out of everyone’s reach. The model along with data is available at https://github.com/SapienzaNLP/esc.

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib abs
IR like a SIR: Sense-enhanced Information Retrieval for Multiple Languages
Rexhina Blloshmi | Tommaso Pasini | Niccolò Campolungo | Somnath Banerjee | Roberto Navigli | Gabriella Pasi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

With the advent of contextualized embeddings, attention towards neural ranking approaches for Information Retrieval increased considerably. However, two aspects have remained largely neglected: i) queries usually consist of few keywords only, which increases ambiguity and makes their contextualization harder, and ii) performing neural ranking on non-English documents is still cumbersome due to shortage of labeled datasets. In this paper we present SIR (Sense-enhanced Information Retrieval) to mitigate both problems by leveraging word sense information. At the core of our approach lies a novel multilingual query expansion mechanism based on Word Sense Disambiguation that provides sense definitions as additional semantic information for the query. Importantly, we use senses as a bridge across languages, thus allowing our model to perform considerably better than its supervised and unsupervised alternatives across French, German, Italian and Spanish languages on several CLEF benchmarks, while being trained on English Robust04 data only. We release SIR at https://github.com/SapienzaNLP/sir.

pdf bib abs
ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension
Edoardo Barba | Luigi Procopio | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Supervised systems have nowadays become the standard recipe for Word Sense Disambiguation (WSD), with Transformer-based language models as their primary ingredient. However, while these systems have certainly attained unprecedented performances, virtually all of them operate under the constraining assumption that, given a context, each word can be disambiguated individually with no account of the other sense choices. To address this limitation and drop this assumption, we propose CONtinuous SEnse Comprehension (ConSeC), a novel approach to WSD: leveraging a recent re-framing of this task as a text extraction problem, we adapt it to our formulation and introduce a feedback loop strategy that allows the disambiguation of a target word to be conditioned not only on its context but also on the explicit senses assigned to nearby words. We evaluate ConSeC and examine how its components lead it to surpass all its competitors and set a new state of the art on English WSD. We also explore how ConSeC fares in the cross-lingual setting, focusing on 8 languages with various degrees of resource availability, and report significant improvements over prior systems. We release our code at https://github.com/SapienzaNLP/consec.

pdf bib abs
Integrating Personalized PageRank into Neural Word Sense Disambiguation
Ahmed El Sheikh | Michele Bevilacqua | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural Word Sense Disambiguation (WSD) has recently been shown to benefit from the incorporation of pre-existing knowledge, such as that coming from the WordNet graph. However, state-of-the-art approaches have been successful in exploiting only the local structure of the graph, with only close neighbors of a given synset influencing the prediction. In this work, we improve a classification model by recomputing logits as a function of both the vanilla independently produced logits and the global WordNet graph. We achieve this by incorporating an online neural approximated PageRank, which enables us to refine edge weights as well. This method exploits the global graph structure while keeping space requirements linear in the number of edges. We obtain strong improvements, matching the current state of the art. Code is available at https://github.com/SapienzaNLP/neural-pagerank-wsd

pdf bib abs
GeneSis: A Generative Approach to Substitutes in Context
Caterina Lacerra | Rocco Tripodi | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The lexical substitution task aims at generating a list of suitable replacements for a target word in context, ideally keeping the meaning of the modified text unchanged. While its usage has increased in recent years, the paucity of annotated data prevents the finetuning of neural models on the task, hindering the full fruition of recently introduced powerful architectures such as language models. Furthermore, lexical substitution is usually evaluated in a framework that is strictly bound to a limited vocabulary, making it impossible to credit appropriate, but out-of-vocabulary, substitutes. To assess these issues, we proposed GeneSis (Generating Substitutes in contexts), the first generative approach to lexical substitution. Thanks to a seq2seq model, we generate substitutes for a word according to the context it appears in, attaining state-of-the-art results on different benchmarks. Moreover, our approach allows silver data to be produced for further improving the performances of lexical substitution systems. Along with an extensive analysis of GeneSis results, we also present a human evaluation of the generated substitutes in order to assess their quality. We release the fine-tuned models, the generated datasets, and the code to reproduce the experiments at https://github.com/SapienzaNLP/genesis.

pdf bib abs
SPRING Goes Online: End-to-End AMR Parsing and Generation
Rexhina Blloshmi | Michele Bevilacqua | Edoardo Fabiano | Valentina Caruso | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In this paper we present SPRING Online Services, a Web interface and RESTful APIs for our state-of-the-art AMR parsing and generation system, SPRING (Symmetric PaRsIng aNd Generation). The Web interface has been developed to be easily used by the Natural Language Processing community, as well as by the general public. It provides, among other things, a highly interactive visualization platform and a feedback mechanism to obtain user suggestions for further improvements of the system’s output. Moreover, our RESTful APIs enable easy integration of SPRING in downstream applications where AMR structures are needed. Finally, we make SPRING Online Services freely available at http://nlp.uniroma1.it/spring and, in addition, we release extra model checkpoints to be used with the original SPRING Python code.

pdf bib abs
AMuSE-WSD: An All-in-one Multilingual System for Easy Word Sense Disambiguation
Riccardo Orlando | Simone Conia | Fabrizio Brignone | Francesco Cecconi | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Over the past few years, Word Sense Disambiguation (WSD) has received renewed interest: recently proposed systems have shown the remarkable effectiveness of deep learning techniques in this task, especially when aided by modern pretrained language models. Unfortunately, such systems are still not available as ready-to-use end-to-end packages, making it difficult for researchers to take advantage of their performance. The only alternative for a user interested in applying WSD to downstream tasks is to rely on currently available end-to-end WSD systems, which, however, still rely on graph-based heuristics or non-neural machine learning algorithms. In this paper, we fill this gap and propose AMuSE-WSD, the first end-to-end system to offer high-quality sense information in 40 languages through a state-of-the-art neural model for WSD. We hope that AMuSE-WSD will provide a stepping stone for the integration of meaning into real-world applications and encourage further studies in lexical semantics. AMuSE-WSD is available online at http://nlp.uniroma1.it/amuse-wsd.

pdf bib abs
InVeRo-XL: Making Cross-Lingual Semantic Role Labeling Accessible with Intelligible Verbs and Roles
Simone Conia | Riccardo Orlando | Fabrizio Brignone | Francesco Cecconi | Roberto Navigli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Notwithstanding the growing interest in cross-lingual techniques for Natural Language Processing, there has been a surprisingly small number of efforts aimed at the development of easy-to-use tools for cross-lingual Semantic Role Labeling. In this paper, we fill this gap and present InVeRo-XL, an off-the-shelf state-of-the-art system capable of annotating text with predicate sense and semantic role labels from 7 predicate-argument structure inventories in more than 40 languages. We hope that our system – with its easy-to-use RESTful API and Web interface – will become a valuable tool for the research community, encouraging the integration of sentence-level semantics into cross-lingual downstream tasks. InVeRo-XL is available online at http://nlp.uniroma1.it/invero.

pdf bib abs
SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC)
Federico Martelli | Najla Kalach | Gabriele Tola | Roberto Navigli
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper, we introduce the first SemEval task on Multilingual and Cross-Lingual Word-in-Context disambiguation (MCL-WiC). This task allows the largely under-investigated inherent ability of systems to discriminate between word senses within and across languages to be evaluated, dropping the requirement of a fixed sense inventory. Framed as a binary classification, our task is divided into two parts. In the multilingual sub-task, participating systems are required to determine whether two target words, each occurring in a different context within the same language, express the same meaning or not. Instead, in the cross-lingual part, systems are asked to perform the task in a cross-lingual scenario, in which the two target words and their corresponding contexts are provided in two different languages. We illustrate our task, as well as the construction of our manually-created dataset including five languages, namely Arabic, Chinese, English, French and Russian, and the results of the participating systems. Datasets and results are available at: https://github.com/SapienzaNLP/mcl-wic.

2020

pdf bib abs
XL-AMR: Enabling Cross-Lingual AMR Parsing with Transfer Learning Techniques
Rexhina Blloshmi | Rocco Tripodi | Roberto Navigli
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Abstract Meaning Representation (AMR) is a popular formalism of natural language that represents the meaning of a sentence as a semantic graph. It is agnostic about how to derive meanings from strings and for this reason it lends itself well to the encoding of semantics across languages. However, cross-lingual AMR parsing is a hard task, because training data are scarce in languages other than English and the existing English AMR parsers are not directly suited to being used in a cross-lingual setting. In this work we tackle these two problems so as to enable cross-lingual AMR parsing: we explore different transfer learning techniques for producing automatic AMR annotations across languages and develop a cross-lingual AMR parser, XL-AMR. This can be trained on the produced data and does not rely on AMR aligners or source-copy mechanisms as is commonly the case in English AMR parsing. The results of XL-AMR significantly surpass those previously reported in Chinese, German, Italian and Spanish. Finally we provide a qualitative analysis which sheds light on the suitability of AMR across languages. We release XL-AMR at github.com/SapienzaNLP/xl-amr.

pdf bib abs
With More Contexts Comes Better Performance: Contextualized Sense Embeddings for All-Round Word Sense Disambiguation
Bianca Scarlini | Tommaso Pasini | Roberto Navigli
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Contextualized word embeddings have been employed effectively across several tasks in Natural Language Processing, as they have proved to carry useful semantic information. However, it is still hard to link them to structured sources of knowledge. In this paper we present ARES (context-AwaRe Embeddings of Senses), a semi-supervised approach to producing sense embeddings for the lexical meanings within a lexical knowledge base that lie in a space that is comparable to that of contextualized word vectors. ARES representations enable a simple 1 Nearest-Neighbour algorithm to outperform state-of-the-art models, not only in the English Word Sense Disambiguation task, but also in the multilingual one, whilst training on sense-annotated data in English only. We further assess the quality of our embeddings in the Word-in-Context task, where, when used as an external source of knowledge, they consistently improve the performance of a neural model, leading it to compete with other more complex architectures. ARES embeddings for all WordNet concepts and the automatically-extracted contexts used for creating the sense representations are freely available at http://sensembert.org/ares.

pdf bib abs
Generationary or “How We Went beyond Word Sense Inventories and Learned to Gloss”
Michele Bevilacqua | Marco Maru | Roberto Navigli
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Mainstream computational lexical semantics embraces the assumption that word senses can be represented as discrete items of a predefined inventory. In this paper we show this needs not be the case, and propose a unified model that is able to produce contextually appropriate definitions. In our model, Generationary, we employ a novel span-based encoding scheme which we use to fine-tune an English pre-trained Encoder-Decoder system to generate glosses. We show that, even though we drop the need of choosing from a predefined sense inventory, our model can be employed effectively: not only does Generationary outperform previous approaches in the generative task of Definition Modeling in many settings, but it also matches or surpasses the state of the art in discriminative tasks such as Word Sense Disambiguation and Word-in-Context. Finally, we show that Generationary benefits from training on data from multiple inventories, with strong gains on various zero-shot benchmarks, including a novel dataset of definitions for free adjective-noun phrases. The software and reproduction materials are available at http://generationary.org.

pdf bib abs
InVeRo: Making Semantic Role Labeling Accessible with Intelligible Verbs and Roles
Simone Conia | Fabrizio Brignone | Davide Zanfardino | Roberto Navigli
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Semantic Role Labeling (SRL) is deeply dependent on complex linguistic resources and sophisticated neural models, which makes the task difficult to approach for non-experts. To address this issue we present a new platform named Intelligible Verbs and Roles (InVeRo). This platform provides access to a new verb resource, VerbAtlas, and a state-of-the-art pretrained implementation of a neural, span-based architecture for SRL. Both the resource and the system provide human-readable verb sense and semantic role information, with an easy to use Web interface and RESTful APIs available at http://nlp.uniroma1.it/invero.

pdf bib abs
Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach
Simone Conia | Roberto Navigli
Proceedings of the 28th International Conference on Computational Linguistics

Recent research indicates that taking advantage of complex syntactic features leads to favorable results in Semantic Role Labeling. Nonetheless, an analysis of the latest state-of-the-art multilingual systems reveals the difficulty of bridging the wide gap in performance between high-resource (e.g., English) and low-resource (e.g., German) settings. To overcome this issue, we propose a fully language-agnostic model that does away with morphological and syntactic features to achieve robustness across languages. Our approach outperforms the state of the art in all the languages of the CoNLL-2009 benchmark dataset, especially whenever a scarce amount of training data is available. Our objective is not to reject approaches that rely on syntax, rather to set a strong and consistent language-independent baseline for future innovations in Semantic Role Labeling. We release our model code and checkpoints at https://github.com/SapienzaNLP/multi-srl.

pdf bib abs
Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations
Simone Conia | Roberto Navigli
Proceedings of the 28th International Conference on Computational Linguistics

To date, the most successful word, word sense, and concept modelling techniques have used large corpora and knowledge resources to produce dense vector representations that capture semantic similarities in a relatively low-dimensional space. Most current approaches, however, suffer from a monolingual bias, with their strength depending on the amount of data available across languages. In this paper we address this issue and propose Conception, a novel technique for building language-independent vector representations of concepts which places multilinguality at its core while retaining explicit relationships between concepts. Our approach results in high-coverage representations that outperform the state of the art in multilingual and cross-lingual Semantic Word Similarity and Word Sense Disambiguation, proving particularly robust on low-resource languages. Conception – its software and the complete set of representations – is available at https://github.com/SapienzaNLP/conception.

pdf bib abs
Invited Talk: Generationary or: “How We Went beyond Sense Inventories and Learned to Gloss”
Roberto Navigli
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

In this talk I present Generationary, an approach that goes beyond the mainstream assumption that word senses can be represented as discrete items of a predefined inventory, and put forward a unified model which produces contextualized definitions for arbitrary lexical items, from words to phrases and even sentences. Generationary employs a novel span-based encoding scheme to fine-tune an English pre-trained Encoder-Decoder system and generate new definitions. Our model outperforms previous approaches in the generative task of Definition Modeling in many settings, but it also matches or surpasses the state of the art in discriminative tasks such as Word Sense Disambiguation and Word-in-Context. I also show that Generationary benefits from training on definitions from multiple inventories, with strong gains across benchmarks, including a novel dataset of definitions for free adjective-noun phrases, and discuss interesting examples of generated definitions. Joint work with Michele Bevilacqua and Marco Maru.

pdf bib abs
Building Semantic Grams of Human Knowledge
Valentina Leone | Giovanni Siragusa | Luigi Di Caro | Roberto Navigli
Proceedings of the 12th Language Resources and Evaluation Conference

Word senses are typically defined with textual definitions for human consumption and, in computational lexicons, put in context via lexical-semantic relations such as synonymy, antonymy, hypernymy, etc. In this paper we embrace a radically different paradigm that provides a slot-filler structure, called “semagram”, to define the meaning of words in terms of their prototypical semantic information. We propose a semagram-based knowledge model composed of 26 semantic relationships which integrates features from a range of different sources, such as computational lexicons and property norms. We describe an annotation exercise regarding 50 concepts over 10 different categories and put forward different automated approaches for extending the semagram base to thousands of concepts. We finally evaluated the impact of the proposed resource on a semantic similarity task, showing significant improvements over state-of-the-art word embeddings.

pdf bib abs
Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains
Bianca Scarlini | Tommaso Pasini | Roberto Navigli
Proceedings of the 12th Language Resources and Evaluation Conference

The knowledge acquisition bottleneck problem dramatically hampers the creation of sense-annotated data for Word Sense Disambiguation (WSD). Sense-annotated data are scarce for English and almost absent for other languages. This limits the range of action of deep-learning approaches, which today are at the base of any NLP task and are hungry for data. We mitigate this issue and encourage further research in multilingual WSD by releasing to the NLP community five large datasets annotated with word-senses in five different languages, namely, English, French, Italian, German and Spanish, and 5 distinct datasets in English, each for a different semantic domain. We show that supervised WSD models trained on our data attain higher performance than when trained on other automatically-created corpora. We release all our data containing more than 15 million annotated instances in 5 different languages at http://trainomatic.org/onesec.

pdf bib abs
Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information
Michele Bevilacqua | Roberto Navigli
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural architectures are the current state of the art in Word Sense Disambiguation (WSD). However, they make limited use of the vast amount of relational information encoded in Lexical Knowledge Bases (LKB). We present Enhanced WSD Integrating Synset Embeddings and Relations (EWISER), a neural supervised architecture that is able to tap into this wealth of knowledge by embedding information from the LKB graph within the neural architecture, and to exploit pretrained synset embeddings, enabling the network to predict synsets that are not in the training set. As a result, we set a new state of the art on almost all the evaluation settings considered, also breaking through, for the first time, the 80% ceiling on the concatenation of all the standard all-words English WSD evaluation benchmarks. On multilingual all-words WSD, we report state-of-the-art results by training on nothing but English.

pdf bib abs
Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts
Agostina Calabrese | Michele Bevilacqua | Roberto Navigli
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Thanks to the wealth of high-quality annotated images available in popular repositories such as ImageNet, multimodal language-vision research is in full bloom. However, events, feelings and many other kinds of concepts which can be visually grounded are not well represented in current datasets. Nevertheless, we would expect a wide-coverage language understanding system to be able to classify images depicting recess and remorse, not just cats, dogs and bridges. We fill this gap by presenting BabelPic, a hand-labeled dataset built by cleaning the image-synset association found within the BabelNet Lexical Knowledge Base (LKB). BabelPic explicitly targets non-concrete concepts, thus providing refreshing new data for the community. We also show that pre-trained language-vision systems can be used to further expand the resource by exploiting natural language knowledge available in the LKB. BabelPic is available for download at http://babelpic.org.

pdf bib abs
Personalized PageRank with Syntagmatic Information for Multilingual Word Sense Disambiguation
Federico Scozzafava | Marco Maru | Fabrizio Brignone | Giovanni Torrisi | Roberto Navigli
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Exploiting syntagmatic information is an encouraging research focus to be pursued in an effort to close the gap between knowledge-based and supervised Word Sense Disambiguation (WSD) performance. We follow this direction in our next-generation knowledge-based WSD system, SyntagRank, which we make available via a Web interface and a RESTful API. SyntagRank leverages the disambiguated pairs of co-occurring words included in SyntagNet, a lexical-semantic combination resource, to perform state-of-the-art knowledge-based WSD in a multilingual setting. Our service provides both a user-friendly interface, available at http://syntagnet.org/, and a RESTful endpoint to query the system programmatically (accessible at http://api.syntagnet.org/).

2019

pdf bib abs
Just “OneSeC” for Producing Multilingual Sense-Annotated Data
Bianca Scarlini | Tommaso Pasini | Roberto Navigli
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The well-known problem of knowledge acquisition is one of the biggest issues in Word Sense Disambiguation (WSD), where annotated data are still scarce in English and almost absent in other languages. In this paper we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning. Our automatically-generated data consistently lead a supervised WSD model to state-of-the-art performance when compared with other automatic and semi-automatic methods. Moreover, our approach outperforms its competitors on multilingual and domain-specific settings, where it beats the existing state of the art on all languages and most domains. All the training data are available for research purposes at http://trainomatic.org/onesec.

pdf bib abs
LSTMEmbed: Learning Word and Sense Representations from a Large Semantically Annotated Corpus with Long Short-Term Memories
Ignacio Iacobacci | Roberto Navigli
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While word embeddings are now a de facto standard representation of words in most NLP tasks, recently the attention has been shifting towards vector representations which capture the different meanings, i.e., senses, of words. In this paper we explore the capabilities of a bidirectional LSTM model to learn representations of word senses from semantically annotated corpora. We show that the utilization of an architecture that is aware of word order, like an LSTM, enables us to create better representations. We assess our proposed model on various standard benchmarks for evaluating semantic representations, reaching state-of-the-art performance on the SemEval-2014 word-to-sense similarity task. We release the code and the resulting word and sense embeddings at http://lcl.uniroma1.it/LSTMEmbed.

pdf bib abs
Quasi Bidirectional Encoder Representations from Transformers for Word Sense Disambiguation
Michele Bevilacqua | Roberto Navigli
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

While contextualized embeddings have produced performance breakthroughs in many Natural Language Processing (NLP) tasks, Word Sense Disambiguation (WSD) has not benefited from them yet. In this paper, we introduce QBERT, a Transformer-based architecture for contextualized embeddings which makes use of a co-attentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, we are able to train a WSD system that beats the state of the art on the concatenation of all evaluation datasets by over 3 points, also outperforming a comparable model using ELMo.

pdf bib abs
Game Theory Meets Embeddings: a Unified Framework for Word Sense Disambiguation
Rocco Tripodi | Roberto Navigli
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Game-theoretic models, thanks to their intrinsic ability to exploit contextual information, have shown to be particularly suited for the Word Sense Disambiguation task. They represent ambiguous words as the players of a non cooperative game and their senses as the strategies that the players can select in order to play the games. The interaction among the players is modeled with a weighted graph and the payoff as an embedding similarity function, that the players try to maximize. The impact of the word and sense embedding representations in the framework has been tested and analyzed extensively: experiments on standard benchmarks show state-of-art performances and different tests hint at the usefulness of using disambiguation to obtain contextualized word representations.

pdf bib abs
VerbAtlas: a Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling
Andrea Di Fabio | Simone Conia | Roberto Navigli
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present VerbAtlas, a new, hand-crafted lexical-semantic resource whose goal is to bring together all verbal synsets from WordNet into semantically-coherent frames. The frames define a common, prototypical argument structure while at the same time providing new concept-specific information. In contrast to PropBank, which defines enumerative semantic roles, VerbAtlas comes with an explicit, cross-frame set of semantic roles linked to selectional preferences expressed in terms of WordNet synsets, and is the first resource enriched with semantic information about implicit, shadow, and default arguments. We demonstrate the effectiveness of VerbAtlas in the task of dependency-based Semantic Role Labeling and show how its integration into a high-performance system leads to improvements on both the in-domain and out-of-domain test sets of CoNLL-2009. VerbAtlas is available at http://verbatlas.org.

pdf bib abs
SyntagNet: Challenging Supervised Word Sense Disambiguation with Lexical-Semantic Combinations
Marco Maru | Federico Scozzafava | Federico Martelli | Roberto Navigli
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Current research in knowledge-based Word Sense Disambiguation (WSD) indicates that performances depend heavily on the Lexical Knowledge Base (LKB) employed. This paper introduces SyntagNet, a novel resource consisting of manually disambiguated lexical-semantic combinations. By capturing sense distinctions evoked by syntagmatic relations, SyntagNet enables knowledge-based WSD systems to establish a new state of the art which challenges the hitherto unrivaled performances attained by supervised approaches. To the best of our knowledge, SyntagNet is the first large-scale manually-curated resource of this kind made available to the community (at http://syntagnet.org).

2018

This paper describes the SemEval 2018 Shared Task on Hypernym Discovery. We put forward this task as a complementary benchmark for modeling hypernymy, a problem which has traditionally been cast as a binary classification task, taking a pair of candidate words as input. Instead, our reformulated task is defined as follows: given an input term, retrieve (or discover) its suitable hypernyms from a target corpus. We proposed five different subtasks covering three languages (English, Spanish, and Italian), and two specific domains of knowledge in English (Medical and Music). Participants were allowed to compete in any or all of the subtasks. Overall, a total of 11 teams participated, with a total of 39 different systems submitted through all subtasks. Data, results and further information about the task can be found at https://competitions.codalab.org/competitions/17119.

pdf bib
Huge Automatically Extracted Training-Sets for Multilingual Word SenseDisambiguation
Tommaso Pasini | Francesco Elia | Roberto Navigli
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data
Tommaso Pasini | Roberto Navigli
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Annotating large numbers of sentences with senses is the heaviest requirement of current Word Sense Disambiguation. We present Train-O-Matic, a language-independent method for generating millions of sense-annotated training instances for virtually all meanings of words in a language’s vocabulary. The approach is fully automatic: no human intervention is required and the only type of human knowledge used is a WordNet-like resource. Train-O-Matic achieves consistently state-of-the-art performance across gold standard datasets and languages, while at the same time removing the burden of manual annotation. All the training data is available for research purposes at http://trainomatic.org.

pdf bib abs
Neural Sequence Learning Models for Word Sense Disambiguation
Alessandro Raganato | Claudio Delli Bovi | Roberto Navigli
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Word Sense Disambiguation models exist in many flavors. Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target. To bridge this gap we adopt a different perspective and rely on sequence learning to frame the disambiguation problem: we propose and study in depth a series of end-to-end neural architectures directly tailored to the task, from bidirectional Long Short-Term Memory to encoder-decoder models. Our extensive evaluation over standard benchmarks and in multiple languages shows that sequence learning enables more versatile all-words models that consistently lead to state-of-the-art results, even against word experts with engineered features.

pdf bib abs
Towards a Seamless Integration of Word Senses into Downstream NLP Applications
Mohammad Taher Pilehvar | Jose Camacho-Collados | Roberto Navigli | Nigel Collier
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.

pdf bib abs
EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text
Claudio Delli Bovi | Jose Camacho-Collados | Alessandro Raganato | Roberto Navigli
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Parallel corpora are widely used in a variety of Natural Language Processing tasks, from Machine Translation to cross-lingual Word Sense Disambiguation, where parallel sentences can be exploited to automatically generate high-quality sense annotations on a large scale. In this paper we present EuroSense, a multilingual sense-annotated resource based on the joint disambiguation of the Europarl parallel corpus, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities from a language-independent unified sense inventory. We evaluate the quality of our sense annotations intrinsically and extrinsically, showing their effectiveness as training data for Word Sense Disambiguation.

pdf bib abs
SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
Jose Camacho-Collados | Mohammad Taher Pilehvar | Nigel Collier | Roberto Navigli
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper introduces a new task on Multilingual and Cross-lingual SemanticThis paper introduces a new task on Multilingual and Cross-lingual Semantic Word Similarity which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. High quality datasets were manually curated for the five languages with high inter-annotator agreements (consistently in the 0.9 ballpark). These were used for semi-automatic construction of ten cross-lingual datasets. 17 teams participated in the task, submitting 24 systems in subtask 1 and 14 systems in subtask 2. Results show that systems that combine statistical knowledge from text corpora, in the form of word embeddings, and external knowledge from lexical resources are best performers in both subtasks. More information can be found on the task website: http://alt.qcri.org/semeval2017/task2/

pdf bib abs
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
Massimiliano Mancini | Jose Camacho-Collados | Ignacio Iacobacci | Roberto Navigli
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.

pdf bib abs
Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison
Alessandro Raganato | Jose Camacho-Collados | Roberto Navigli
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Word Sense Disambiguation is a long-standing task in Natural Language Processing, lying at the core of human language understanding. However, the evaluation of automatic systems has been problematic, mainly due to the lack of a reliable evaluation framework. In this paper we develop a unified evaluation framework and analyze the performance of various Word Sense Disambiguation systems in a fair setup. The results show that supervised systems clearly outperform knowledge-based models. Among the supervised systems, a linear classifier trained on conventional local features still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting neural networks on unlabeled corpora achieve promising results, surpassing this hard baseline in most test sets.

pdf bib abs
BabelDomains: Large-Scale Domain Labeling of Lexical Resources
Jose Camacho-Collados | Roberto Navigli
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper we present BabelDomains, a unified resource which provides lexical items with information about domains of knowledge. We propose an automatic method that uses knowledge from various lexical resources, exploiting both distributional and graph-based clues, to accurately propagate domain information. We evaluate our methodology intrinsically on two lexical resources (WordNet and BabelNet), achieving a precision over 80% in both cases. Finally, we show the potential of BabelDomains in a supervised learning setting, clustering training data by domain for hypernym discovery.

2016

pdf bib
Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations
José Camacho-Collados | Roberto Navigli
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

pdf bib
Embeddings for Word Sense Disambiguation: An Evaluation Study
Ignacio Iacobacci | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib abs
A Large-Scale Multilingual Disambiguation of Glosses
José Camacho-Collados | Claudio Delli Bovi | Alessandro Raganato | Roberto Navigli
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Linking concepts and named entities to knowledge bases has become a crucial Natural Language Understanding task. In this respect, recent works have shown the key advantage of exploiting textual definitions in various Natural Language Processing applications. However, to date there are no reliable large-scale corpora of sense-annotated textual definitions available to the research community. In this paper we present a large-scale high-quality corpus of disambiguated glosses in multiple languages, comprising sense annotations of both concepts and named entities from a unified sense inventory. Our approach for the construction and disambiguation of the corpus builds upon the structure of a large multilingual semantic network and a state-of-the-art disambiguation system; first, we gather complementary information of equivalent definitions across different languages to provide context for disambiguation, and then we combine it with a semantic similarity-based refinement. As a result we obtain a multilingual corpus of textual definitions featuring over 38 million definitions in 263 languages, and we make it freely available at http://lcl.uniroma1.it/disambiguated-glosses. Experiments on Open Information Extraction and Sense Clustering show how two state-of-the-art approaches improve their performance by integrating our disambiguated corpus into their pipeline.

2015

pdf bib
Proceedings of the 1st Workshop on Semantics-Driven Statistical Machine Translation (S2MT 2015)
Deyi Xiong | Kevin Duh | Christian Hardmeier | Roberto Navigli
Proceedings of the 1st Workshop on Semantics-Driven Statistical Machine Translation (S2MT 2015)

pdf bib abs
Multilinguality at Your Fingertips : BabelNet, Babelfy and Beyond !
Roberto Navigli
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées

Multilinguality is a key feature of today’s Web, and it is this feature that we leverage and exploit in our research work at the Sapienza University of Rome’s Linguistic Computing Laboratory, which I am going to overview and showcase in this talk. I will start by presenting BabelNet 3.0, available at http://babelnet.org, a very large multilingual encyclopedic dictionary and semantic network, which covers 271 languages and provides both lexicographic and encyclopedic knowledge for all the open-class parts of speech, thanks to the seamless integration of WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata and the Open Multilingual WordNet. Next, I will present Babelfy, available at http://babelfy.org, a unified approach that leverages BabelNet to jointly perform word sense disambiguation and entity linking in arbitrary languages, with performance on both tasks on a par with, or surpassing, those of task-specific state-of-the-art supervised systems. Finally I will describe the Wikipedia Bitaxonomy, available at http://wibitaxonomy.org, a new approach to the construction of a Wikipedia bitaxonomy, that is, the largest and most accurate currently available taxonomy of Wikipedia pages and taxonomy of categories, aligned to each other. I will also give an outline of future work on multilingual resources and processing, including state-of-the-art semantic similarity with sense embeddings.

pdf bib abs
Large-Scale Information Extraction from Textual Definitions through Deep Syntactic and Semantic Analysis
Claudio Delli Bovi | Luca Telesca | Roberto Navigli
Transactions of the Association for Computational Linguistics, Volume 3

We present DefIE, an approach to large-scale Information Extraction (IE) based on a syntactic-semantic analysis of textual definitions. Given a large corpus of definitions we leverage syntactic dependencies to reduce data sparsity, then disambiguate the arguments and content words of the relation strings, and finally exploit the resulting information to organize the acquired relations hierarchically. The output of DefIE is a high-quality knowledge base consisting of several million automatically acquired semantic relations.

pdf bib
SensEmbed: Learning Sense Embeddings for Word and Relational Similarity
Ignacio Iacobacci | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
A Unified Multilingual Semantic Representation of Concepts
José Camacho-Collados | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets
José Camacho-Collados | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Reading Between the Lines: Overcoming Data Sparsity for Accurate Classification of Lexical Relationships
Silvia Necşulescu | Sara Mendes | David Jurgens | Núria Bel | Roberto Navigli
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking
Andrea Moro | Roberto Navigli
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
SemEval-2015 Task 17: Taxonomy Extraction Evaluation (TExEval)
Georgeta Bordea | Paul Buitelaar | Stefano Faralli | Roberto Navigli
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Knowledge Base Unification via Sense Embeddings and Disambiguation
Claudio Delli Bovi | Luis Espinosa-Anke | Roberto Navigli
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
NASARI: a Novel Approach to a Semantically-Aware Representation of Items
José Camacho-Collados | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
An Open-source Framework for Multi-level Semantic Similarity Measurement
Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

pdf bib
A Large-Scale Pseudoword-Based Evaluation Framework for State-of-the-Art Word Sense Disambiguation
Mohammad Taher Pilehvar | Roberto Navigli
Computational Linguistics, Volume 40, Issue 4 - December 2014

pdf bib abs
Annotating the MASC Corpus with BabelNet
Andrea Moro | Roberto Navigli | Francesco Maria Tucci | Rebecca J. Passonneau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we tackle the problem of automatically annotating, with both word senses and named entities, the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text. We use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory together with its semantic structure, to perform the aforementioned annotation task. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotations goes exactly in our same direction. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations using both manually-tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entities and word sense annotations.

pdf bib abs
Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0
Maud Ehrmann | Francesco Cecconi | Daniele Vannella | John Philip McCrae | Philipp Cimiano | Roberto Navigli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have witnessed a surge in the amount of semantic information published on the Web. Indeed, the Web of Data, a subset of the Semantic Web, has been increasing steadily in both volume and variety, transforming the Web into a ‘global database’ in which resources are linked across sites. Linguistic fields -- in a broad sense -- have not been left behind, and we observe a similar trend with the growth of linguistic data collections on the so-called ‘Linguistic Linked Open Data (LLOD) cloud’. While both Semantic Web and Natural Language Processing communities can obviously take advantage of this growing and distributed linguistic knowledge base, they are today faced with a new challenge, i.e., that of facilitating multilingual access to the Web of data. In this paper we present the publication of BabelNet 2.0, a wide-coverage multilingual encyclopedic dictionary and ontology, as Linked Data. The conversion made use of lemon, a lexicon model for ontologies particularly well-suited for this enterprise. The result is an interlinked multilingual (lexical) resource which can not only be accessed on the LOD, but also be used to enrich existing datasets with linguistic information, or to support the process of mapping datasets across languages.

pdf bib
(Digital) Goodies from the ERC Wishing Well: BabelNet, Babelfy, Video Games with a Purpose and the Wikipedia Bitaxonomy
Roberto Navigli
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
A Robust Approach to Aligning Heterogeneous Lexical Resources
Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
Tiziano Flati | Daniele Vannella | Tommaso Pasini | Roberto Navigli
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Validating and Extending Semantic Knowledge Bases using Video Games with a Purpose
Daniele Vannella | David Jurgens | Daniele Scarfini | Domenico Toscani | Roberto Navigli
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
WoSIT: A Word Sense Induction Toolkit for Search Result Clustering and Diversification
Daniele Vannella | Tiziano Flati | Roberto Navigli
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)
Johan Bos | Anette Frank | Roberto Navigli
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
SemEval-2014 Task 3: Cross-Level Semantic Similarity
David Jurgens | Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Multilingual Word Sense Disambiguation and Entity Linking
Roberto Navigli | Andrea Moro
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tutorial Abstracts

pdf bib abs
Entity Linking meets Word Sense Disambiguation: a Unified Approach
Andrea Moro | Alessandro Raganato | Roberto Navigli
Transactions of the Association for Computational Linguistics, Volume 2

Entity Linking (EL) and Word Sense Disambiguation (WSD) both address the lexical ambiguity of language. But while the two tasks are pretty similar, they differ in a fundamental respect: in EL the textual mention can be linked to a named entity which may or may not contain the exact mention, while in WSD there is a perfect match between the word form (better, its lemma) and a suitable word sense. In this paper we present Babelfy, a unified graph-based approach to EL and WSD based on a loose identification of candidate meanings coupled with a densest subgraph heuristic which selects high-coherence semantic interpretations. Our experiments show state-of-the-art performances on both tasks on 6 different datasets, including a multilingual setting. Babelfy is online at http://babelfy.org

pdf bib abs
It’s All Fun and Games until Someone Annotates: Video Games with a Purpose for Linguistic Annotation
David Jurgens | Roberto Navigli
Transactions of the Association for Computational Linguistics, Volume 2

Annotated data is prerequisite for many NLP applications. Acquiring large-scale annotated corpora is a major bottleneck, requiring significant time and resources. Recent work has proposed turning annotation into a game to increase its appeal and lower its cost; however, current games are largely text-based and closely resemble traditional annotation tasks. We propose a new linguistic annotation paradigm that produces annotations from playing graphical video games. The effectiveness of this design is demonstrated using two video games: one to create a mapping from WordNet senses to images, and a second game that performs Word Sense Disambiguation. Both games produce accurate results. The first game yields annotation quality equal to that of experts and a cost reduction of 73% over equivalent crowdsourcing; the second game provides a 16.3% improvement in accuracy over current state-of-the-art sense disambiguation games with WordNet.

pdf bib
A Knowledge-based Representation for Cross-Language Document Retrieval and Categorization
Marc Franco-Salvador | Paolo Rosso | Roberto Navigli
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
Stefano Faralli | Roberto Navigli
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
SemEval-2013 Task 11: Word Sense Induction and Disambiguation within an End-User Application
Roberto Navigli | Daniele Vannella
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
SemEval-2013 Task 12: Multilingual Word Sense Disambiguation
Roberto Navigli | David Jurgens | Daniele Vannella
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Paving the Way to a Large-scale Pseudosense-annotated Dataset
Mohammad Taher Pilehvar | Roberto Navigli
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction
Paola Velardi | Stefano Faralli | Roberto Navigli
Computational Linguistics, Volume 39, Issue 3 - September 2013

pdf bib
Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction
Antonio Di Marco | Roberto Navigli
Computational Linguistics, Volume 39, Issue 3 - September 2013

pdf bib
GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web
Flavio De Benedictis | Stefano Faralli | Roberto Navigli
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
SPred: Large-scale Harvesting of Semantic Predicates
Tiziano Flati | Roberto Navigli
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Mohammad Taher Pilehvar | David Jurgens | Roberto Navigli
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Java Framework for Multilingual Definition and Hypernym Extraction
Stefano Faralli | Roberto Navigli
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
Multilingual WSD with Just a Few Lines of Code: the BabelNet API
Roberto Navigli | Simone Paolo Ponzetto
Proceedings of the ACL 2012 System Demonstrations

pdf bib
Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
Roberto Navigli | Simone Paolo Ponzetto
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
A New Minimally-Supervised Framework for Domain Word Sense Disambiguation
Stefano Faralli | Roberto Navigli
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib abs
A New Method for Evaluating Automatically Learned Terminological Taxonomies
Paola Velardi | Roberto Navigli | Stefano Faralli | Juana Maria Ruiz Martinez
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Abstract Evaluating a taxonomy learned automatically against an existing gold standard is a very complex problem, because differences stem from the number, label, depth and ordering of the taxonomy nodes. In this paper we propose casting the problem as one of comparing two hierarchical clusters. To this end we defined a variation of the Fowlkes and Mallows measure (Fowlkes and Mallows, 1983). Our method assigns a similarity value B^i_(l,r) to the learned (l) and reference (r) taxonomy for each cut i of the corresponding anonymised hierarchies, starting from the topmost nodes down to the leaf concepts. For each cut i, the two hierarchies can be seen as two clusterings C^i_l , C^i_r of the leaf concepts. We assign a prize to early similarity values, i.e. when concepts are clustered in a similar way down to the lowest taxonomy levels (close to the leaf nodes). We apply our method to the evaluation of the taxonomy learning methods put forward by Navigli et al. (2011) and Kozareva and Hovy (2010).

2010

pdf bib abs
An Annotated Dataset for Extracting Definitions and Hypernyms from the Web
Roberto Navigli | Paola Velardi | Juana Maria Ruiz-Martínez
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms from web documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: ""An android is a robot"" vs. ""Snowcap is unmistakable"". Domain and style independence is obtained thanks to the annotation of a large and domain-balanced corpus and to a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness.

pdf bib
Inducing Word Senses to Improve Web Search Result Clustering
Roberto Navigli | Giuseppe Crisafulli
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
BabelNet: Building a Very Large Multilingual Semantic Network
Roberto Navigli | Simone Paolo Ponzetto
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Learning Word-Class Lattices for Definition and Hypernym Extraction
Roberto Navigli | Paola Velardi
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Knowledge-Rich Word Sense Disambiguation Rivaling Supervised Systems
Simone Paolo Ponzetto | Roberto Navigli
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

WordNet is the reference sense inventory of most of the current Word Sense Disambiguation systems. Unfortunately, it encodes too fine-grained distinctions, making it difficult even for humans to solve the ambiguity of words in context. In this paper, we present a method for reducing the granularity of the WordNet sense inventory based on the mapping to a manually crafted dictionary encoding sense groups, namely the Oxford Dictionary of English. We assess the quality of the mapping and discuss the potential of the method.

pdf bib
Enriching a Formal Ontology with a Thesaurus: an Application in the Cultural Heritage Domain
Roberto Navigli | Paola Velardi
Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge

pdf bib
Squibs: Consistent Validation of Manual and Automatic Sense Annotations with the Aid of Semantic Graphs
Roberto Navigli
Computational Linguistics, Volume 32, Number 2, June 2006

pdf bib
Experiments on the Validation of Sense Annotations Assisted by Lexical Chains
Roberto Navigli
11th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Online Word Sense Disambiguation with Structural Semantic Interconnections
Roberto Navigli
Demonstrations

pdf bib
Ensemble Methods for Unsupervised WSD
Samuel Brody | Roberto Navigli | Mirella Lapata
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance
Roberto Navigli
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Valido: A Visual Tool for Validating Sense Annotations
Roberto Navigli
Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions