Nicola Cancedda


2023

pdf bib
ERATE: Efficient Retrieval Augmented Text Embeddings
Vatsal Raina | Nora Kassner | Kashyap Popat | Patrick Lewis | Nicola Cancedda | Louis Martin
The Fourth Workshop on Insights from Negative Results in NLP

Embedding representations of text are useful for downstream natural language processing tasks. Several universal sentence representation methods have been proposed, with a particular focus on self-supervised pre-training approaches that leverage the vast quantities of unlabelled data. However, there are two challenges in generating rich embedding representations for a new document. 1) The latest rich embedding generators are based on very large, costly transformer-based architectures. 2) The embedding representation of a new document is limited to the information the document itself provides, without access to any explicit contextual and temporal information that could further enrich the representation. We propose efficient retrieval-augmented text embeddings (ERATE), which tackle the first issue and offer a method for tackling the second. To the best of our knowledge, we are the first to incorporate retrieval into general-purpose embeddings as a new paradigm, which we apply to the semantic similarity tasks of SentEval. Despite not reaching state-of-the-art performance, ERATE offers key insights that encourage future work on retrieval-based embeddings.
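
As a rough illustration of the retrieval-augmented embedding idea described in the abstract (not the authors' implementation), the sketch below mixes a document's own embedding with the embeddings of its nearest neighbours retrieved from a corpus; the embed function, the number of neighbours and the mixing weight are hypothetical placeholders.

import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Hypothetical cheap base encoder (stand-in for a small sentence encoder).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieval_augmented_embedding(query, corpus_vecs, k=3, alpha=0.5):
    # Mix the query's own embedding with the mean of its k nearest neighbours.
    q = embed(query)
    sims = corpus_vecs @ q                      # cosine similarity (all vectors are unit-norm)
    top_k = corpus_vecs[np.argsort(-sims)[:k]]  # retrieve the k most similar documents
    mixed = alpha * q + (1 - alpha) * top_k.mean(axis=0)
    return mixed / np.linalg.norm(mixed)

corpus = ["first document", "second document", "third document", "fourth document"]
corpus_vecs = np.stack([embed(d) for d in corpus])
print(retrieval_augmented_embedding("a new document", corpus_vecs).shape)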

2022

pdf
Multilingual Autoregressive Entity Linking
Nicola De Cao | Ledell Wu | Kashyap Popat | Mikel Artetxe | Naman Goyal | Mikhail Plekhanov | Luke Zettlemoyer | Nicola Cancedda | Sebastian Riedel | Fabio Petroni
Transactions of the Association for Computational Linguistics, Volume 10

We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem—the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where we establish new state-of-the-art results. Source code available at https://github.com/facebookresearch/GENRE.
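
The marginalisation over languages mentioned in the abstract can be pictured with a toy example; the scoring function below is a made-up stand-in for the autoregressive seq2seq score, not the mGENRE code, and the candidate table is illustrative.

import math

def name_log_prob(mention, name, lang):
    # Hypothetical stand-in for the autoregressive score log p(name, lang | mention).
    overlap = len(set(mention.lower()) & set(name.lower()))
    return overlap - 0.1 * len(name)

def entity_score(mention, names_by_lang):
    # Treat the language as a latent variable: log sum_l exp(log p(name_l, l | mention)).
    return math.log(sum(math.exp(name_log_prob(mention, n, l))
                        for l, n in names_by_lang.items()))

candidates = {
    "Q64":  {"en": "Berlin",  "it": "Berlino",  "ru": "Берлин"},
    "Q183": {"en": "Germany", "it": "Germania", "ru": "Германия"},
}
mention = "the capital Berlin"
print(max(candidates, key=lambda qid: entity_score(mention, candidates[qid])))  # "Q64"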

pdf
EDIN: An End-to-end Benchmark and Pipeline for Unknown Entity Discovery and Indexing
Nora Kassner | Fabio Petroni | Mikhail Plekhanov | Sebastian Riedel | Nicola Cancedda
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Existing work on entity linking mostly assumes that the reference knowledge base is complete and that all mentions can therefore be linked. In practice this is hardly ever the case: knowledge bases are incomplete and novel concepts arise constantly. We introduce the temporally segmented Unknown Entity Discovery and Indexing (EDIN) benchmark, in which unknown entities, i.e., entities that are not part of the knowledge base and have no descriptions or labeled mentions, have to be integrated into an existing entity linking system. By contrasting EDIN with zero-shot entity linking, we provide insight into the additional challenges it poses. Building on dense-retrieval-based entity linking, we introduce the end-to-end EDIN pipeline, which detects, clusters, and indexes mentions of unknown entities in context. Experiments show that indexing a single embedding per entity, unifying the information of multiple mentions, works better than indexing mentions independently.
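
The idea of indexing a single embedding per discovered entity can be illustrated with a simple greedy clustering sketch. This is only an illustration of that indexing step, not the EDIN pipeline; the similarity threshold and the greedy assignment rule are arbitrary assumptions.

import numpy as np

def cluster_and_index(mention_vecs, threshold=0.8):
    centroids, members = [], []
    for v in mention_vecs:
        v = v / np.linalg.norm(v)
        sims = [float(c @ v) for c in centroids]
        if sims and max(sims) >= threshold:        # join the closest existing cluster
            i = int(np.argmax(sims))
            members[i].append(v)
            c = np.mean(members[i], axis=0)
            centroids[i] = c / np.linalg.norm(c)   # re-normalised centroid = index entry
        else:                                      # start a new (unknown) entity
            members.append([v])
            centroids.append(v)
    return np.stack(centroids)                     # one embedding per discovered entity

rng = np.random.default_rng(0)
unknown_mentions = rng.normal(size=(10, 32))       # placeholder mention embeddings
print(cluster_and_index(unknown_mentions).shape)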

2014

pdf
Fast Domain Adaptation of SMT models without in-Domain Parallel Data
Prashant Mathur | Sriram Venkatapathy | Nicola Cancedda
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf
Assessing quick update methods of statistical translation models
Shachar Mirkin | Nicola Cancedda
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

The ability to quickly incorporate incoming training data into a running translation system is critical in a number of applications. Mechanisms based on incremental model updates and the online EM algorithm hold the promise of achieving this objective in a principled way. Still, efficient tools for incremental training are not yet available. In this paper we experiment with simple alternative solutions for interim model updates within the popular Moses system. Short of updating the model in real time, such updates can execute in short time frames even when operating on large models, and achieve a performance level close to, and in some cases exceeding, that of batch retraining.
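
As one hedged illustration of what an interim model update could look like (a generic example, not necessarily the specific update strategies evaluated in the paper), a small phrase table estimated from newly arrived data can be folded into an existing one by linearly interpolating the translation probabilities:

def interpolate_phrase_tables(base, update, lam=0.9):
    # p(e|f) = lam * p_base(e|f) + (1 - lam) * p_update(e|f) for every phrase pair.
    merged = {}
    for pair in set(base) | set(update):
        merged[pair] = lam * base.get(pair, 0.0) + (1 - lam) * update.get(pair, 0.0)
    return merged

# Toy phrase tables keyed by (source phrase, target phrase).
base = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
update = {("maison", "home"): 0.8, ("maison", "household"): 0.2}
print(interpolate_phrase_tables(base, update))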

pdf
Generation of Compound Words in Statistical Machine Translation into Compounding Languages
Sara Stymne | Nicola Cancedda | Lars Ahrenberg
Computational Linguistics, Volume 39, Issue 4 - December 2013

2012

pdf
Task-Driven Linguistic Analysis based on an Underspecified Features Representation
Stasinos Konstantopoulos | Valia Kordoni | Nicola Cancedda | Vangelis Karkaletsis | Dietrich Klakow | Jean-Michel Renders
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we explore a task-driven approach to interfacing NLP components, where language processing is guided by the end task that each application requires. The core idea is to generalize feature values into feature value distributions, representing under-specified feature values, and to fit linguistic pipelines with a back-channel of specification requests through which subsequent components can signal to preceding ones the importance of narrowing the value distribution of particular features that are critical for the current task.
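
A minimal sketch of the idea, with invented component names and a made-up narrowing rule, might look as follows: an upstream tagger emits a distribution over feature values, and a downstream component issues a back-channel request to narrow it when the distribution is too flat for its task.

class Tagger:
    # Upstream component producing under-specified POS distributions (value -> probability).
    def analyse(self, token):
        return {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}

    def narrow(self, token, dist):
        # Pretend a more expensive analysis lets us commit harder to the best value.
        best = max(dist, key=dist.get)
        rest = {v: 0.1 / (len(dist) - 1) for v in dist if v != best}
        return {best: 0.9, **rest}

class RelationExtractor:
    # Downstream component; asks for narrowing when the distribution is too flat.
    def needs_narrowing(self, dist):
        return max(dist.values()) < 0.7   # task-specific criticality threshold

tagger, extractor = Tagger(), RelationExtractor()
dist = tagger.analyse("run")
if extractor.needs_narrowing(dist):       # back-channel specification request
    dist = tagger.narrow("run", dist)
print(dist)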

pdf
Prediction of Learning Curves in Machine Translation
Prasanth Kolachina | Nicola Cancedda | Marc Dymetman | Sriram Venkatapathy
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Private Access to Phrase Tables for Statistical Machine Translation
Nicola Cancedda
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

pdf
Productive Generation of Compound Words in Statistical Machine Translation
Sara Stymne | Nicola Cancedda
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf
Confidence-Weighted Learning of Factored Discriminative Language Models
Viet Ha-Thuc | Nicola Cancedda
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf
Minimum Error Rate Training by Sampling the Translation Lattice
Samidh Chatterjee | Nicola Cancedda
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms
Marc Dymetman | Nicola Cancedda
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
A Dataset for Assessing Machine Translation Evaluation Metrics
Lucia Specia | Nicola Cancedda | Marc Dymetman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a dataset containing 16,000 translations produced by four machine translation systems and manually annotated for quality by professional translators. This dataset can be used in a range of tasks assessing machine translation evaluation metrics, from basic correlation analysis to the training and testing of machine-learning-based metrics. By providing a standard dataset for such tasks, we hope to encourage the development of better MT evaluation metrics.
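
For instance, the basic correlation analysis mentioned above amounts to correlating a metric's segment-level scores with the human quality annotations; the numbers below are made-up placeholders, not values from the dataset.

import numpy as np

human = np.array([4.0, 2.0, 3.5, 1.0, 5.0])        # professional quality judgements
metric = np.array([0.62, 0.31, 0.55, 0.20, 0.78])  # automatic metric scores for the same segments

pearson_r = np.corrcoef(human, metric)[0, 1]
print(f"Pearson correlation: {pearson_r:.3f}")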

pdf
Machine Translation Using Overlapping Alignments and SampleRank
Benjamin Roth | Andrew McCallum | Marc Dymetman | Nicola Cancedda
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We present a conditional-random-field approach to discriminatively trained phrase-based machine translation in which training and decoding are both cast in a sampling framework and are implemented uniformly in a new probabilistic programming language for factor graphs. In traditional phrase-based translation, decoding infers both a "Viterbi" alignment and the target sentence. In contrast, in our approach, a rich overlapping-phrase alignment is produced by a fast deterministic method, while probabilistic decoding infers only the target sentence, which can then leverage arbitrary features of the entire source sentence, target sentence and alignment. By using SampleRank for learning, we could in principle efficiently estimate hundreds of thousands of parameters. Test-time decoding is done by MCMC sampling with annealing. To demonstrate the potential of our approach we show preliminary experiments leveraging alignments that may contain overlapping bi-phrases.
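
To give a flavour of the annealed MCMC decoding mentioned above, here is a toy sketch in which a made-up model score is optimised by Metropolis sampling with a decreasing temperature; the scoring function and the proposal move are placeholders, not the paper's factor-graph model.

import math
import random

def model_score(hypothesis):
    # Hypothetical stand-in for the factor-graph score of (source, hypothesis, alignment).
    return -abs(len(hypothesis) - 5) + 0.1 * sum(len(w) for w in hypothesis)

def propose(hypothesis, vocab):
    # Simple local move: replace one word with a random vocabulary word.
    new = list(hypothesis)
    new[random.randrange(len(new))] = random.choice(vocab)
    return new

def annealed_decode(init, vocab, steps=2000):
    current, best = init, init
    for t in range(steps):
        temp = max(0.01, 1.0 - t / steps)                            # annealing schedule
        cand = propose(current, vocab)
        delta = model_score(cand) - model_score(current)
        if delta > 0 or random.random() < math.exp(delta / temp):    # Metropolis acceptance
            current = cand
            if model_score(current) > model_score(best):
                best = current
    return best

vocab = ["the", "cat", "sat", "on", "mat", "a", "dog"]
print(annealed_decode(["the", "the", "the", "the", "the"], vocab))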

2009

pdf
Complexity-Based Phrase-Table Filtering for Statistical Machine Translation
Nadi Tomeh | Nicola Cancedda | Marc Dymetman
Proceedings of Machine Translation Summit XII: Papers

pdf
Estimating the Sentence-Level Quality of Machine Translation Systems
Lucia Specia | Marco Turchi | Nicola Cancedda | Nello Cristianini | Marc Dymetman
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf bib
Introduction
Nicola Cancedda
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Sentence-level confidence estimation for MT
Lucia Specia | Nicola Cancedda | Marc Dymetman | Craig Saunders | Marco Turchi | Nello Cristianini | Zhuoran Wang | John Shawe-Taylor
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Closing remarks
Nicola Cancedda
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem
Mikhail Zaslavskiy | Marc Dymetman | Nicola Cancedda
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Shaping research from user requirements, and other exotic things...
Nicola Cancedda
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

2005

pdf
Translating with Non-contiguous Phrases
Michel Simard | Nicola Cancedda | Bruno Cavestro | Marc Dymetman | Eric Gaussier | Cyril Goutte | Kenji Yamada | Philippe Langlais | Arne Mauser
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

pdf
Une approche à la traduction automatique statistique par segments discontinus
Michel Simard | Nicola Cancedda | Bruno Cavestro | Marc Dymetman | Eric Gaussier | Cyril Goutte | Philippe Langlais | Arne Mauser | Kenji Yamada
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper presents a statistical machine translation method based on non-contiguous phrases, i.e., phrases made of words that do not necessarily appear contiguously in the text. We propose a method for producing such phrases from word-aligned corpora. We also present a statistical translation model capable of taking such phrases into account, as well as a method for learning the model parameters so as to maximize the accuracy of the produced translations, as measured with the NIST metric. Optimal translations are produced by means of a beam search. Finally, we present experimental results showing how the proposed method allows better generalization from the training data.
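
A small illustration of what a non-contiguous phrase is: a source pattern such as the French negation "ne ... pas", whose words are not adjacent, can still be matched and translated as a single unit. The data structures below are illustrative assumptions, not the paper's representation; for simplicity each gap here stands for exactly one skipped word.

GAP = "<gap>"   # placeholder for exactly one skipped source word (a simplification)

def matches_at(pattern, sentence, start):
    # Does the possibly gapped pattern match the sentence starting at position `start`?
    if start + len(pattern) > len(sentence):
        return False
    return all(p == GAP or p == w
               for p, w in zip(pattern, sentence[start:start + len(pattern)]))

# The negation "ne ... pas" is handled as one non-contiguous source phrase
# paired with the single target word "not".
bi_phrase = (["ne", GAP, "pas"], ["not"])
sentence = "je ne comprends pas".split()
print(any(matches_at(bi_phrase[0], sentence, i) for i in range(len(sentence))))  # True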

2002

pdf
Combining Labelled and Unlabelled Data: A Case Study on Fisher Kernels and Transductive Inference for Biological Entity Recognition
Cyril Goutte | Hervé Déjean | Eric Gaussier | Nicola Cancedda | Jean-Michel Renders
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

2001

pdf
Probabilistic models for PP-attachment resolution and NP analysis
Eric Gaussier | Nicola Cancedda
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

pdf
Learning Computational Grammars
John Nerbonne | Anja Belz | Nicola Cancedda | Hervé Déjean | James Hammerton | Rob Koeling | Stasinos Konstantopoulos | Miles Osborne | Franck Thollard | Erik F. Tjong Kim Sang
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

2000

pdf
Experiments with Corpus-based LFG Specialization
Nicola Cancedda | Christer Samuelsson
Sixth Applied Natural Language Processing Conference

pdf bib
Corpus-Based Grammar Specialization
Nicola Cancedda | Christer Samuelsson
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop