Ander Barrena


2021

Label Verbalization and Entailment for Effective Zero and Few-Shot Relation Extraction
Oscar Sainz | Oier Lopez de Lacalle | Gorka Labaka | Ander Barrena | Eneko Agirre
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Relation extraction systems require large amounts of labeled examples, which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made verbalizations of relations produced in less than 15 minutes per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned on labeled examples (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot, 69% with 16 examples per relation (17 points better than the best supervised system under the same conditions), and only 4 points short of the state of the art (which uses 20 times more training data). We also show that performance can be improved significantly with larger entailment models, by up to 12 points in zero-shot, allowing us to report the best results to date on TACRED when fully trained. The analysis shows that our few-shot systems are especially effective when discriminating between relations, and that the performance difference in low-data regimes comes mainly from identifying no-relation cases.
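
As a rough illustration of the idea sketched in this abstract, the snippet below scores hand-made relation verbalizations as NLI hypotheses against the input sentence with an off-the-shelf entailment model. The model name, the templates, and the no-relation threshold are assumptions of this sketch, not the paper's exact configuration.

```python
# Minimal sketch of the entailment reformulation: each relation is verbalized
# as a hypothesis and scored against the sentence (premise) with an NLI model.
# Model name, templates and threshold are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Hand-made verbalizations: one natural-language template per relation label.
TEMPLATES = {
    "per:city_of_birth": "{subj} was born in {obj}.",
    "org:founded_by": "{subj} was founded by {obj}.",
    "per:employee_of": "{subj} works for {obj}.",
}

def entailment_prob(premise, hypothesis):
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return probs[model.config.label2id["ENTAILMENT"]].item()

def extract_relation(sentence, subj, obj, threshold=0.5):
    """Return the relation whose verbalization is most entailed, else no_relation."""
    best_rel, best_score = "no_relation", threshold
    for rel, template in TEMPLATES.items():
        score = entailment_prob(sentence, template.format(subj=subj, obj=obj))
        if score > best_score:
            best_rel, best_score = rel, score
    return best_rel

print(extract_relation("Billie Holiday was born in Philadelphia in 1915.",
                       subj="Billie Holiday", obj="Philadelphia"))
```

Fine-tuning the same entailment model on a handful of labeled pairs per relation would correspond to the few-shot setting described in the abstract.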

2020

Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the Twelfth Language Resources and Evaluation Conference

Word embeddings and pre-trained language models allow us to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, those models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares its quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained on larger Basque corpora produce much better results than the publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.
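
The point about multilingual models sharing the subword quota across languages can be seen with a small tokenizer comparison; the checkpoint names below (the multilingual BERT tokenizer and what is assumed to be the released BERTeus model) and the example sentence are assumptions of this sketch rather than claims of the paper.

```python
# A multilingual tokenizer, whose vocabulary is shared across ~100 languages,
# tends to split Basque words into many more pieces than a monolingual one.
# Both checkpoint identifiers are assumptions of this illustrative sketch.
from transformers import AutoTokenizer

sentence = "Euskarazko hizkuntza-ereduek emaitza hobeak lortzen dituzte."

for name in ("bert-base-multilingual-cased", "ixa-ehu/berteus-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    pieces = tokenizer.tokenize(sentence)
    print(f"{name}: {len(pieces)} subword pieces -> {pieces}")
```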

2018

Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Named Entity Disambiguation algorithms typically learn a single model for all target entities. In this paper we present a word-expert model and train separate deep learning models for each target entity string, yielding 500K classification tasks. This gives us the opportunity to benchmark popular text representation alternatives on this massive dataset. To cope with scarce training data, we propose a simple data-augmentation technique and transfer learning. We show that bag-of-word-embeddings are better than LSTMs for tasks with scarce training data, while the situation is reversed with larger amounts. Transferring an LSTM learned on all datasets is the most effective context representation option for the word experts in all frequency bands. The experiments show that our system, trained on out-of-domain Wikipedia data, surpasses comparable NED systems that have been trained on in-domain data.
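
A toy version of the word-expert setup described here, with one classifier per mention string over a bag-of-word-embeddings context, might look as follows; the embeddings, mention string and candidate titles are illustrative stand-ins, and the paper trains roughly 500K such experts, one per target entity string.

```python
# Minimal sketch of a single "word expert": a separate classifier for one
# ambiguous mention string, fed an averaged-word-embedding (bag-of-embeddings)
# context representation. Toy random embeddings stand in for pretrained ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50
VOCAB = {w: rng.standard_normal(DIM) for w in
         "wrote code program language island volcano indonesia coffee".split()}

def bag_of_embeddings(context):
    """Average the embeddings of the context words (unknown words are skipped)."""
    vecs = [VOCAB[w] for w in context.lower().split() if w in VOCAB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# One expert for the mention string "Java"; labels are candidate Wikipedia titles.
contexts = ["she wrote the program in the language",
            "the volcano on the island in indonesia"]
labels = ["Java_(programming_language)", "Java"]

expert = LogisticRegression().fit(
    np.stack([bag_of_embeddings(c) for c in contexts]), labels)

print(expert.predict([bag_of_embeddings("he wrote java code for the program")]))
```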

2016

Alleviating Poor Context with Background Knowledge for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

“One Entity per Discourse” and “One Entity per Collocation” Improve Named-Entity Disambiguation
Ander Barrena | Eneko Agirre | Bernardo Cabaleiro | Anselmo Peñas | Aitor Soroa
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2012

Matching Cultural Heritage items to Wikipedia
Eneko Agirre | Ander Barrena | Oier Lopez de Lacalle | Aitor Soroa | Samuel Fernando | Mark Stevenson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Digitised Cultural Heritage (CH) items usually have short descriptions and lack rich contextual information. Wikipedia articles, on the contrary, include in-depth descriptions and links to related articles, which motivates enriching CH items with information from Wikipedia. In this paper we explore the feasibility of finding matching articles in Wikipedia for a given Cultural Heritage item. We manually annotated a random sample of items from Europeana and performed a qualitative and quantitative study of the issues and problems that arise, showing that each kind of CH item is different and needs a nuanced definition of what "matching article" means. In addition, we test a well-known wikification (aka entity linking) algorithm on the task. Our results indicate that a substantial number of items can be effectively linked to their corresponding Wikipedia article.