Iñaki San Vicente

Also published as: Iñaki San Vicente


2023

pdf
Not Enough Data to Pre-train Your Language Model? MT to the Rescue!
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Ander Corral
Findings of the Association for Computational Linguistics: ACL 2023

In recent years, pre-trained transformer-based language models (LM) have become a key resource for implementing most NLP tasks. However, pre-training such models demands large text collections not available in most languages. In this paper, we study the use of machine-translated corpora for pre-training LMs. We answer the following research questions: RQ1: Is MT-based data an alternative to real data for learning a LM?; RQ2: Can real data be complemented with translated data and improve the resulting LM? In order to validate these two questions, several BERT models for Basque have been trained, combining real data and synthetic data translated from Spanish.The evaluation carried out on 9 NLU tasks indicates that models trained exclusively on translated data offer competitive results. Furthermore, models trained with real data can be improved with synthetic data, although further research is needed on the matter.

pdf
Scaling Laws for BERT in Low-Resource Settings
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri | Aitor Soroa
Findings of the Association for Computational Linguistics: ACL 2023

Large language models are very resource intensive, both financially and environmentally, and require an amount of training data which is simply unobtainable for the majority of NLP practitioners. Previous work has researched the scaling laws of such models, but optimal ratios of model parameters, dataset size, and computation costs focused on the large scale. In contrast, we analyze the effect those variables have on the performance of language models in constrained settings, by building three lightweight BERT models (16M/51M/124M parameters) trained over a set of small corpora (5M/25M/125M words).We experiment on four languages of different linguistic characteristics (Basque, Spanish, Swahili and Finnish), and evaluate the models on MLM and several NLU tasks. We conclude that the power laws for parameters, data and compute for low-resource settings differ from the optimal scaling laws previously inferred, and data requirements should be higher. Our insights are consistent across all the languages we study, as well as across the MLM and downstream tasks. Furthermore, we experimentally establish when the cost of using a Transformer-based approach is worth taking, instead of favouring other computationally lighter solutions.

2022

pdf
BasqueGLUE: A Natural Language Understanding Benchmark for Basque
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri | Aitor Soroa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.

pdf
Measuring Presence of Women and Men as Information Sources in News
Muitze Zulaika | Xabier Saralegi | Iñaki San Vicente
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In the news, statements from information sources are often quoted, made by individuals who interact in the news. Detecting those quotes and the gender of their sources is a key task when it comes to media analysis from a gender perspective. It is a challenging task: the structure of the quotes is variable, gender marks are not present in many languages, and quote authors are often omitted due to frequent use of coreferences. This paper proposes a strategy to measure the presence of women and men as information sources in news. We approach the problem of detecting sentences including quotes and the gender of the speaker as a joint task, by means of a supervised multiclass classifier of sentences. We have created the first datasets for Spanish and Basque by manually annotating quotes and the gender of the associated sources in news items. The results obtained show that BERT based approaches are significantly better than bag-of-words based classical ones, achieving accuracies close to 90%. We also analyse a bilingual learning strategy and generating additional training examples synthetically; both provide improvements up to 3.4% and 5.6%, respectively.

2021

pdf
GEPSA, a tool for monitoring social challenges in digital press
Iñaki San Vicente | Xabier Saralegi | Nerea Zubia
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

This papers presents a platform for monitoring press narratives with respect to several social challenges, including gender equality, migrations and minority languages. As narratives are encoded in natural language, we have to use natural processing techniques to automate their analysis. Thus, crawled news are processed by means of several NLP modules, including named entity recognition, keyword extraction,document classification for social challenge detection, and sentiment analysis. A Flask powered interface provides data visualization for a user-based analysis of the data. This paper presents the architecture of the system and describes in detail its different components. Evaluation is provided for the modules related to extraction and classification of information regarding social challenges.

2020

pdf
Building a Task-oriented Dialog System for Languages with no Training Data: the Case for Basque
Maddalen López de Lacalle | Xabier Saralegi | Iñaki San Vicente
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents an approach for developing a task-oriented dialog system for less-resourced languages in scenarios where training data is not available. Both intent classification and slot filling are tackled. We project the existing annotations in rich-resource languages by means of Neural Machine Translation (NMT) and posterior word alignments. We then compare training on the projected monolingual data with direct model transfer alternatives. Intent Classifiers and slot filling sequence taggers are implemented using a BiLSTM architecture or by fine-tuning BERT transformer models. Models learnt exclusively from Basque projected data provide better accuracies for slot filling. Combining Basque projected train data with rich-resource languages data outperforms consistently models trained solely on projected data for intent classification. At any rate, we achieve competitive performance in both tasks, with accuracies of 81% for intent classification and 77% for slot filling.

pdf
Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the Twelfth Language Resources and Evaluation Conference

Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

2016

pdf
Polarity Lexicon Building: to what Extent Is the Manual Effort Worth?
Iñaki San Vicente | Xabier Saralegi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Polarity lexicons are a basic resource for analyzing the sentiments and opinions expressed in texts in an automated way. This paper explores three methods to construct polarity lexicons: translating existing lexicons from other languages, extracting polarity lexicons from corpora, and annotating sentiments Lexical Knowledge Bases. Each of these methods require a different degree of human effort. We evaluate how much manual effort is needed and to what extent that effort pays in terms of performance improvement. Experiment setup includes generating lexicons for Basque, and evaluating them against gold standard datasets in different domains. Results show that extracting polarity lexicons from corpora is the best solution for achieving a good performance with reasonable human effort.

pdf
TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente | Iñaki Alegría | Cristina España-Bonet | Pablo Gamallo | Hugo Gonçalo Oliveira | Eva Martínez Garcia | Antonio Toral | Arkaitz Zubiaga | Nora Aranberri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

2015

pdf
EliXa: A Modular and Flexible ABSA Platform
Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf
Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages
Iñaki San Vicente | Rodrigo Agerri | German Rigau
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf
TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish language, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, as the first evaluation environment where the corpus was used.

2012

pdf
Building a Basque-Chinese Dictionary by Using English as Pivot
Xabier Saralegi | Iker Manterola | Iñaki San Vicente
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Bilingual dictionaries are key resources in several fields such as translation, language learning or various NLP tasks. However, only major languages have such resources. Automatically built dictionaries by using pivot languages could be a useful resource in these circumstances. Pivot-based bilingual dictionary building is based on merging two bilingual dictionaries which share a common language (e.g. LA-LB, LB-LC) in order to create a dictionary for a new language pair (e.g LA-LC). This process may include wrong translations due to the polisemy of words. We built Basque-Chinese (Mandarin) dictionaries automatically from Basque-English and Chinese-English dictionaries. In order to prune wrong translations we used different methods adequate for less resourced languages. Inverse Consultation and Distributional Similarity methods are used because they just depend on easily available resources. Finally, we evaluated manually the quality of the built dictionaries and the adequacy of the methods. Both Inverse Consultation and Distributional Similarity provide good precision of translations but recall is seriously damaged. Distributional similarity prunes rare translations more accurately than other methods.

pdf
PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web
Iñaki San Vicente | Iker Manterola
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The importance of parallel corpora in the NLP field is fully acknowledged. This paper presents a tool that can build parallel corpora given just a seed word list and a pair of languages. Our approach is similar to others proposed in the literature, but introduces a new phase to the process. While most of the systems leave the task of finding websites containing parallel content up to the user, PaCo2 (Parallel Corpora Collector) takes care of that as well. The tool is language independent as far as possible, and adapting the system to work with new languages is fairly straightforward. Evaluation of the different modules has been carried out for Basque-Spanish, Spanish-English and Portuguese-English language pairs. Even though there is still room for improvement, results are positive. Results show that the corpora created have very good quality translations units, and the quality is maintained for the various language pairs. Details of the corpora created up until now are also provided.

2011

pdf
Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
Xabier Saralegi | Iker Manterola | Iñaki San Vicente
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing