2020
ELRI: A Decentralised Network of National Relay Stations to Collect, Prepare and Share Language Resources
Thierry Etchegoyhen
|
Borja Anza Porras
|
Andoni Azpeitia
|
Eva Martínez Garcia
|
José Luis Fonseca
|
Patricia Fonseca
|
Paulo Vale
|
Jane Dunne
|
Federico Gaspari
|
Teresa Lynn
|
Helen McHugh
|
Andy Way
|
Victoria Arranz
|
Khalid Choukri
|
Hervé Pusset
|
Alexandre Sicard
|
Rui Neto
|
Maite Melero
|
David Perez
|
António Branco
|
Ruben Branco
|
Luís Gomes
Proceedings of the 1st International Workshop on Language Technology Platforms
We describe the European Language Resource Infrastructure (ELRI), a decentralised network to help collect, prepare and share language resources. The infrastructure was developed within a project co-funded by the Connecting Europe Facility Programme of the European Union, and has been deployed in the four Member States participating in the project, namely France, Ireland, Portugal and Spain. ELRI provides sustainable and flexible means to collect and share language resources via National Relay Stations, to which members of public institutions can freely subscribe. The infrastructure includes fully automated data processing engines to facilitate the preparation, sharing and wider reuse of useful language resources that can help optimise human and automated translation services in the European Union.
2018
Supervised and Unsupervised Minimalist Quality Estimators: Vicomtech’s Participation in the WMT 2018 Quality Estimation Task
Thierry Etchegoyhen
|
Eva Martínez Garcia
|
Andoni Azpeitia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
We describe Vicomtech’s participation in the WMT 2018 shared task on quality estimation, for which we submitted minimalist quality estimators. The core of our approach is based on two simple features: lexical translation overlaps and language model cross-entropy scores. These features are exploited in two system variants: uMQE is an unsupervised system, where the final quality score is obtained by averaging individual feature scores; sMQE is a supervised variant, where the final score is estimated by a Support Vector Regressor trained on the available annotated datasets. The main goal of our minimalist approach to quality estimation is to provide reliable estimators that require minimal deployment effort, few resources, and, in the case of uMQE, do not depend on costly data annotation or post-editing. Our approach was applied to all language pairs in sentence quality estimation, obtaining competitive results across the board.
STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering
Andoni Azpeitia
|
Thierry Etchegoyhen
|
Eva Martínez Garcia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
We describe Vicomtech’s participation in the WMT 2018 Shared Task on parallel corpus filtering. We aimed to evaluate a simple approach to the task, which can efficiently process large volumes of data and can be easily deployed for new datasets in different language pairs and domains. We based our approach on STACC, an efficient and portable method for parallel sentence identification in comparable corpora. To address the specifics of the corpus filtering task, which features significant volumes of noisy data, the core method was expanded with a penalty based on the amount of unknown words in sentence pairs. Additionally, we experimented with a complementary data saturation method based on source sentence n-grams, with the goal of demoting parallel sentence pairs that do not contribute significant amounts of yet unobserved n-grams. Our approach requires no prior training and is highly efficient on the type of large datasets featured in the corpus filtering task. We achieved competitive results with this simple and portable method, ranking in the top half among competing systems overall.
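The n-gram saturation idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the maximum n-gram order (`max_n`), the threshold (`min_new_ratio`) and the whitespace tokenisation are all assumptions made for illustration. The sketch walks through sentence pairs in ranked order and demotes any pair whose source sentence contributes too few n-grams that have not been observed so far.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]
Pair = Tuple[str, str]


def ngrams(tokens: List[str], max_n: int = 3) -> Set[NGram]:
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def saturation_filter(
    pairs: Iterable[Pair],
    max_n: int = 3,             # hypothetical maximum n-gram order
    min_new_ratio: float = 0.1  # hypothetical threshold for "enough" new n-grams
) -> Tuple[List[Pair], List[Pair]]:
    """Split (source, target) pairs into kept and demoted lists.

    Pairs are assumed to arrive sorted from best to worst alignment score.
    A pair is demoted when the share of its source n-grams that have not
    been seen in previously kept pairs falls below min_new_ratio.
    """
    seen: Set[NGram] = set()
    kept: List[Pair] = []
    demoted: List[Pair] = []
    for src, tgt in pairs:
        grams = ngrams(src.split(), max_n)  # naive whitespace tokenisation
        new_ratio = len(grams - seen) / len(grams) if grams else 0.0
        if new_ratio >= min_new_ratio:
            kept.append((src, tgt))
            seen |= grams  # only kept pairs count as observed data
        else:
            demoted.append((src, tgt))
    return kept, demoted
```

In this sketch an exact repeat of an earlier source sentence contributes no new n-grams and is therefore demoted, while a sentence with entirely new vocabulary is kept.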
Evaluating Domain Adaptation for Machine Translation Across Scenarios
Thierry Etchegoyhen
|
Anna Fernández Torné
|
Andoni Azpeitia
|
Eva Martínez Garcia
|
Anna Matamala
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Neural Machine Translation of Basque
Thierry Etchegoyhen
|
Eva Martínez Garcia
|
Andoni Azpeitia
|
Gorka Labaka
|
Iñaki Alegria
|
Itziar Cortes Etxabe
|
Amaia Jauregi Carrera
|
Igor Ellakuria Santos
|
Maite Martin
|
Eusebi Calonge
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
We describe the first experimental results in neural machine translation for Basque. As a synthetic language featuring agglutinative morphology, an extended case system, complex verbal morphology and relatively free word order, Basque presents a large number of challenging characteristics for machine translation in general, and for data-driven approaches such as attention-based encoder-decoder models in particular. We present our results on a large range of experiments in Basque-Spanish translation, comparing several neural machine translation system variants with both rule-based and statistical machine translation systems. We demonstrate that significant gains can be obtained with a neural network approach for this challenging language pair, and describe optimal configurations in terms of word segmentation and decoding parameters, measured against test sets that feature multiple references to account for word order variability.
ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen
|
Borja Anza Porras
|
Andoni Azpeitia
|
Eva Martínez Garcia
|
Paulo Vale
|
José Luis Fonseca
|
Teresa Lynn
|
Jane Dunne
|
Federico Gaspari
|
Andy Way
|
Victoria Arranz
|
Khalid Choukri
|
Vladimir Popescu
|
Pedro Neiva
|
Rui Neto
|
Maite Melero
|
David Perez Fernandez
|
Antonio Branco
|
Ruben Branco
|
Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.
2017
Weighted Set-Theoretic Alignment of Comparable Sentences
Andoni Azpeitia
|
Thierry Etchegoyhen
|
Eva Martínez Garcia
Proceedings of the 10th Workshop on Building and Using Comparable Corpora
This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had been previously shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.
Exploiting Relative Frequencies for Data Selection
Thierry Etchegoyhen
|
Andoni Azpeitia
|
Eva Martinez García
Proceedings of Machine Translation Summit XVI: Research Track
2016
DOCAL - Vicomtech’s Participation in the WMT16 Shared Task on Bilingual Document Alignment
Andoni Azpeitia
|
Thierry Etchegoyhen
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
A Portable Method for Parallel and Comparable Document Alignment
Thierry Etchegoyhen
|
Andoni Azpeitia
Proceedings of the 19th Annual Conference of the European Association for Machine Translation
Exploiting a Large Strongly Comparable Corpus
Thierry Etchegoyhen
|
Andoni Azpeitia
|
Naiara Pérez
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community, as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for the improvement of data-driven machine translation systems for this language pair. Competing approaches were explored for the alignment of comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.
Set-Theoretic Alignment for Comparable Corpora
Thierry Etchegoyhen
|
Andoni Azpeitia
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2015
The reception of intralingual and interlingual automatic subtitling: An exploratory study within the HBB4ALL project
Anna Matamala
|
Andreu Oliver
|
Aitor Álvarez
|
Andoni Azpeitia
Proceedings of Translating and the Computer 37
2014
Generating Polarity Lexicons with WordNet propagation in 5 languages
Isa Maks
|
Ruben Izquierdo
|
Francesca Frontini
|
Rodrigo Agerri
|
Piek Vossen
|
Andoni Azpeitia
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we focus on the creation of general-purpose (as opposed to domain-specific) polarity lexicons in five languages: French, Italian, Dutch, English and Spanish, using WordNet propagation. WordNet propagation is a commonly used method to generate these lexicons, as it gives high coverage of general-purpose language, and the semantically rich WordNets, where concepts are organised in synonym, antonym and hyperonym/hyponym structures, seem to be well suited to the identification of positive and negative words. However, WordNets of different languages may vary in many ways, such as the way they are compiled, the number of synsets, the number of synonyms and the number of semantic relations they include. In this study we investigate whether this variability translates into differences in performance when these WordNets are used for polarity propagation. Although many variants of the propagation method have been developed for English, little is known about how they perform with WordNets of other languages. We implemented a propagation algorithm and designed a method to obtain seed lists that are similar with respect to quality and size for each of the five languages. We evaluated the results against gold standards also developed according to a common method in order to achieve as little variance as possible between the different languages.