Antoni Oliver


2024

Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian
Antoni Oliver | Sergi Alvarez-Vidal | Egon Stemle | Elena Chiocchetti
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

This paper illustrates the process of training and evaluating NMT systems for a language pair that includes a low-resource language variety. A parallel corpus of legal texts for Italian and South Tyrolean German has been compiled, with South Tyrolean German being the low-resourced language variety. As the size of the compiled corpus is insufficient for the training, we have combined the corpus with several parallel corpora using data weighting at sentence level. We then performed an evaluation of each combination and of two popular commercial systems.

Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian
Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Aaron Galiano Jimenez | Antoni Oliver | Claudi Aventín-Boya | Alejandro Pardos | Cristina Valdés | Jusèp Loís Sans Socasau | Juan Pablo Martínez
Proceedings of the Ninth Conference on Machine Translation

In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.

Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems
Felipe Sánchez-Martínez | Juan Antonio Perez-Ortiz | Aaron Galiano Jimenez | Antoni Oliver
Proceedings of the Ninth Conference on Machine Translation

This paper presents the results of the Ninth Conference on Machine Translation (WMT24) Shared Task “Translation into Low-Resource Languages of Spain”. The task focused on the development of machine translation systems for three language pairs: Spanish-Aragonese, Spanish-Aranese, and Spanish-Asturian. 17 teams participated in the shared task with a total of 87 submissions. The baseline system for all language pairs was Apertium, a rule-based machine translation system that still performs competitively well, even in an era dominated by more advanced non-symbolic approaches. We report and discuss the results of the submitted systems, highlighting the strengths of both neural and rule-based approaches.

TAN-IBE Participation in the Shared Task: Translation into Low-Resource Languages of Spain
Antoni Oliver
Proceedings of the Ninth Conference on Machine Translation

This paper describes the systems presented by the TAN-IBE team to the WMT24 Shared Task “Translation into Low-Resource Languages of Spain”. The aim of this shared task was to train systems for Spanish-Asturian, Spanish-Aragonese and Spanish-Aranese. Our team presented systems for all three language pairs and for two types of submission: for Spanish-Aragonese and Spanish-Aranese we participated with constrained submissions, and for Spanish-Asturian with an open submission.

Using a multilingual literary parallel corpus to train NMT systems
Bojana Mikelenić | Antoni Oliver
Proceedings of the 1st Workshop on Creative-text Translation and Technology

This article presents an application of a multilingual and multidirectional parallel corpus composed of literary texts in five Romance languages (Spanish, French, Italian, Portuguese, Romanian) and a Slavic language (Croatian), with a total of 142,000 segments and 15.7 million words. After combining it with very large freely available parallel corpora, this resource is used to train NMT systems tailored to literature. A total of five NMT systems have been trained: Spanish-French, Spanish-Italian, Spanish-Portuguese, Spanish-Romanian and Spanish-Croatian. The trained systems were evaluated using automatic metrics (BLEU, chrF2 and TER) and a comparison with a rule-based MT system (Apertium) and a neural system (Google Translate) is presented. As a main conclusion, we can highlight that the use of this literary corpus has been very productive, as the majority of the trained systems achieve comparable, and in some cases even better, values of the automatic quality metrics than a widely used commercial NMT system.

LitPC: A set of tools for building parallel corpora from literary works
Antoni Oliver | Sergi Alvarez-Vidal
Proceedings of the 1st Workshop on Creative-text Translation and Technology

In this paper, we describe the LitPC toolkit, a variety of tools and methods designed for the quick and effective creation of parallel corpora derived from literary works. This toolkit can be a useful resource due to the scarcity of curated parallel texts for this domain. We also feature a case study describing the creation of a Russian-English parallel corpus based on the literary works by Leo Tolstoy. Furthermore, an augmented version of this corpus is used to both train and assess neural machine translation systems specifically adapted to the author’s style.

2023

PE effort and neural-based automatic MT metrics: do they correlate?
Sergi Alvarez | Antoni Oliver
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Neural machine translation (NMT) has shown overwhelmingly good results in recent times. This improvement in quality has boosted the presence of NMT in nearly all fields of translation. Most current translation industry workflows include post-editing (PE) of MT as part of their process. For many domains and language combinations, translators post-edit raw machine translation (MT) to produce the final document. However, this process can only work properly if the quality of the raw MT output can be assured. MT is usually evaluated using automatic scores, as they are much faster and cheaper. However, traditional automatic scores have not been good quality indicators and do not correlate with PE effort. We analyze the correlation of each of the three dimensions of PE effort (temporal, technical and cognitive) with COMET, a neural framework which has obtained outstanding results in recent MT evaluation campaigns.

TAN-IBE: Neural Machine Translation for the romance languages of the Iberian Peninsula
Antoni Oliver | Mercè Vàzquez | Marta Coll-Florit | Sergi Álvarez | Víctor Suárez | Claudi Aventín-Boya | Cristina Valdés | Mar Font | Alejandro Pardos
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

The main goal of this project is to explore the techniques for training NMT systems applied to Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese. These languages belong to the same Romance family, but they are very different in terms of the linguistic resources available. Asturian, Aragonese and Aranese can be considered low-resource languages. These characteristics make this setting an excellent place to explore training techniques for low-resource languages: transfer learning and multilingual systems, among others. The first months of the project have been dedicated to the compilation of monolingual and parallel corpora for Asturian, Aragonese and Aranese.

Training and integration of neural machine translation with MTUOC
Antoni Oliver | Sergi Alvarez
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation

In this paper the goals and main objectives of the MTUOC project are presented. This project aims to ease the process of training and integrating neural machine translation (NMT) systems into professional translation environments. The MTUOC project distributes a series of auxiliary tools for parallel corpus compilation and preprocessing, as well as for the training of NMT systems. The project also distributes a server that implements most of the communication protocols used in computer-assisted translation tools.

2020

TermEval 2020: Using TSR Filtering Method to Improve Automatic Term Extraction
Antoni Oliver | Mercè Vàzquez
Proceedings of the 6th International Workshop on Computational Terminology

The identification of terms from domain-specific corpora using computational methods is a highly time-consuming task because terms have to be validated by specialists. In order to improve term candidate selection, we have developed the Token Slot Recognition (TSR) method, a filtering strategy based on terminological tokens which is used to rank extracted term candidates from domain-specific corpora. We have implemented this filtering strategy in TBXTools. In this paper we present the system we have used in the TermEval 2020 shared task on monolingual term extraction. We also present the evaluation results for the system for English, French and Dutch and for two corpora: corruption and heart failure. For English and French we have used a linguistic methodology based on POS patterns, and for Dutch we have used a statistical methodology based on n-gram calculation and filtering with stop-words. For all languages, the TSR filtering method has been applied. We have obtained competitive results, but there is still room for improvement of the system.

Aligning Wikipedia with WordNet: a Review and Evaluation of Different Techniques
Antoni Oliver
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we explore techniques for aligning Wikipedia articles with WordNet synsets, their successful alignment being our main goal. We evaluate techniques that use the definitions and sense relations in WordNet and the text and categories in Wikipedia articles. The results we present are based on two evaluation strategies: one uses a new gold and silver standard (for which the creation process is explained); the other creates wordnets in other languages and then compares them with existing wordnets for those languages found in the Open Multilingual Wordnet project. A reliable alignment between WordNet and Wikipedia is a very valuable resource for the creation of new wordnets in other languages and for the development of existing wordnets. The evaluation of alignments between WordNet and lexical resources is a difficult and time-consuming task, but the evaluation strategy using the Open Multilingual Wordnet can be used as an automated evaluation measure to assess the quality of alignments between these two resources.

ReSiPC: a Tool for Complex Searches in Parallel Corpora
Antoni Oliver | Bojana Mikelenić
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, a tool specifically designed to allow for complex searches in large parallel corpora is presented. The formalism for the queries is very powerful as it uses standard regular expressions that allow for complex queries combining word forms, lemmata and POS-tags. As queries are performed over POS-tags, at least one of the languages in the parallel corpus should be POS-tagged. Searches can be performed in one of the languages or in both languages at the same time. The program is able to POS-tag the corpora using the Freeling analyzer through its Python API. ReSiPC is developed in Python version 3 and it is distributed under a free license (GNU GPL). The tool can be used to provide data for contrastive linguistics research and an example of use in a Spanish-Croatian parallel corpus is presented. ReSiPC is designed for queries in POS-tagged corpora, but it can be easily adapted for querying corpora containing other kinds of information.

PosEdiOn: Post-Editing Assessment in PythOn
Antoni Oliver | Sergi Alvarez | Toni Badia
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Post-editing of machine translation (PEMT) is currently in widespread use in the translation industry. This is due to the increase in the demand for translation and to the significant improvements in quality achieved by neural machine translation (NMT). PEMT has been included as part of the translation workflow because it increases translators’ productivity and it also reduces costs. Although effective post-editing requires MT output of sufficient quality, usual automatic metrics do not always correlate with post-editing effort. We describe a standalone tool designed both for industry and research that has two main purposes: collect sentence-level information from the post-editing process (e.g. post-editing time and keystrokes) and visually present multiple evaluation scores so they can be easily interpreted by a user.

Quantitative Analysis of Post-Editing Effort Indicators for NMT
Sergi Alvarez | Antoni Oliver | Toni Badia
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new machine translation paradigm, neural machine translation (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in the translation workflows currently implemented because it usually increases the fluency and accuracy of the MT output. However, usual automatic measurements do not always indicate the quality of the MT output and there is still no clear correlation between PE effort and productivity. We present a quantitative analysis of different PE effort indicators for two NMT systems (transformer and seq2seq) for English-Spanish in-domain medical documents. We compare both systems and study the correlation between PE time and other scores. Results show less PE effort for the transformer NMT model and a high correlation between PE time and keystrokes.

MTUOC: easy and free integration of NMT systems in professional translation environments
Antoni Oliver
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In this paper the MTUOC project, aiming to provide an easy integration of neural and statistical machine translation systems, is presented. Almost all the required software to train and use neural and statistical MT systems is released under free licences. However, its use is not always easy and intuitive, and medium-high specialized skills are required. The MTUOC project provides simplified scripts for preprocessing and training MT systems, and a server and client for easy use of the trained systems. The server is compatible with popular CAT tools for a seamless integration. The project also distributes some free engines.

INMIGRA3: building a case for NGOs and NMT
Celia Rico | María Del Mar Sánchez Ramos | Antoni Oliver
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

INMIGRA3 is a three-year project that builds on the work of two previous initiatives: INMIGRA2-CM and CRISIS-MT. Together, they address the specific needs of NGOs in multilingual settings with a particular interest in migratory contexts. Work on INMIGRA3 concentrates on the analysis of how NMT can best be put to use for the purposes of translating NGOs’ documentation.

Neural Metaphor Detection with a Residual biLSTM-CRF Model
Andrés Torres Rivera | Antoni Oliver | Salvador Climent | Marta Coll-Florit
Proceedings of the Second Workshop on Figurative Language Processing

In this paper we present a novel resource-inexpensive architecture for metaphor detection based on a residual bidirectional long short-term memory and conditional random fields. Current approaches to this task rely on deep neural networks to identify metaphorical words, using additional linguistic features or word embeddings. We evaluate our proposed approach using different model configurations that combine embeddings, part-of-speech tags, and semantically disambiguated synonym sets. This evaluation process was performed using the training and testing partitions of the VU Amsterdam Metaphor Corpus. We use this method of evaluation as reference to compare the results with other current neural approaches for this task that implement similar neural architectures and features, and that were evaluated using this corpus. Results show that our system achieves competitive results with a simpler architecture compared to previous approaches.

2019

Does NMT make a difference when post-editing closely related languages? The case of Spanish-Catalan
Sergi Alvarez | Antoni Oliver | Toni Badia
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

Further expansion of the Croatian WordNet
Krešimir Šojat | Matea Filko | Antoni Oliver
Proceedings of the 9th Global Wordnet Conference

In this paper a semi-automatic procedure for the expansion of the Croatian Wordnet (CroWN) is presented. An English-Croatian dictionary was used in order to translate monosemous PWN 3.0 English variants. The precision of the automatic process is low (about 30%), but the results proved valuable for the enlargement of CroWN. After manual validation, 10,884 new synset-variant pairs were added to CroWN, achieving a total of 62,075 synset-variant pairs.

2017

Morphological Analysis of the Dravidian Language Family
Arun Kumar | Ryan Cotterell | Lluís Padró | Antoni Oliver
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The Dravidian languages are one of the most widely spoken language families in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Additionally, we exploit novel features and higher-order models to set state-of-the-art results on these corpora on both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.

2016

Extending the WN-Toolkit: dealing with polysemous words in the dictionary-based strategy
Antoni Oliver
Proceedings of the 8th Global WordNet Conference (GWC)

In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation and expansion of wordnets using dictionary-based and parallel-corpus based strategies. In previous versions of the toolkit the dictionary-based strategy was only used for translating monosemous English variants. In the experiments we have used Omegawiki and Wiktionary and we present automatic evaluation results for 24 languages that have wordnets in the Open Multilingual Wordnet project. We have used these existing versions of the wordnet to perform an automatic evaluation.

2015

Learning Agglutinative Morphology of Indian Languages with Linguistically Motivated Adaptor Grammars
Arun Kumar | Lluís Padró | Antoni Oliver
Proceedings of the International Conference Recent Advances in Natural Language Processing

TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction
Antoni Oliver | Mercè Vàzquez
Proceedings of the International Conference Recent Advances in Natural Language Processing

Enlarging the Croatian WordNet with WN-Toolkit and Cro-Deriv
Antoni Oliver | Krešimir Šojat | Matea Srebačić
Proceedings of the International Conference Recent Advances in Natural Language Processing

Joint Bayesian Morphology Learning for Dravidian Languages
Arun Kumar | Lluís Padró | Antoni Oliver
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

2014

Automatic creation of WordNets from parallel corpora
Antoni Oliver | Salvador Climent
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to align WordNet synsets in the English part of the corpus with lemmata on the target language part of the corpus. The word alignment algorithm used in these experiments is a simple most frequent translation algorithm implemented in the WN-Toolkit. The obtained precision values are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading to very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as Giza++, Fast Align or Berkeley aligner is suggested.

WN-Toolkit: Automatic generation of WordNets following the expand model
Antoni Oliver
Proceedings of the Seventh Global Wordnet Conference

2008

Complete and Consistent Annotation of WordNet using the Top Concept Ontology
Javier Álvez | Jordi Atserias | Jordi Carrera | Salvador Climent | Egoitz Laparra | Antoni Oliver | German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet’s Interlingual Index (ILI), it can also be used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.

2007

A free terminology extraction suite
Antoni Oliver | Merce Vazquez
Proceedings of Translating and the Computer 29

2005

An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation
Toni Badia | Gemma Boleda | Maite Melero | Antoni Oliver
Workshop on example-based machine translation

2004

Enlarging the Croatian Morphological Lexicon by Automatic Lexical Acquisition from Raw Corpora
Antoni Oliver | Marko Tadić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon and we have used in our experiments a subset of the Croatian National Corpus. The methodology has proved to be efficient for those languages that, like Croatian, present a rich and mainly concatenative morphology. This method can be applied for the creation of new resources, as well as in the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which we do not have enough information in our corpus.

A Grammar and Style Checker Based on Internet Searches
Joaquim Moré | Salvador Climent | Antoni Oliver
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Customizing an MT system for unsupervised automatic email translation
Salvador Climent | Joaquim Moré | Antoni Oliver
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT

Automatic Lexical Acquisition from Raw Corpora: An Application to Russian
Antoni Oliver | Irene Castellón | Lluís Màrquez
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages