Olatz Perez-De-Viñaspre

Also published as: Olatz Perez-de-Viñaspre


2022

Unsupervised Machine Translation in Real-World Scenarios
Ona de Gibert Bonet | Iakes Goenaga | Jordi Armengol-Estapé | Olatz Perez-de-Viñaspre | Carla Parra Escartín | Marina Sanchez | Mārcis Pinnis | Gorka Labaka | Maite Melero
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present the work carried out in the MT4All CEF project and the resources that it has generated by leveraging recent research in the field of unsupervised learning. In the course of the project, 18 monolingual corpora for specific domains and languages have been collected, and 12 bilingual dictionaries and translation models have been generated. As part of the research, the unsupervised MT methodology based only on monolingual corpora (Artetxe et al., 2017) has been tested on a variety of languages and domains. Results show that in specialised domains, when there is enough monolingual in-domain data, unsupervised results are comparable to those of general domain supervised translation, and that, at any rate, unsupervised techniques can be used to boost results whenever very little data is available.

BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions
Nayla Escribano | Jon Ander Gonzalez | Julen Orbegozo-Terradillos | Ainara Larrondo-Ureta | Simón Peña-Fernández | Olatz Perez-de-Viñaspre | Rodrigo Agerri
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Parliamentary transcripts provide a valuable resource to understand the reality and know about the most important facts that occur over time in our societies. Furthermore, the political debates captured in these transcripts facilitate research on political discourse from a computational social science perspective. In this paper we release the first version of a newly compiled corpus from Basque parliamentary transcripts. The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish. We enrich the corpus with metadata related to relevant attributes of the speakers and speeches (language, gender, party...) and process the text to obtain named entities and lemmas. The obtained metadata is then used to perform a detailed corpus analysis which provides interesting insights about the language use of the Basque political representatives across time, parties and gender.

Does Corpus Quality Really Matter for Low-Resource Languages?
Mikel Artetxe | Itziar Aldabe | Rodrigo Agerri | Olatz Perez-de-Viñaspre | Aitor Soroa
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and other factors like corpus size and domain coverage can play a more important role.

Comparing and combining tagging with different decoding algorithms for back-translation in NMT: learnings from a low resource scenario
Xabier Soto | Olatz Perez-De-Viñaspre | Gorka Labaka | Maite Oronoz
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Back-translation is a well-established approach to improve the performance of Neural Machine Translation (NMT) systems when large monolingual corpora of the target language and domain are available. Recently, diverse approaches have been proposed to get better automatic evaluation results of NMT models using back-translation, including the use of sampling instead of beam search as the decoding algorithm for creating the synthetic corpus. Alternatively, it has been proposed to append a tag to the back-translated corpus to help the NMT system distinguish the synthetic bilingual corpus from the authentic one. However, not all the combinations of the previous approaches have been tested, and thus it is not clear which is the best approach for developing a given NMT system. In this work, we empirically compare and combine existing techniques for back-translation in a real low-resource setting: the translation of clinical notes from Basque into Spanish. Apart from automatically evaluating the MT systems, we ask bilingual healthcare workers to perform a human evaluation, and analyze the different synthetic corpora by measuring their lexical diversity (LD). For reproducibility and generalizability, we repeat our experiments for German to English translation using public data. The results suggest that in lower-resource scenarios tagging only helps when using sampling for decoding, in contradiction with the previous literature using bigger corpora from the news domain. When fine-tuning with a few thousand bilingual in-domain sentences, one of our proposed methods (tagged restricted sampling) obtains the best results both in terms of automatic and human evaluation. We will publish the code upon acceptance.

2021

Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set
Lana Yeganova | Dina Wiemann | Mariana Neves | Federica Vezzani | Amy Siu | Inigo Jauregi Unanue | Maite Oronoz | Nancy Mah | Aurélie Névéol | David Martinez | Rachel Bawden | Giorgio Maria Di Nunzio | Roland Roller | Philippe Thomas | Cristian Grozea | Olatz Perez-de-Viñaspre | Maika Vicente Navarro | Antonio Jimeno Yepes
Proceedings of the Sixth Conference on Machine Translation

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.

2020

Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages
Rachel Bawden | Giorgio Maria Di Nunzio | Cristian Grozea | Inigo Jauregi Unanue | Antonio Jimeno Yepes | Nancy Mah | David Martinez | Aurélie Névéol | Mariana Neves | Maite Oronoz | Olatz Perez-de-Viñaspre | Massimo Piccardi | Roland Roller | Amy Siu | Philippe Thomas | Federica Vezzani | Maika Vicente Navarro | Dina Wiemann | Lana Yeganova
Proceedings of the Fifth Conference on Machine Translation

Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional language pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.

Ixamed’s submission description for WMT20 Biomedical shared task: benefits and limitations of using terminologies for domain adaptation
Xabier Soto | Olatz Perez-de-Viñaspre | Gorka Labaka | Maite Oronoz
Proceedings of the Fifth Conference on Machine Translation

In this paper we describe the systems developed at Ixa for our participation in the WMT20 Biomedical shared task in three language pairs: en-eu, en-es and es-en. When defining our approach, we have focused on making efficient use of corpora recently compiled for training Machine Translation (MT) systems to translate Covid-19 related text, as well as reusing previously compiled corpora and developed systems for the biomedical or clinical domain. Regarding the techniques used, we build on the findings of our previous work on translating clinical texts into Basque, making use of clinical terminology for adapting the MT systems to the clinical domain. However, after manually inspecting some of the outputs generated by our systems, for most of the submissions we end up using the system trained only with the basic corpus, since the systems including the clinical terminologies generated outputs shorter in length than the corresponding references. Thus, we present simple baselines for translating abstracts between English and Spanish (en/es); while for translating abstracts and terms from English into Basque (en-eu), we concatenate the best en-es system for each kind of text with our es-eu system. We present automatic evaluation results in terms of BLEU scores, and analyse the effect of including clinical terminology on the average sentence length of the generated outputs. Following the recent recommendations for a responsible use of GPUs for NLP research, we include an estimation of the generated CO2 emissions, based on the power consumed for training the MT systems.

2019

Leveraging SNOMED CT terms and relations for machine translation of clinical texts from Basque to Spanish
Xabier Soto | Olatz Perez-De-Viñaspre | Maite Oronoz | Gorka Labaka
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

2016

IXA Biomedical Translation System at WMT16 Biomedical Translation Task
Olatz Perez-de-Viñaspre | Gorka Labaka
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2014

Translating SNOMED CT Terminology into a Minor Language
Olatz Perez-de-Viñaspre | Maite Oronoz
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)

2013

A Finite-State Approach to Translate SNOMED CT Terms into Basque Using Medical Prefixes and Suffixes
Olatz Perez-de-Viñaspre | Maite Oronoz | Manex Agirrezabal | Mikel Lersundi
Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing