Cristina España-Bonet

Also published as: Cristina Espana-Bonet

2022

pdf abs
Exploiting Social Media Content for Self-Supervised Style Transfer
Dana Ruiter | Thomas Kleinbauer | Cristina España-Bonet | Josef van Genabith | Dietrich Klakow
Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media

Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of selfsupervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet been explored for style transfer. In this paper we present a novel Self-Supervised Style Transfer (3ST) model, which augments SSNMT with UNMT methods in order to identify and efficiently exploit supervisory signals in non-parallel social media posts. We compare 3ST with state-of-the-art (SOTA) style transfer models across civil rephrasing, formality and polarity tasks. We show that 3ST is able to balance the three major objectives (fluency, content preservation, attribute transfer accuracy) the best, outperforming SOTA models on averaged performance across their tested tasks in automatic and human evaluation.

pdf abs
Towards Debiasing Translation Artifacts
Koel Dutta Chowdhury | Rricha Jalota | Cristina España-Bonet | Josef Genabith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Cross-lingual natural language processing relies on translation, either by humans or machines, at different levels, from translating training data to translating test sets. However, compared to original texts in the same language, translations possess distinct qualities referred to as translationese. Previous research has shown that these translation artifacts influence the performance of a variety of cross-lingual tasks. In this work, we propose a novel approach to reducing translationese by extending an established bias-removal technique. We use the Iterative Null-space Projection (INLP) algorithm, and show by measuring classification accuracy before and after debiasing, that translationese is reduced at both sentence and word level. We evaluate the utility of debiasing translationese on a natural language inference (NLI) task, and show that by reducing this bias, NLI accuracy improves. To the best of our knowledge, this is the first study to debias translationese as represented in latent embedding space.

2021

This paper presents the results of the newstranslation task, the multilingual low-resourcetranslation for Indo-European languages, thetriangular translation task, and the automaticpost-editing task organised as part of the Con-ference on Machine Translation (WMT) 2021.In the news task, participants were asked tobuild machine translation systems for any of10 language pairs, to be evaluated on test setsconsisting mainly of news stories. The taskwas also opened up to additional test suites toprobe specific aspects of translation.

pdf abs
Tracing Source Language Interference in Translation with Graph-Isomorphism Measures
Koel Dutta Chowdhury | Cristina España-Bonet | Josef van Genabith
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Previous research has used linguistic features to show that translations exhibit traces of source language interference and that phylogenetic trees between languages can be reconstructed from the results of translations into the same language. Recent research has shown that instances of translationese (source language interference) can even be detected in embedding spaces, comparing embeddings spaces of original language data with embedding spaces resulting from translations into the same language, using a simple Eigenvector-based divergence from isomorphism measure. To date, it remains an open question whether alternative graph-isomorphism measures can produce better results. In this paper, we (i) explore Gromov-Hausdorff distance, (ii) present a novel spectral version of the Eigenvector-based method, and (iii) evaluate all approaches against a broad linguistic typological database (URIEL). We show that language distances resulting from our spectral isomorphism approaches can reproduce genetic trees on a par with previous work without requiring any explicit linguistic information and that the results can be extended to non-Indo-European languages. Finally, we show that the methods are robust under a variety of modeling conditions.

pdf
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
Yuri Bizzoni | Elke Teich | Cristina España-Bonet | Josef van Genabith
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

pdf
Do not Rely on Relay Translations: Multilingual Parallel Direct Europarl
Kwabena Amponsah-Kaakyire | Daria Pylypenko | Cristina España-Bonet | Josef van Genabith
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

Massively multilingual machine translation (MT) has shown impressive capabilities and including zero and few-shot translation between low-resource language pairs. However and these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper and we present MENYO-20k and the first multi-domain parallel corpus with a especially curated orthography for Yoruba–English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality and we also analyze the effect of diacritics and a major characteristic of Yoruba and in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability.Our models outperform massively multilingual models such as Google (+8.7 BLEU) and Facebook M2M (+9.1) when translating to Yoruba and setting a high quality benchmark for future research.

pdf abs
Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
Dana Ruiter | Dietrich Klakow | Josef van Genabith | Cristina España-Bonet
Proceedings of Machine Translation Summit XVIII: Research Track

For most language combinations and parallel data is either scarce or simply unavailable. To address this and unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising and while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To this date and the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that including UMT techniques into SSNMT significantly outperforms SSNMT (up to +4.3 BLEU and af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related and distantly-related and unrelated language pairs.

This paper presents an overview of AVASAG; an ongoing applied-research project developing a text-to-sign-language translation system for public services. We describe the scientific innovation points (geometry-based SL-description, 3D animation and video corpus, simplified annotation scheme, motion capture strategy) and the overall translation pipeline.

pdf abs
Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification
Daria Pylypenko | Kwabena Amponsah-Kaakyire | Koel Dutta Chowdhury | Josef van Genabith | Cristina España-Bonet
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse the neural architectures in order to investigate how well the hand-crafted features explain the variance in the neural models’ predictions. We use pre-trained neural word embeddings, as well as several end-to-end neural architectures in both monolingual and multilingual settings and compare them to feature-engineering-based SVM classifiers. We show that (i) neural architectures outperform other approaches by more than 20 accuracy points, with the BERT-based model performing the best in both the monolingual and multilingual settings; (ii) while many individual hand-crafted translationese features correlate with neural model predictions, feature importance analysis shows that the most important features for neural and classical architectures differ; and (iii) our multilingual experiments provide empirical evidence for translationese universals across languages.

2020

pdf abs
Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi
Jesujoba Alabi | Kwabena Amponsah-Kaakyire | David Adelani | Cristina España-Bonet
Proceedings of the Twelfth Language Resources and Evaluation Conference

The success of several architectures to learn semantic representations from unannotated text and the availability of these kind of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and test sets to evaluate on. For low-resourced languages, the evaluation is more difficult and normally ignored, with the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting holds in the low-resourced setting too. In this paper we focus on two African languages, Yorùbá and Twi, and compare the word embeddings obtained in this way, with word embeddings obtained from curated corpora and a language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information which showed to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yorùbá and Twi. We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yorùbá. As output of the work, we provide corpora, embeddings and the test suits for both languages.

pdf abs
GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
Marta R. Costa-jussà | Pau Li Lin | Cristina España-Bonet
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.

pdf abs
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
Dana Ruiter | Josef van Genabith | Cristina España-Bonet
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to do so, the model self-selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) performing a denoising curriculum. We observe that the dynamics of the mutual-supervision signals of both system internal representation types are vital for the extraction and translation performance. We show that in terms of the Gunning-Fog Readability index, SSNMT starts extracting and learning from Wikipedia data suitable for high school students and quickly moves towards content suitable for first year undergraduate students.

pdf abs
Understanding Translationese in Multi-view Embedding Spaces
Koel Dutta Chowdhury | Cristina España-Bonet | Josef van Genabith
Proceedings of the 28th International Conference on Computational Linguistics

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data — words, parts of speech, semantic tags and synsets — to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

pdf abs
How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech
Yuri Bizzoni | Tom S Juzek | Cristina España-Bonet | Koel Dutta Chowdhury | Josef van Genabith | Elke Teich
Proceedings of the 17th International Conference on Spoken Language Translation

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we – (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs machine) rather than to the data (written vs spoken).

pdf abs
Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction
Marta R. Costa-jussà | Cristina España-Bonet | Pascale Fung | Noah A. Smith
Computational Linguistics, Volume 46, Issue 2 - June 2020

We introduce the Computational Linguistics special issue on Multilingual and Interlingual Semantic Representations for Natural Language Processing. We situate the special issue’s five articles in the context of our fast-changing field, explaining our motivation for this project. We offer a brief summary of the work in the issue, which includes developments on lexical and sentential semantic representations, from symbolic and neural perspectives.

2019

pdf abs
Analysing Coreference in Transformer Outputs
Ekaterina Lapshinova-Koltunski | Cristina España-Bonet | Josef van Genabith
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

pdf abs
Context-Aware Neural Machine Translation Decoding
Eva Martínez Garcia | Carles Creus | Cristina España-Bonet
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

This work presents a decoding architecture that fuses the information from a neural translation model and the context semantics enclosed in a semantic space language model based on word embeddings. The method extends the beam search decoding process and therefore can be applied to any neural machine translation framework. With this, we sidestep two drawbacks of current document-level systems: (i) we do not modify the training process so there is no increment in training time, and (ii) we do not require document-level an-notated data. We analyze the impact of the fusion system approach and its parameters on the final translation quality for English–Spanish. We obtain consistent and statistically significant improvements in terms of BLEU and METEOR and we observe how the fused systems are able to handle synonyms to propose more adequate translations as well as help the system to disambiguate among several translation candidates for a word.

pdf abs
Self-Supervised Neural Machine Translation
Dana Ruiter | Cristina España-Bonet | Josef van Genabith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal NMT representations. This is done in a self-supervised way without parallel data, in such a way that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data for training.

pdf abs
UdS-DFKI Participation at WMT 2019: Low-Resource (en-gu) and Coreference-Aware (en-de) Systems
Cristina España-Bonet | Dana Ruiter
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the UdS-DFKI submission to the WMT2019 news translation task for Gujarati–English (low-resourced pair) and German–English (document-level evaluation). Our systems rely on the on-line extraction of parallel sentences from comparable corpora for the first scenario and on the inclusion of coreference-related information in the training data in the second one.

2017

pdf abs
Lump at SemEval-2017 Task 1: Towards an Interlingua Semantic Similarity
Cristina España-Bonet | Alberto Barrón-Cedeño
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This is the Lump team participation at SemEval 2017 Task 1 on Semantic Textual Similarity. Our supervised model relies on features which are multilingual or interlingual in nature. We include lexical similarities, cross-language explicit semantic analysis, internal representations of multilingual neural networks and interlingual word embeddings. Our representations allow to use large datasets in language pairs with many instances to better classify instances in smaller language pairs avoiding the necessity of translating into a single language. Hence we can deal with all the languages in the task: Arabic, English, Spanish, and Turkish.

pdf abs
Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017
Cristina España-Bonet | Josef van Genabith
Proceedings of the 14th International Conference on Spoken Language Translation

This paper describes the UdS-DFKI participation to the multilingual task of the IWSLT Evaluation 2017. Our approach is based on factored multilingual neural translation systems following the small data and zero-shot training conditions. Our systems are designed to fully exploit multilinguality by including factors that increase the number of common elements among languages such as phonetic coarse encodings and synsets, besides shallow part-of-speech tags, stems and lemmas. Document level information is also considered by including the topic of every document. This approach improves a baseline without any additional factor for all the language pairs and even allows beyond-zero-shot translation. That is, the translation from unseen languages is possible thanks to the common elements —especially synsets in our models— among languages.

pdf abs
Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation
Pranava Swaroop Madhyastha | Cristina España-Bonet
Proceedings of the 2nd Workshop on Representation Learning for NLP

We propose a simple log-bilinear softmax-based model to deal with vocabulary expansion in machine translation. Our model uses word embeddings trained on significantly large unlabelled monolingual corpora and learns over a fairly small, word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. When tested over an out-of-domain testset, we get a significant improvement of 3.9 BLEU points.

2016

pdf
The TALP–UPC Spanish–English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System
Marta R. Costa-jussà | Cristina España-Bonet | Pranava Madhyastha | Carlos Escolano | José A. R. Fonollosa
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

2015

pdf
Document-Level Machine Translation with Word Vector Models
Eva Martinez Garcia | Cristina Espana-Bonet | Lluis Marquez
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño | Cristina España-Bonet | Josu Boldoba | Lluís Màrquez
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

pdf
Document-Level Machine Translation with Word Vector Models
Eva Martínez Garcia | Cristina España-Bonet | Lluís Màrquez
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

pdf
Word’s Vector Representations meet Machine Translation
Eva Martínez Garcia | Jörg Tiedemann | Cristina España-Bonet | Lluís Màrquez
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

2013

In this paper, we present a brief snapshot of the state of affairs in computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making a better and more efficient use of the already existing resources and tools, by bridging the gap between research and market, and by establishing periodical meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in putting together a fair representation of the research in the area, and received attention from both the industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a harvesting procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.