Cross-lingual transfer has improved greatly through multilingual language model pretraining, reducing the need for parallel data and increasing absolute performance. However, this progress has also brought to light the differences in performance across languages: certain language families and typologies consistently perform worse in these models. In this paper, we examine the effect of morphological typology on zero-shot cross-lingual transfer for two tasks: part-of-speech (POS) tagging and sentiment analysis. We perform experiments on 19 languages from four morphological typologies (fusional, isolating, agglutinative, and introflexive) and find that transferring to a different morphological type generally incurs a larger performance loss than transferring to another language of the same type. Furthermore, POS tagging is more sensitive to morphological typology than sentiment analysis, and on this task models perform much better on fusional languages than on the other typologies.
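As a concrete illustration of the zero-shot setup, the sketch below fine-tunes nothing and simply shows the inference step: a multilingual encoder whose classification head would, in practice, have been trained on source-language (e.g., English) POS data is applied unchanged to a target-language sentence. The model name and tag inventory are standard choices assumed for illustration, not necessarily the paper's exact configuration.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Universal Dependencies POS tag inventory (17 tags)
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# In the zero-shot setting the classification head is first fine-tuned on
# source-language POS data; here it is randomly initialised for brevity.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(UPOS))

sentence = "La casa és bonica"  # target-language input, unseen during training
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)
for tok, tag in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                    logits.argmax(-1)[0].tolist()):
    print(tok, UPOS[tag])
```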
Emotion intensity prediction determines the degree or intensity of an emotion expressed by the author of a text, extending previous categorical approaches to emotion detection. While most previous work on this topic has concentrated on English texts, other languages would also benefit from fine-grained emotion classification, preferably without having to recreate the amount of annotated data available in English for each new language. Consequently, we explore cross-lingual transfer approaches for fine-grained emotion detection in Spanish and Catalan tweets. To this end, we annotate a test set of Spanish and Catalan tweets using Best-Worst Scaling. We compare six cross-lingual approaches, e.g., machine translation and cross-lingual embeddings, with parallel-data requirements ranging from millions of parallel sentences to completely unsupervised. The results show that on this data, methods with low parallel-data requirements surprisingly outperform methods that use more parallel data, which we explain through an in-depth error analysis. We make the dataset and the code available at https://github.com/jerbarnes/fine-grained_cross-lingual_emotion.
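For readers unfamiliar with Best-Worst Scaling, the sketch below (with invented toy annotations) shows the standard counting procedure for turning best/worst judgments into real-valued intensity scores; the paper's exact aggregation may differ in detail.

```python
# Each trial shows an annotator 4 tweets and asks which expresses the MOST
# and LEAST intense emotion. A tweet's score is then
# (#times chosen best - #times chosen worst) / #appearances, in [-1, 1].
from collections import Counter

# toy trials: (items_shown, chosen_best, chosen_worst) -- illustrative data only
trials = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t5"), "t2", "t3"),
    (("t1", "t3", "t4", "t5"), "t1", "t5"),
]

best, worst, seen = Counter(), Counter(), Counter()
for items, b, w in trials:
    best[b] += 1
    worst[w] += 1
    seen.update(items)

scores = {t: (best[t] - worst[t]) / seen[t] for t in seen}
for t, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{t}: {s:+.2f}")
```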
Post-editing of machine translation (PEMT) is now widely used in the translation industry, owing to the growing demand for translation and to the significant quality improvements achieved by neural machine translation (NMT). PEMT has been included in the translation workflow because it increases translators' productivity and reduces costs. However, although effective post-editing requires sufficient MT output quality, standard automatic metrics do not always correlate with post-editing effort. We describe a standalone tool, designed for both industry and research, that has two main purposes: to collect sentence-level information from the post-editing process (e.g., post-editing time and keystrokes) and to visually present multiple evaluation scores so they can be easily interpreted by a user.
The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new machine translation paradigm, neural machine translation (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in current translation workflows because it usually increases the fluency and accuracy of the MT output. However, standard automatic metrics do not always indicate the quality of the MT output, and there is still no clear correlation between PE effort and productivity. We present a quantitative analysis of different PE effort indicators for two NMT systems (transformer and seq2seq) on English-Spanish in-domain medical documents. We compare both systems and study the correlation between PE time and other scores. Results show less PE effort for the transformer NMT model and a high correlation between PE time and keystrokes.
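The correlation analysis itself is standard; a minimal sketch, with made-up numbers in place of the collected measurements:

```python
# Pearson's r between sentence-level post-editing time and keystroke counts.
# All values below are invented for illustration; the paper uses its own
# per-sentence measurements.
from scipy.stats import pearsonr

pe_time_sec = [12.4, 30.1, 8.7, 45.0, 22.3]  # hypothetical PE times per sentence
keystrokes = [15, 48, 9, 70, 31]             # hypothetical keystroke counts

r, p = pearsonr(pe_time_sec, keystrokes)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```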
Attention-based deep learning systems have been shown to be the state-of-the-art approach for aspect-level sentiment analysis. However, end-to-end deep neural networks lack flexibility: one cannot easily adjust the network to fix an obvious problem when more training data is unavailable, e.g., a model that always predicts positive when it sees the word disappointed. Moreover, it is less often noted that the attention mechanism is likely to “over-focus” on particular parts of a sentence while ignoring positions that provide key information for judging the polarity. In this paper, we describe a simple yet effective approach to leverage lexicon information so that the model becomes more flexible and robust. We also explore the effect of regularizing attention vectors to allow the network to have a broader “focus” on different parts of the sentence. The experimental results demonstrate the effectiveness of our approach.
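The abstract does not spell out the regularizer; one common way to encourage a broader attention “focus” is an entropy-based penalty, sketched below as an assumed formulation rather than the paper's exact method.

```python
import torch

def attention_entropy_penalty(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """attn: (batch, seq_len) attention weights, each row summing to 1."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # high entropy = broad focus
    return -entropy.mean()  # minimizing this term maximizes attention entropy

attn = torch.softmax(torch.randn(4, 10), dim=-1)  # dummy attention weights
task_loss = torch.tensor(0.5)                     # stand-in for the model's task loss
lam = 0.1                                         # regularization strength (hypothetical)
total_loss = task_loss + lam * attention_entropy_penalty(attn)
print(total_loss.item())
```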
Cross-lingual sentiment classification (CLSC) seeks to use resources from a source language to detect sentiment and classify text in a target language. Almost all research into CLSC has been carried out at the sentence or document level, although this granularity is often too coarse. This paper explores methods for performing aspect-based cross-lingual sentiment classification (aspect-based CLSC) for under-resourced languages. Given how limited parallel data is for many languages, we would like to make the most of this resource for our task. We compare zero-shot learning, bilingual word embeddings, stacked denoising autoencoder representations, and machine translation techniques for aspect-based CLSC, each of which requires a different amount of parallel data. We show that models based on distributed semantics can achieve results comparable to machine translation on aspect-based CLSC, and we give an analysis of the errors found for each method.
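As an illustration of the bilingual-word-embedding family of methods compared here, the sketch below learns an orthogonal mapping between embedding spaces from a seed dictionary using the orthogonal Procrustes solution, one standard choice; this is assumed for illustration and is not necessarily the exact method used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # source embeddings of seed-dictionary words (toy data)
Y = rng.normal(size=(100, 50))  # aligned target-language embeddings (toy data)

# Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W orthogonal,
# solved via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W  # source vectors projected into the target embedding space;
                # a source-trained classifier can then be applied to target text
print(np.linalg.norm(mapped - Y))  # alignment residual on the seed dictionary
```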
We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora annotated for opinion expressions across genres (news reports, blogs, product reviews, and tweets) and covering multiple languages (English, Spanish, Catalan, and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in languages other than English and to build a unifying annotation framework for analyzing opinion in different genres, including controlled text, such as news reports, as well as different types of user-generated content (UGC). Given the broad design of the resource, most of the annotation effort was carried out on crowdsourcing platforms: Amazon Mechanical Turk and CrowdFlower. This created an excellent opportunity to investigate the feasibility of crowdsourcing methods for annotating large amounts of text in different languages.
This paper presents a methodology for the design and implementation of user-centred language checking applications. The methodology is based on the separation of three critical aspects of this kind of application: functional purpose (educational or corrective goal), types of warning messages, and the linguistic resources and computational techniques used. We argue that to ensure a user-centred design there must be a clear-cut division between the error typology underlying the system and the software architecture. The methodology described has been used to implement two different user-driven spell, grammar, and style checkers for Catalan. We argue that this separation is an issue often neglected in commercial applications and highlight the benefits of such a methodology for the scalability of language checking applications. We evaluate our application in terms of recall, precision, and noise, and compare it to the only other grammar checker for Catalan that we are aware of.
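A minimal sketch of these evaluation measures, with invented counts; note that defining "noise" as false alarms per word checked is our assumption here, and the paper's exact definition may differ.

```python
# Toy evaluation of a language checker against a gold-annotated test corpus.
true_errors = 120    # gold-standard errors in the corpus (made-up number)
flags = 110          # warnings raised by the checker (made-up number)
correct_flags = 90   # warnings that match a gold error (made-up number)
words = 10_000       # corpus size in words

recall = correct_flags / true_errors   # share of real errors the checker catches
precision = correct_flags / flags      # share of warnings that are justified
noise = (flags - correct_flags) / words  # false alarms per word (assumed definition)

print(f"recall={recall:.2f} precision={precision:.2f} noise={noise:.4f}")
```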
We show here the viability of rapidly deploying a new language pair within the METIS architecture. To do so, we have benefited from the approach of our existing Spanish-English system, which is particularly generation-intensive. In contrast to other SMT or EBMT systems, the METIS architecture allows us to forgo parallel texts, which are hard to obtain for many language pairs, such as Catalan-English. In this experiment, we successfully built a Catalan-English prototype by simply plugging a POS tagger for Catalan and a bilingual Catalan-English dictionary into the English generation component already developed for other language pairs.
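A toy illustration (all data invented) of this plug-and-play idea: tag the Catalan input, look up each (word, POS) pair in a bilingual dictionary, and hand the resulting target-language candidates to the existing English generation component, which selects and orders among them.

```python
# Tagged Catalan input and a tiny bilingual dictionary, both invented.
tagged = [("la", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
bidict = {
    ("la", "DET"): ["the"],
    ("casa", "NOUN"): ["house", "home"],
    ("blanca", "ADJ"): ["white"],
}

# Dictionary lookup produces candidate sets per position; the generation
# component is then responsible for selection and word order.
candidates = [bidict.get(pair, ["<unk>"]) for pair in tagged]
print(candidates)  # [['the'], ['house', 'home'], ['white']]
```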
In this paper we describe the METIS-II system and its evaluation on each of its language pairs: Dutch, German, Greek, and Spanish to English. METIS-II set out to develop a data-driven approach that requires no parallel corpus, no full parser, and no extensive rule sets. We describe the evaluation on a development test set and on a test set drawn from Europarl, and compare our results with SYSTRAN. We also provide further analysis, investigating the impact of the number and source of the reference translations and breaking down the results by test text type. As expected, the results are lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.
In this paper we describe a machine translation prototype that uses only minimal resources for both the source and the target language. The only resources the system needs are a shallow source-language analysis, a translation dictionary, a system for mapping source-language phenomena onto the target language, and a target-language corpus for generation. Several approaches are presented.