Antonio. Toral

Also published as: Antonio Toral

2021

pdf bib abs
Thank you BART! Rewarding Pre-Trained Models Improves Formality Style Transfer
Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Scarcity of parallel data causes formality style transfer models to have scarce success in preserving content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that target style and content –the two core aspects of the task– we achieve a new state-of-the-art.

pdf bib abs
Generic resources are what you need: Style transfer tasks without task-specific parallel training data
Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Style transfer aims to rewrite a source text in a different target style while preserving its content. We propose a novel approach to this task that leverages generic resources, and without using any task-specific parallel (source–target) data outperforms existing unsupervised approaches on the two most popular style transfer tasks: formality transfer and polarity swap. In practice, we adopt a multi-step procedure which builds on a generic pre-trained sequence-to-sequence model (BART). First, we strengthen the model’s ability to rewrite by further pre-training BART on both an existing collection of generic paraphrases, as well as on synthetic pairs created using a general-purpose lexical resource. Second, through an iterative back-translation approach, we train two models, each in a transfer direction, so that they can provide each other with synthetically generated pairs, dynamically in the training process. Lastly, we let our best resulting model generate static synthetic pairs to be used in a supervised training regime. Besides methodology and state-of-the-art results, a core contribution of this work is a reflection on the nature of the two tasks we address, and how their differences are highlighted by their response to our approach.

pdf bib abs
Unsupervised Translation of German–Lower Sorbian: Exploring Training and Novel Transfer Methods on a Low-Resource Language
Lukas Edman | Ahmet Üstün | Antonio Toral | Gertjan van Noord
Proceedings of the Sixth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2021 Unsupervised Machine Translation task for German–Lower Sorbian (DE–DSB): a high-resource language to a low-resource one. Our system uses a transformer encoder-decoder architecture in which we make three changes to the standard training procedure. First, our training focuses on two languages at a time, contrasting with a wealth of research on multilingual systems. Second, we introduce a novel method for initializing the vocabulary of an unseen language, achieving improvements of 3.2 BLEU for DE->DSB and 4.0 BLEU for DSB->DE.Lastly, we experiment with the order in which offline and online back-translation are used to train an unsupervised system, finding that using online back-translation first works better for DE->DSB by 2.76 BLEU. Our submissions ranked first (tied with another team) for DSB->DE and third for DE->DSB.

2020

pdf bib abs
Character-level Representations Improve DRS-based Semantic Parsing Even in the Age of BERT
Rik van Noord | Antonio Toral | Johan Bos
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We combine character-level and contextual language model representations to improve performance on Discourse Representation Structure parsing. Character representations can easily be added in a sequence-to-sequence model in either one encoder or as a fully separate encoder, with improvements that are robust to different language models, languages and data sets. For English, these improvements are larger than adding individual sources of linguistic information or adding non-contextual embeddings. A new method of analysis based on semantic tags demonstrates that the character-level representations improve performance across a subset of selected semantic phenomena.

pdf bib abs
Machine Translation for English–Inuktitut with Segmentation, Data Acquisition and Pre-Training
Christian Roest | Lukas Edman | Gosse Minnema | Kevin Kelly | Jennifer Spenader | Antonio Toral
Proceedings of the Fifth Conference on Machine Translation

Translating to and from low-resource polysynthetic languages present numerous challenges for NMT. We present the results of our systems for the English–Inuktitut language pair for the WMT 2020 translation tasks. We investigated the importance of correct morphological segmentation, whether or not adding data from a related language (Greenlandic) helps, and whether using contextual word embeddings improves translation. While each method showed some promise, the results are mixed.

pdf bib abs
Data Selection for Unsupervised Translation of German–Upper Sorbian
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the Fifth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.

pdf bib abs
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.

pdf bib abs
Fine-grained Human Evaluation of Transformer and Recurrent Approaches to Neural Machine Translation for English-to-Chinese
Yuying Ye | Antonio Toral
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This research presents a fine-grained human evaluation to compare the Transformer and recurrent approaches to neural machine translation (MT), on the translation direction English-to-Chinese. To this end, we develop an error taxonomy compliant with the Multidimensional Quality Metrics (MQM) framework that is customised to the relevant phenomena of this translation direction. We then conduct an error annotation using this customised error taxonomy on the output of state-of-the-art recurrent- and Transformer-based MT systems on a subset of WMT2019’s news test set. The resulting annotation shows that, compared to the best recurrent system, the best Transformer system results in a 31% reduction of the total number of errors and it produced significantly less errors in 10 out of 22 error categories. We also note that two of the systems evaluated do not produce any error for a category that was relevant for this translation direction prior to the advent of NMT systems: Chinese classifiers.

pdf bib abs
Reassessing Claims of Human Parity and Super-Human Performance in Machine Translation at WMT 2019
Antonio Toral
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We reassess the claims of human parity and super-human performance made at the news shared task of WMT2019 for three translation directions: English→German, English→Russian and German→English. First we identify three potential issues in the human evaluation of that shared task: (i) the limited amount of intersen- tential context available, (ii) the limited translation proficiency of the evaluators and (iii) the use of a reference transla- tion. We then conduct a modified eval- uation taking these issues into account. Our results indicate that all the claims of human parity and super-human perfor- mance made at WMT2019 should be re- futed, except the claim of human parity for English→German. Based on our findings, we put forward a set of recommendations and open questions for future assessments of human parity in machine translation.

2019

pdf bib abs
Linguistic Information in Neural Semantic Parsing with Multiple Encoders
Rik van Noord | Antonio Toral | Johan Bos
Proceedings of the 13th International Conference on Computational Semantics - Short Papers

Recently, sequence-to-sequence models have achieved impressive performance on a number of semantic parsing tasks. However, they often do not exploit available linguistic resources, while these, when employed correctly, are likely to increase performance even further. Research in neural machine translation has shown that employing this information has a lot of potential, especially when using a multi-encoder setup. We employ a range of semantic and syntactic resources to improve performance for the task of Discourse Representation Structure Parsing. We show that (i) linguistic features can be beneficial for neural semantic parsing and (ii) the best method of adding these features is by using multiple encoders.

pdf bib abs
The Effect of Translationese in Machine Translation Test Sets
Mike Zhang | Antonio Toral
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

The effect of translationese has been studied in the field of machine translation (MT), mostly with respect to training data. We study in depth the effect of translationese on test data, using the test sets from the last three editions of WMT’s news shared task, containing 17 translation directions. We show evidence that (i) the use of translationese in test sets results in inflated human evaluation scores for MT systems; (ii) in some cases system rankings do change and (iii) the impact translationese has on a translation direction is inversely correlated to the translation quality attainable by state-of-the-art MT systems for that direction.

pdf bib abs
Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
Antonio Toral | Lukas Edman | Galiya Yeshmagambetova | Jennifer Spenader
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the systems submitted by the University of Groningen to the English– Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsupervised and rule-based), given the agglutinative nature of Kazakh, (ii) data from two additional languages (Turkish and Russian), given the scarcity of English–Kazakh data and (iii) synthetic data, both for the source and for the target language. Our best submissions ranked second for Kazakh→English and third for English→Kazakh in terms of the BLEU automatic evaluation metric.

pdf bib
Post-editese: an Exacerbated Translationese
Antonio Toral
Proceedings of Machine Translation Summit XVII: Research Track

bib abs
Practical Statistics for Research in Machine Translation and Translation Studies
Antonio Toral
Proceedings of Machine Translation Summit XVII: Tutorial Abstracts

The tutorial will introduce a set of very useful statistical tests for conducting analyses in the research areas of Machine Translation (MT) and Translation Studies (TS). For each statistical test, the presenter will: 1) introduce it in the context of a common research example that pertains to the area of MT and/or TS 2) explain the technique behind the test and its assumptions 3) cover common pitfalls when the test is applied in research studies, and 4) conduct a hands-on activity so that attendees can put the knowledge acquired in practice straight-away. All examples and exercises will be in R. The following statistical tests will be covered: t-tests (both parametric and non-parametric), bootstrap resampling, Pearson and Spearman correlation coefficients, linear mixed-effects models.

2018

pdf bib abs
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation
Antonio Toral | Sheila Castilho | Ke Hu | Andy Way
Proceedings of the Third Conference on Machine Translation: Research Papers

We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.

pdf bib abs
Exploring Neural Methods for Parsing Discourse Representation Structures
Rik van Noord | Lasha Abzianidze | Antonio Toral | Johan Bos
Transactions of the Association for Computational Linguistics, Volume 6

Neural methods have had several recent successes in semantic parsing, though they have yet to face the challenge of producing meaning representations based on formal semantics. We present a sequence-to-sequence neural semantic parser that is able to produce Discourse Representation Structures (DRSs) for English sentences with high accuracy, outperforming traditional DRS parsers. To facilitate the learning of the output, we represent DRSs as a sequence of flat clauses and introduce a method to verify that produced DRSs are well-formed and interpretable. We compare models using characters and words as input and see (somewhat surprisingly) that the former performs better than the latter. We show that eliminating variable names from the output using De Bruijn indices increases parser performance. Adding silver training data boosts performance even further.

2017

pdf bib abs
A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions
Antonio Toral | Víctor M. Sánchez-Cartagena
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation paradigm. To that end, we conduct a multifaceted evaluation in which we compare outputs produced by state-of-the-art neural machine translation and phrase-based machine translation systems for 9 language directions across a number of dimensions. Specifically, we measure the similarity of the outputs, their fluency and amount of reordering, the effect of sentence length and performance across different error categories. We find out that translations produced by neural machine translation systems are considerably different, more fluent and more accurate in terms of word order compared to those produced by phrase-based systems. Neural machine translation systems are also more accurate at producing inflected forms, but they perform poorly when translating very long sentences.

2016

pdf bib
Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences
Víctor M. Sánchez-Cartagena | Antonio Toral
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Re-assessing the Impact of SMT Techniques with Human Evaluation: a Case Study on English—Croatian
Antonio Toral | Raphael Rubino | Gema Ramírez-Sánchez
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf bib abs
Using Wordnet to Improve Reordering in Hierarchical Phrase-Based Statistical Machine Translation
Arefeh Kazemi | Antonio Toral | Andy Way
Proceedings of the 8th Global WordNet Conference (GWC)

We propose the use of WordNet synsets in a syntax-based reordering model for hierarchical statistical machine translation (HPB-SMT) to enable the model to generalize to phrases not seen in the training data but that have equivalent meaning. We detail our methodology to incorporate synsets’ knowledge in the reordering model and evaluate the resulting WordNet-enhanced SMT systems on the English-to-Farsi language direction. The inclusion of synsets leads to the best BLEU score, outperforming the baseline (standard HPB-SMT) by 0.6 points absolute.

pdf bib
Abu-MaTran: automatic building of machine translation
Antonio Toral | Sergio Ortiz Rojas | Mikel Forcada | Nikola Lubesic | Prokopis Prokopidis
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

pdf bib abs
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair
Nikola Ljubešić | Miquel Esplà-Gomis | Antonio Toral | Sergio Ortiz Rojas | Filip Klubička
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.

pdf bib abs
Enhancing Cross-border EU E-commerce through Machine Translation: Needed Language Resources, Challenges and Opportunities
Meritxell Fernández Barrera | Vladimir Popescu | Antonio Toral | Federico Gaspari | Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper discusses the role that statistical machine translation (SMT) can play in the development of cross-border EU e-commerce,by highlighting extant obstacles and identifying relevant technologies to overcome them. In this sense, it firstly proposes a typology of e-commerce static and dynamic textual genres and it identifies those that may be more successfully targeted by SMT. The specific challenges concerning the automatic translation of user-generated content are discussed in detail. Secondly, the paper highlights the risk of data sparsity inherent to e-commerce and it explores the state-of-the-art strategies to achieve domain adequacy via adaptation. Thirdly, it proposes a robust workflow for the development of SMT systems adapted to the e-commerce domain by relying on inexpensive methods. Given the scarcity of user-generated language corpora for most language pairs, the paper proposes to obtain monolingual target-language data to train language models and aligned parallel corpora to tune and evaluate MT systems by means of crowdsourcing.

2015

pdf bib
Translating Literary Text between Related Languages using SMT
Antonio Toral | Andy Way
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

pdf bib
Dependency-based Reordering Model for Constituent Pairs in Hierarchical SMT
Arefeh Kazemi | Antonio Toral | Andy Way | Amirhassan Monadjemi | Mohammadali Nematbakhsh
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Dependency-based Reordering Model for Constituent Pairs in Hierarchical SMT
Arefeh Kazemiy | Antonio Toral | Andy Way | Amirhassan Monadjemiy
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

pdf bib abs
Perception vs. reality: measuring machine translation post-editing productivity
Federico Gaspari | Antonio Toral | Sudip Kumar Naskar | Declan Groves | Andy Way
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

This paper presents a study of user-perceived vs real machine translation (MT) post-editing effort and productivity gains, focusing on two bidirectional language pairs: English—German and English—Dutch. Twenty experienced media professionals post-edited statistical MT output and also manually translated comparative texts within a production environment. The paper compares the actual post-editing time against the users’ perception of the effort and time required to post-edit the MT output to achieve publishable quality, thus measuring real (vs perceived) productivity gains. Although for all the language pairs users perceived MT post-editing to be slower, in fact it proved to be a faster option than manual translation for two translation directions out of four, i.e. for Dutch to English, and (marginally) for English to German. For further objective scrutiny, the paper also checks the correlation of three state-of-the-art automatic MT evaluation metrics (BLEU, METEOR and TER) with the actual post-editing time.

pdf bib abs
TLAXCALA: a multilingual corpus of independent news
Antonio Toral
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We acquire corpora from the domain of independent news from the Tlaxcala website. We build monolingual corpora for 15 languages and parallel corpora for all the combinations of those 15 languages. These corpora include languages for which only very limited such resources exist (e.g. Tamazight). We present the acquisition process in detail and we also present detailed statistics of the produced corpora, concerning mainly quantitative dimensions such as the size of the corpora per language (for the monolingual corpora) and per language pair (for the parallel corpora). To the best of our knowledge, these are the first publicly available parallel and monolingual corpora for the domain of independent news. We also create models for unsupervised sentence splitting for all the languages of the study.

pdf bib abs
Quality Estimation for Synthetic Parallel Data Generation
Raphael Rubino | Antonio Toral | Nikola Ljubešić | Gema Ramírez-Sánchez
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English―Croatian version of the Europarl parallel corpus based on the English―Slovene Europarl corpus and the Apertium rule-based translation system for Slovene―Croatian. These experiments are to be considered as a first step towards the generation of reliable synthetic parallel data for under-resourced languages. We first collect small amounts of aligned parallel data for the Slovene―Croatian language pair in order to build a quality estimation system for sentence-level Translation Edit Rate (TER) estimation. We then infer TER scores on automatically translated Slovene to Croatian sentences and use the best translations to build an English―Croatian statistical MT system. We show significant improvement in terms of automatic metrics obtained on two test sets using our approach compared to a random selection of synthetic parallel data.

pdf bib abs
caWaC – A web corpus of Catalan and its application to language modeling and machine translation
Nikola Ljubešić | Antonio Toral
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the construction process of a web corpus of Catalan built from the content of the .cat top-level domain. For collecting and processing data we use the Brno pipeline with the spiderling crawler and its accompanying tools. To the best of our knowledge the corpus represents the largest existing corpus of Catalan containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counts 166 million words. We evaluate the resulting resource on the tasks of language modeling and statistical machine translation (SMT) by calculating LM perplexity and incorporating the LM in the SMT pipeline. We compare language models trained on different subsets of the resource with those trained on the Catalan Wikipedia and the target side of the parallel data used to train the SMT system.

pdf bib
Is machine translation ready for literature
Antonio Toral | Andy Way
Proceedings of Translating and the Computer 36

pdf bib
Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain
Antonio Toral | Raphael Rubino | Miquel Esplà-Gomis | Tommi Pirinen | Andy Way | Gema Ramírez-Sánchez
Proceedings of the 17th Annual conference of the European Association for Machine Translation

pdf bib
Active Learning for Post-Editing Based Incrementally Retrained MT
Aswarth Abhilash Dara | Josef van Genabith | Qun Liu | John Judge | Antonio Toral
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf bib
A Web Application for the Diagnostic Evaluation of Machine Translation over Specific Linguistic Phenomena
Antonio Toral | Sudip Kumar Naskar | Joris Vreeke | Federico Gaspari | Declan Groves
Proceedings of the 2013 NAACL HLT Demonstration Session

pdf bib
Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity
Antonio Toral
Proceedings of the Second Workshop on Hybrid Approaches to Translation

pdf bib
Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation
Sudip Kumar Naskar | Antonio Toral | Federico Gaspari | Declan Groves
Proceedings of Machine Translation Summit XIV: Papers

pdf bib
MosesCore: Moses Open Source Evaluation and Support Co-ordination for OutReach and Exploitation PANACEA: Platform for Automatic, Normalised Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies
Nuria Bel | Marc Poch | Antonio Toral
Proceedings of Machine Translation Summit XIV: European projects

pdf bib
PANACEA: Platform for Automatic, Normalised Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies
Nuria Bel | Marc Poch | Antonio Toral
Proceedings of Machine Translation Summit XIV: European projects

2012

pdf bib
A Diagnostic Evaluation Approach Targeting MT Systems for Indian Languages
Renu Balyan | Sudip Kumar Naskar | Antonio Toral | Niladri Chatterjee
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

pdf bib
Topic Modeling-based Domain Adaptation for System Combination
Tsuyoshi Okita | Antonio Toral | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf bib
Efficiency-based evaluation of aligners for industrial applications
Antonio. Toral | Marc Poch | Pavel Pecina | Gregor Thurmair
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study
Pavel Pecina | Antonio Toral | Vassilis Papavassiliou | Prokopis Prokopidis | Josef van Genabith
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Pivot-based Machine Translation between Statistical and Black Box systems
Antonio Toral
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Simple and Effective Parameter Tuning for Domain Adaptation of Statistical Machine Translation
Pavel Pecina | Antonio Toral | Josef van Genabith
Proceedings of COLING 2012

pdf bib
Language Resources Factory: case study on the acquisition of Translation Memories
Marc Poch | Antonio Toral | Núria Bel
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs
Towards a User-Friendly Platform for Building Language Resources based on Web Services
Marc Poch | Antonio Toral | Olivier Hamon | Valeria Quochi | Núria Bel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the platform developed in the PANACEA project, a distributed factory that automates the stages involved in the acquisition, production, updating and maintenance of Language Resources required by Machine Translation and other Language Technologies. We adopt a set of tools that have been successfully used in the Bioinformatics field, they are adapted to the needs of our field and used to deploy web services, which can be combined to build more complex processing chains (workflows). This paper describes the platform and its different components (web services, registry, workflows, social network and interoperability). We demonstrate the scalability of the platform by carrying out a set of massive data experiments. Finally, a validation of the platform across a set of required criteria proves its usability for different types of users (non-technical users and providers).

2011

pdf bib
A Comparative Evaluation of Research vs. Online MT Systems
Antonio Toral | Federico Gaspari | Sudip Kumar Naskar | Andy Way
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Towards a User-Friendly Webservice Architecture for Statistical Machine Translation in the PANACEA project
Antonio Toral | Pavel Pecina | Marc Poch | Andy Way
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation
Pavel Pecina | Antonio Toral | Andy Way | Vassilis Papavassiliou | Prokopis Prokopidis | Maria Giagkou
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
A Framework for Diagnostic Evaluation of MT Based on Linguistic Checkpoints
Sudip Kumar Naskar | Antonio Toral | Federico Gaspari | Andy Way
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
An Open-Source Finite State Morphological Transducer for Modern Standard Arabic
Mohammed Attia | Pavel Pecina | Antonio Toral | Lamia Tounsi | Josef van Genabith
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

pdf bib abs
Automatic acquisition of named entities for rule-based machine translation
Antonio Toral | Andy Way
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper proposes to enrich RBMT dictionaries with Named Entities (NEs) automatically acquired from Wikipedia. The method is applied to the Apertium English–Spanish system and its performance compared to that of Apertium with and without handtagged NEs. The system with automatic NEs outperforms the one without NEs, while results vary when compared to a system with handtagged NEs (results are comparable for Spanish→English but slightly worst for English→Spanish). Apart from that, adding automatic NEs contributes to decreasing the amount of unknown terms by more than 10%.

pdf bib abs
An Italian to Catalan RBMT system reusing data from existing language pairs
Antonio Toral | Mireia Ginestí-Rosell | Francis Tyers
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper presents an Italian→Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.

2010

pdf bib abs
An Automatically Built Named Entity Lexicon for Arabic
Mohammed Attia | Antonio Toral | Lamia Tounsi | Monica Monachini | Josef van Genabith
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWNs instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the LMF ISO standard. We conduct a quantitative and qualitative evaluation against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold.

pdf bib
Automatic Extraction of Arabic Multiword Expressions
Mohammed Attia | Antonio Toral | Lamia Tounsi | Pavel Pecina | Josef van Genabith
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

2009

pdf bib
SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain
Eneko Agirre | Oier Lopez de Lacalle | Christiane Fellbaum | Andrea Marchetti | Antonio Toral | Piek Vossen
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

pdf bib
A Study on Linking Wikipedia Categories to Wordnet Synsets using Text Similarity
Antonio Toral | Óscar Ferrández | Eneko Agirre | Rafael Muñoz
Proceedings of the International Conference RANLP-2009

2008

pdf bib abs
Named Entity WordNet
Antonio Toral | Rafael Muñoz | Monica Monachini
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the automatic extension of Princeton WordNet with Named Entities (NEs). This new resource is called Named Entity WordNet. Our method maps the noun is-a hierarchy of WordNet to Wikipedia categories, identifies the NEs present in the latter and extracts different information from them such as written variants, definitions, etc. This information is inserted into a NE repository. A module that converts from this generic repository to the WordNet specific format has been developed. The paper explores different aspects of our methodology such as the treatment of polysemous terms, the identification of hyponyms within the Wikipedia categorization system, the identification of Wikipedia articles which are NEs and the design of a NE repository compliant with the LMF ISO standard. So far, this procedure enriches WordNet with 310,742 NEs and 381,043 instance of relations.

EVALITA 2007, the first edition of the initiative devoted to the evaluation of Natural Language Processing tools for Italian, provided a shared framework where participants systems had the possibility to be evaluated on five different tasks, namely Part of Speech Tagging (organised by the University of Bologna), Parsing (organised by the University of Torino), Word Sense Disambiguation (organised by CNR-ILC, Pisa), Temporal Expression Recognition and Normalization (organised by CELCT, Trento), and Named Entity Recognition (organised by FBK, Trento). We believe that the diffusion of shared tasks and shared evaluation practices is a crucial step towards the development of resources and tools for Natural Language Processing. Experiences of this kind, in fact, are a valuable contribution to the validation of existing models and data, allowing for consistent comparisons among approaches and among representation schemes. The good response obtained by EVALITA, both in the number of participants and in the quality of results, showed that pursuing such goals is feasible not only for English, but also for other languages.

pdf bib abs
More Semantic Links in the SIMPLE-CLIPS Database
Nilda Ruimy | Antonio Toral
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Notwithstanding its acknowledged richness, the SIMPLE semantic model does not offer the representational vocabulary for encoding some conceptual links holding between events and their participants and among co-participants in events. Although critical for boosting performance in many NLP application tasks, such deep lexical information is therefore only partially encoded in the SIMPLE-CLIPS Italian semantic database. This paper reports on the enrichment of the SIMPLE relation set by some expressive means, namely semantic relations, borrowed from the EuroWordNet model and their implementation in the SIMPLE-CLIPS lexicon. The original situation existing in the database, as to the expression of this type of information is described and the loan descriptive vocabulary presented. Strategies based on the exploitation of the source lexicon data were adopted to induce new information: a wide range of semantic - but also syntactic - information was investigated for singling out word senses candidate to be linked by the new relations. The lexicon enrichment by 5,000 new relations instantiated so far has therefore been carried out as a largely automated, low-effort and cost-free process, with no heavy human intervention. The redundancy set off by such an extension of information is being addressed by the implementation of inheritance in the SIMPLE-CLIPS database (Del Gratta et al., 2008).

pdf bib abs
Simple-Clips ongoing research: more information with less data by implementing inheritance
Riccardo Del Gratta | Nilda Ruimy | Antonio Toral
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the application of inheritance to the formal taxonomy (is-a) of a semantically rich Language Resource based on the Generative Lexicon theory, SIMPLE-CLIPS. The aim is to lighten the representation of its semantic layer by reducing the number of encoded relations. A prediction calculation on the impact of introducing inheritance regarding space occupancy is carried out, yielding a significant space reduction of 22%. This is corroborated by its actual application, which reduces the number of explicitly encoded relations in this lexicon by 18.4%. Later on, we study the issues that inheritance poses to the Language Resources, and discuss sensitive solutions to tackle each of them, including examples. Finally, we present a discussion on the application of inheritance, from which two side effect advantages arise: consistency enhancement and inference capabilities.