Andy Way


2023

Filtering Matters: Experiments in Filtering Training Sets for Machine Translation
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and if separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of different approaches, both manually and on a downstream NMT task. We find that, first, it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, there is indeed a statistically significant difference. Finally, we find that the same training sets do not seem to suit different translation directions.

2022

Achievements of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian
Petra Bago | Sheila Castilho | Jane Dunne | Federico Gaspari | Andre K | Gauti Kristmannsson | Jon Arild Olsen | Natalia Resende | Níels Rúnar Gíslason | Dana D. Sheridan | Páraic Sheridan | John Tinsley | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper provides an overview of the main achievements of the completed PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focused on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which are severely low-resource languages, especially for building effective machine translation (MT) systems. We report the achievements of the project, primarily, in terms of the large amounts of data collected for all four low-resource languages and of promoting the uptake of neural MT (NMT) for these languages.

Overview of the ELE Project
Itziar Aldabe | Jane Dunne | Aritz Farwell | Owen Gallagher | Federico Gaspari | Maria Giagkou | Jan Hajic | Jens Peter Kückens | Teresa Lynn | Georg Rehm | German Rigau | Katrin Marheinecke | Stelios Piperidis | Natalia Resende | Tea Vojtěchová | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper provides an overview of the ongoing European Language Equality (ELE) project, an 18-month action funded by the European Commission which involves 52 partners. The primary goal of ELE is to prepare the European Language Equality Programme, in the form of a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality (DLE) in Europe by 2030.

Developing Machine Translation Engines for Multilingual Participatory Spaces
Pintu Lohar | Guodong Xie | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

It is often a challenging task to build Machine Translation (MT) engines for a specific domain due to the lack of parallel data in that area. In this project, we develop a range of MT systems for 6 European languages (English, German, Italian, French, Polish and Irish) in all directions and in two domains (environment and economics).

Introducing the Digital Language Equality Metric: Technological Factors
Federico Gaspari | Owen Gallagher | Georg Rehm | Maria Giagkou | Stelios Piperidis | Jane Dunne | Andy Way
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference

This paper introduces the concept of Digital Language Equality (DLE) developed by the EU-funded European Language Equality (ELE) project, and describes the associated DLE Metric with a focus on its technological factors (TFs), which are complemented by situational contextual factors. This work aims at objectively describing the level of technological support of all European languages and lays the foundation to implement a large-scale EU-wide programme to ensure that these languages can continue to exist and prosper in the digital age, to serve the present and future needs of their speakers. The paper situates this ongoing work with a strong European focus in the broader context of related efforts, and explains how the DLE Metric can help track the progress towards DLE for all languages of Europe, focusing in particular on the role played by the TFs. These are derived from the European Language Grid (ELG) Catalogue, which provides the empirical basis to measure the level of digital readiness of all European languages. The DLE Metric scores can be consulted through an online interactive dashboard to show the level of technological support of each European language and track the overall progress toward DLE.

gaHealth: An English–Irish Bilingual Corpus of Health Data
Séamus Lankford | Haithem Afli | Órla Ní Loinsigh | Andy Way
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation. The benefits and development of smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English to Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource data sets. gaHealth is now freely available online and is ready to be explored for further research.

Domain-Specific Text Generation for Machine Translation
Yasmin Moslem | Rejwanul Haque | John Kelleher | Andy Way
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly-specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we used the state-of-the-art MT architecture, Transformer. We employed mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, our proposed methods achieved improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.

Knowledge Distillation for Sustainable Neural Machine Translation
Wandri Jooste | Andy Way | Rejwanul Haque | Riccardo Superbo
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

Knowledge distillation (KD) can be used to reduce model size and training time, without significant loss in performance. However, the process of distilling knowledge requires translation of sizeable data sets, and the translation is usually performed using large cumbersome models (teacher models). Producing such translations for KD is expensive in terms of both time and cost, which is a significant concern for translation service providers. On top of that, this process can be the cause of higher carbon footprints. In this work, we tested different variants of a teacher model for KD, tracked the power consumption of the GPUs used during translation, recorded overall translation time, estimated translation cost, and measured the accuracy of the student models. The findings of our investigation demonstrate to the translation industry a cost-effective, high-quality alternative to the standard KD training methods.

Translation Word-Level Auto-Completion: What Can We Achieve Out of the Box?
Yasmin Moslem | Rejwanul Haque | Andy Way
Proceedings of the Seventh Conference on Machine Translation (WMT)

Research on Machine Translation (MT) has achieved important breakthroughs in several areas. While there is much more to be done in order to build on this success, we believe that the language industry needs better ways to take full advantage of current achievements. Due to a combination of factors, including time, resources, and skills, businesses tend to apply pragmatism into their AI workflows. Hence, they concentrate more on outcomes, e.g. delivery, shipping, releases, and features, and adopt high-level working production solutions, where possible. Among the features thought to be helpful for translators are sentence-level and word-level translation auto-suggestion and auto-completion. Suggesting alternatives can inspire translators and limit their need to refer to external resources, which hopefully boosts their productivity. This work describes our submissions to WMT’s shared task on word-level auto-completion, for the Chinese-to-English, English-to-Chinese, German-to-English, and English-to-German language directions. We investigate the possibility of using pre-trained models and out-of-the-box features from available libraries. We employ random sampling to generate diverse alternatives, which reveals good results. Furthermore, we introduce our open-source API, based on CTranslate2, to serve translations, auto-suggestions, and auto-completions.

Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches
Steinþór Steingrímsson | Luke O’Brien | Finnur Ingimundarson | Hrafn Loftsson | Andy Way
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, doing separate evaluations for each method, dataset and confidence class where it can be calculated. The results are validated by human experts, working with a random sample from all our experiments. By combining the most promising approaches and data sets, using confidence scores calculated from the data and the results of manually evaluating samples from our manual evaluation as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how multiple different combinations generate lists with well over 90% acceptance rate, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.

2021

On Machine Translation of User Reviews
Maja Popović | Alberto Poncelas | Marija Brkic | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This work investigates neural machine translation (NMT) systems for translating English user reviews into Croatian and Serbian, two similar morphologically complex languages. Two types of reviews are used for testing the systems: IMDb movie reviews and Amazon product reviews. Two types of training data are explored: large out-of-domain bilingual parallel corpora, as well as a small synthetic in-domain parallel corpus obtained by machine translation of monolingual English Amazon reviews into the target languages. Both automatic scores and human evaluation show that using the synthetic in-domain corpus together with a selected subset of out-of-domain data is the best option. Separate results on IMDb and Amazon reviews indicate that MT systems perform differently on different review types, so user reviews generally should not be considered a homogeneous genre. Nevertheless, more detailed research on a larger number of different reviews covering different domains/topics is needed to fully understand these differences.

Transformers for Low-Resource Languages: Is Féidir Linn!
Seamus Lankford | Haithem Afli | Andy Way
Proceedings of Machine Translation Summit XVIII: Research Track

The Transformer model is the state-of-the-art in Machine Translation. However, in general, neural translation models often underperform on language pairs with insufficient training data. As a consequence, relatively few experiments have been carried out using this architecture on low-resource language pairs. In this study, hyperparameter optimization of Transformer models in translating the low-resource English-Irish language pair is evaluated. We demonstrate that choosing appropriate parameters leads to considerable performance improvements. Most importantly, the correct choice of subword model is shown to be the biggest driver of translation performance. SentencePiece models using both unigram and BPE approaches were appraised. Variations on model architectures included modifying the number of layers, testing various regularization techniques and evaluating the optimal number of heads for attention. A generic 55k DGT corpus and an in-domain 88k public admin corpus were used for evaluation. A Transformer optimized model demonstrated a BLEU score improvement of 7.8 points when compared with a baseline RNN model. Improvements were observed across a range of metrics, including TER, indicating a substantially reduced post editing effort for Transformer optimized models with 16k BPE subword models. Benchmarked against Google Translate, our translation engines demonstrated significant improvements. The question of whether or not Transformers can be used effectively in a low-resource setting of English-Irish translation has been addressed. Is féidir linn - yes we can.


Building MT systems in low resourced languages for Public Sector users in Croatia, Iceland, Ireland, and Norway
Róisín Moran | Carla Para Escartín | Akshai Ramesh | Páraic Sheridan | Jane Dunne | Federico Gaspari | Sheila Castilho | Natalia Resende | Andy Way
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

When developing Machine Translation engines, low resourced language pairs tend to be in a disadvantaged position: less available data means that developing robust MT models can be more challenging. The EU-funded PRINCIPLE project aims at overcoming this challenge for four low resourced European languages: Norwegian, Croatian, Irish and Icelandic. This presentation will give an overview of the project, with a focus on the set of Public Sector users and their use cases for which we have developed MT solutions. We will discuss the range of language resources that have been gathered through contributions from public sector collaborators, and present the extensive evaluations that have been undertaken, including significant user evaluation of MT systems across all of the public sector participants in each of the four countries involved.

Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021
Seamus Lankford | Haithem Afli | Andy Way
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid related data, from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.

DELA Corpus - A Document-Level Corpus Annotated with Context-Related Issues
Sheila Castilho | João Lucas Cavalheiro Camargo | Miguel Menezes | Andy Way
Proceedings of the Sixth Conference on Machine Translation

Recently, the Machine Translation (MT) community has become more interested in document-level evaluation especially in light of reactions to claims of “human parity”, since examining the quality at the level of the document rather than at the sentence level allows for the assessment of suprasentential context, providing a more reliable evaluation. This paper presents a document-level corpus annotated in English with context-aware issues that arise when translating from English into Brazilian Portuguese, namely ellipsis, gender, lexical ambiguity, number, reference, and terminology, with six different domains. The corpus can be used as a challenge test set for evaluation and as a training/testing corpus for MT as well as for deep linguistic analysis of context issues. To the best of our knowledge, this is the first corpus of its kind.

CombAlign: a Tool for Obtaining High-Quality Word Alignments
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Being able to generate accurate word alignments is useful for a variety of tasks. While statistical word aligners can work well, especially when parallel training data are plentiful, multilingual embedding models have recently been shown to give good results in unsupervised scenarios. We evaluate an ensemble method for word alignment on four language pairs and demonstrate that by combining multiple tools, taking advantage of their different approaches, substantial gains can be made. This holds for settings ranging from very low-resource to high-resource. Furthermore, we introduce a new gold alignment test set for Icelandic and a new easy-to-use tool for creating manual word alignments.

Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches
Steinþór Steingrímsson | Pintu Lohar | Hrafn Loftsson | Andy Way
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English–Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.

2020

Arabisc: Context-Sensitive Neural Spelling Checker
Yasmin Moslem | Rejwanul Haque | Andy Way
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

Traditional statistical approaches to spelling correction usually consist of two consecutive processes — error detection and correction — and they are generally computationally intensive. Current state-of-the-art neural spelling correction models usually attempt to correct spelling errors directly over an entire sentence, which, as a consequence, lacks control of the process, e.g. they are prone to overcorrection. In recent years, recurrent neural networks (RNNs), in particular long short-term memory (LSTM) hidden units, have proven increasingly popular and powerful models for many natural language processing (NLP) problems. Accordingly, we made use of a bidirectional LSTM language model (LM) for our context-sensitive spelling detection and correction model which is shown to have much control over the correction process. While the use of LMs for spelling checking and correction is not new to this line of NLP research, our proposed approach makes better use of the rich neighbouring context, not only from before the word to be corrected, but also after it, via a dual-input deep LSTM network. Although in theory our proposed approach can be applied to any language, we carried out our experiments on Arabic, which we believe adds additional value given the fact that there are limited linguistic resources readily available in Arabic in comparison to many languages. Our experimental results demonstrate that the proposed methods are effective in both improving the quality of correction suggestions and minimising overcorrection.

Neural Machine Translation for translating into Croatian and Serbian
Maja Popović | Alberto Poncelas | Marija Brkic | Andy Way
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

In this work, we systematically investigate different set-ups for training of neural machine translation (NMT) systems for translation into Croatian and Serbian, two closely related South Slavic languages. We explore English and German as source languages, different sizes and types of training corpora, as well as bilingual and multilingual systems. We also explore translation of English IMDb user movie reviews, a domain/genre where only monolingual data are available. First, our results confirm that multilingual systems with joint target languages perform better. Furthermore, translation performance from English is much better than from German, partly because German is morphologically more complex and partly because the corpus consists mostly of parallel human translations instead of original text and its human translation. The translation from German should be further investigated systematically. For translating user reviews, creating synthetic in-domain parallel data through back- and forward-translation and adding them to a small out-of-domain parallel corpus can yield performance comparable with a system trained on a full out-of-domain corpus. However, it is still not clear what is the optimal size of synthetic in-domain data, especially for forward-translated data where the target language is machine translated. More detailed research including manual evaluation and analysis is needed in this direction.

Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation
Xabier Soto | Dimitar Shterionov | Alberto Poncelas | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.

ELRI: A Decentralised Network of National Relay Stations to Collect, Prepare and Share Language Resources
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | José Luis Fonseca | Patricia Fonseca | Paulo Vale | Jane Dunne | Federico Gaspari | Teresa Lynn | Helen McHugh | Andy Way | Victoria Arranz | Khalid Choukri | Hervé Pusset | Alexandre Sicard | Rui Neto | Maite Melero | David Perez | António Branco | Ruben Branco | Luís Gomes
Proceedings of the 1st International Workshop on Language Technology Platforms

We describe the European Language Resource Infrastructure (ELRI), a decentralised network to help collect, prepare and share language resources. The infrastructure was developed within a project co-funded by the Connecting Europe Facility Programme of the European Union, and has been deployed in the four Member States participating in the project, namely France, Ireland, Portugal and Spain. ELRI provides sustainable and flexible means to collect and share language resources via National Relay Stations, to which members of public institutions can freely subscribe. The infrastructure includes fully automated data processing engines to facilitate the preparation, sharing and wider reuse of useful language resources that can help optimise human and automated translation services in the European Union.

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

On Context Span Needed for Machine Translation Evaluation
Sheila Castilho | Maja Popović | Andy Way
Proceedings of the Twelfth Language Resources and Evaluation Conference

Despite increasing efforts to improve evaluation of machine translation (MT) by going beyond the sentence level to the document level, the definition of what exactly constitutes a “document level” is still not clear. This work deals with the context span necessary for a more reliable MT evaluation. We report results from a series of surveys involving three domains and 18 target languages designed to identify the necessary context span as well as issues related to it. Our findings indicate that, despite the fact that some issues and spans are strongly dependent on domain and on the target language, a number of common patterns can be observed so that general guidelines for context-aware MT evaluation can be drawn.

Identifying Complaints from Product Reviews: A Case Study on Hindi
Raghvendra Pratap Singh | Rejwanul Haque | Mohammed Hasanuzzaman | Andy Way
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Automatic recognition of customer complaints on products or services that they purchase can be crucial for the organisations, multinationals and online retailers since they can exploit this information to fulfil their customers’ expectations including managing and resolving the complaints. Recently, researchers have applied supervised learning strategies to automatically identify users’ complaints expressed in English on Twitter. The downside of these approaches is that they require labeled training data for learning, which is expensive to create. This poses a barrier for them being applied to low-resource languages and domains for which task-specific data is not available. Machine translation (MT) can be used as an alternative to the tools that require such task-specific data. In this work, we use state-of-the-art neural MT (NMT) models for translating Hindi reviews into English and investigate performance of the downstream classification task (complaints identification) on their English translations.

pdf
Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task
Rejwanul Haque | Yasmin Moslem | Andy Way
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

This paper describes the ADAPT Centre’s submission to the Adap-MT 2020 AI Translation Shared Task for English-to-Hindi. The neural machine translation (NMT) systems that we built to translate AI domain texts are state-of-the-art Transformer models. In order to improve the translation quality of our NMT systems, we made use of both in-domain and out-of-domain data for training and employed different fine-tuning techniques for adapting our NMT systems to this task, e.g. mixed fine-tuning and on-the-fly self-training. For this, we mined parallel sentence pairs and monolingual sentences from large out-of-domain data, and the mining process was facilitated through automatic extraction of terminology from the in-domain data. This paper outlines the experiments we carried out for this task and reports the performance of our NMT systems on the evaluation test set.

pdf
The ADAPT System Description for the WMT20 News Translation Task
Venkatesh Parthasarathy | Akshai Ramesh | Rejwanul Haque | Andy Way
Proceedings of the Fifth Conference on Machine Translation

This paper describes the ADAPT Centre’s submissions to the WMT20 News translation shared task for English-to-Tamil and Tamil-to-English. We present our machine translation (MT) systems that were built using the state-of-the-art neural MT (NMT) model, Transformer. We applied various strategies in order to improve our baseline MT systems, e.g. monolingual sentence selection for creating synthetic training data, mining monolingual sentences for adapting our MT systems to the task, and hyperparameter search for Transformer in low-resource scenarios. Our experiments show that adding the aforementioned techniques to the baseline yields an excellent performance in the English-to-Tamil and Tamil-to-English translation tasks.

pdf
The ADAPT’s Submissions to the WMT20 Biomedical Translation Task
Prashant Nayak | Rejwanul Haque | Andy Way
Proceedings of the Fifth Conference on Machine Translation

This paper describes the ADAPT Centre’s submissions to the WMT20 Biomedical Translation Shared Task for English-to-Basque. We present the machine translation (MT) systems that were built to translate scientific abstracts and terms from biomedical terminologies, using the state-of-the-art neural MT (NMT) model, Transformer. In order to improve our baseline NMT system, we employ a number of methods, e.g. “pseudo” parallel data selection, monolingual data selection for synthetic corpus creation, mining monolingual sentences for adapting our NMT systems to this task, and hyperparameter search for Transformer in low-resource scenarios. Our experiments show that systematic addition of the aforementioned techniques to the baseline yields an excellent performance in the English-to-Basque translation task.

pdf
A Tool for Facilitating OCR Postediting in Historical Documents
Alberto Poncelas | Mohammad Aboomar | Jan Buts | James Hadley | Andy Way
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.

pdf
Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality
Alberto Poncelas | Jan Buts | James Hadley | Andy Way
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.

pdf
Investigating Low-resource Machine Translation for English-to-Tamil
Akshai Ramesh | Venkatesh Balavadhani parthasa | Rejwanul Haque | Andy Way
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Statistical machine translation (SMT), which was the dominant paradigm in machine translation (MT) research for nearly three decades, has recently been superseded by end-to-end deep learning approaches to MT. Although deep neural models produce state-of-the-art results in many translation tasks, they are found to under-perform in resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and neural MT (NMT) on a rarely-tested low-resource language-pair, English-to-Tamil, taking a specialised data domain (software localisation) into consideration. In particular, we produce rankings of our MT systems via a social media platform-based human evaluation scheme, and demonstrate our findings in the low-resource domain-specific text translation task.

pdf
Multiple Segmentations of Thai Sentences for Neural Machine Translation
Alberto Poncelas | Wichaya Pidchamook | Chao-Hong Liu | James Hadley | Andy Way
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Thai is a low-resource language, so it is often the case that data is not available in sufficient quantities to train a Neural Machine Translation (NMT) model which performs to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds more complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English–Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai, as training data for NMT model training. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that, by combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool.

pdf
Modelling Source- and Target- Language Syntactic Information as Conditional Context in Interactive Neural Machine Translation
Kamal Kumar Gupta | Rejwanul Haque | Asif Ekbal | Pushpak Bhattacharyya | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In interactive machine translation (MT), human translators correct errors in automatic translations in collaboration with the MT systems, which is seen as an effective way to improve the productivity gain in translation. In this study, we model source-language syntactic constituency parse and target-language syntactic descriptions in the form of supertags as conditional context for interactive prediction in neural MT (NMT). We found that the supertags significantly improve productivity gain in translation in interactive-predictive NMT (INMT), while syntactic parsing is found to be somewhat effective in reducing human effort in translation. Furthermore, when we model this source- and target-language syntactic information together as the conditional context, both types complement each other and our fully syntax-informed INMT model statistically significantly reduces human effort in a French-to-English translation task, achieving 4.30 points absolute (corresponding to 9.18% relative) improvement in terms of word prediction accuracy (WPA) and 4.84 points absolute (corresponding to 9.01% relative) reduction in terms of word stroke ratio (WSR) over the baseline.

pdf
MT syntactic priming effects on L2 English speakers
Natália Resende | Benjamin Cowan | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In this paper, we tested 20 Brazilian Portuguese speakers at intermediate and advanced English proficiency levels to investigate the influence of Google Translate’s MT system on the mental processing of English as a second language. To this end, we employed a syntactic priming experimental paradigm using a pretest-priming design which allowed us to compare participants’ linguistic behaviour before and after a translation task using Google Translate. Results show that, after performing a translation task with Google Translate, participants more frequently described images in English using the syntactic alternative previously seen in the output of Google Translate, compared to the translation task with no prior influence of the MT output. Results also show that this syntactic priming effect is modulated by English proficiency levels.

pdf
A human evaluation of English-Irish statistical and neural machine translation
Meghan Dowling | Sheila Castilho | Joss Moorkens | Teresa Lynn | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

With official status in both Ireland and the EU, there is a need for high-quality English-Irish (EN-GA) machine translation (MT) systems which are suitable for use in a professional translation environment. While we have seen recent research on improving both statistical MT and neural MT for the EN-GA pair, the results of such systems have always been reported using automatic evaluation metrics. This paper provides the first human evaluation study of EN-GA MT using professional translators and in-domain (public administration) data for a more accurate depiction of the translation quality available via MT.

pdf
Progress of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian
Andy Way | Petra Bago | Jane Dunne | Federico Gaspari | Andre Kåsen | Gauti Kristmannsson | Helen McHugh | Jon Arild Olsen | Dana Davis Sheridan | Páraic Sheridan | John Tinsley
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper updates the progress made on the PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focuses on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which have been identified as low-resource languages, especially for building effective machine translation (MT) systems. We report initial achievements of the project and ongoing activities aimed at promoting the uptake of neural MT for the low-resource languages of the project.

pdf
MTrill project: Machine Translation impact on language learning
Natália Resende | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Over the last decades, massive research investments have been made in the development of machine translation (MT) systems (Gupta and Dhawan, 2019). This has brought about a paradigm shift in the performance of these language tools, leading to widespread use of popular MT systems (Gaspari and Hutchins, 2007). Although the first MT engines were used for gisting purposes, in recent years, there has been an increasing interest in using MT tools, especially the freely available online MT tools, for language teaching and learning (Clifford et al., 2013). The literature on MT and Computer Assisted Language Learning (CALL) shows that, over the years, MT systems have been facilitating language teaching and also language learning (Niño, 2006). It has been shown that MT tools can increase awareness of grammatical linguistic features of a foreign language. Research also shows the positive role of MT systems in the development of writing skills in English as well as in improving communication skills in English (Garcia and Pena, 2011). However, to date, the cognitive impact of MT on language acquisition and on the syntactic aspects of language processing has not yet been investigated and deserves further scrutiny. The MTrill project aims at filling this gap in the literature by examining whether MT is contributing to a central aspect of language acquisition: the so-called language binding, i.e., the ability to combine single words properly in a grammatical sentence (Heyselaar et al., 2017; Ferreira and Bock, 2006). The project focuses on the initial stages (pre-intermediate and intermediate) of the acquisition of English syntax by Brazilian Portuguese native speakers using MT systems as a support for language learning.

pdf
Constraining the Transformer NMT Model with Heuristic Grid Beam Search
Guodong Xie | Andy Way
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf
The Impact of Indirect Machine Translation on Sentiment Classification
Alberto Poncelas | Pintu Lohar | James Hadley | Andy Way
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)


A Case Study of Natural Gender Phenomena in Translation: A Comparison of Google Translate, Bing Microsoft Translator and DeepL for English to Italian, French and Spanish
Argentina Anna Rescigno | Johanna Monti | Andy Way | Eva Vanmassenhove
Workshop on the Impact of Machine Translation (iMpacT 2020)

pdf
The ADAPT Centre’s Participation in WAT 2020 English-to-Odia Translation Task
Prashanth Nayak | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

This paper describes the ADAPT Centre submissions to WAT 2020 for the English-to-Odia translation task. We present the approaches that we followed to try to build competitive machine translation (MT) systems for English-to-Odia. Our approaches include monolingual data selection for creating synthetic data and identifying optimal sets of hyperparameters for the Transformer in a low-resource scenario. Our best MT system produces 4.96 BLEU points on the evaluation test set in the English-to-Odia translation task.

pdf
The ADAPT Centre’s Neural MT Systems for the WAT 2020 Document-Level Translation Task
Wandri Jooste | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

In this paper we describe the ADAPT Centre’s submissions to the WAT 2020 document-level Business Scene Dialogue (BSD) Translation task. We only consider translating from Japanese to English for this task and we use the MarianNMT toolkit to train Transformer models. In order to improve the translation quality, we made use of both in-domain and out-of-domain data for training our Machine Translation (MT) systems, as well as various data augmentation techniques for fine-tuning the model parameters. This paper outlines the experiments we ran to train our systems and reports the accuracy achieved through these various experiments.

pdf
An Error-based Investigation of Statistical and Neural Machine Translation Performance on Hindi-to-Tamil and English-to-Tamil
Akshai Ramesh | Venkatesh Balavadhani Parthasa | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

Statistical machine translation (SMT) was the state-of-the-art in machine translation (MT) research for more than two decades, but has since been superseded by neural MT (NMT). Despite producing state-of-the-art results in many translation tasks, neural models underperform in resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and NMT on two rarely-tested low-resource language-pairs, English-to-Tamil and Hindi-to-Tamil, taking a specialised data domain (software localisation) into consideration. This paper demonstrates our findings including the identification of several issues of the current neural approaches to low-resource domain-specific text translation.

pdf
The ADAPT System Description for the STAPLE 2020 English-to-Portuguese Translation Task
Rejwanul Haque | Yasmin Moslem | Andy Way
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the ADAPT Centre’s submission to STAPLE (Simultaneous Translation and Paraphrase for Language Education) 2020, a shared task of the 4th Workshop on Neural Generation and Translation (WNGT), for the English-to-Portuguese translation task. In this shared task, the participants were asked to produce high-coverage sets of plausible translations given English prompts (input source sentences). We present our English-to-Portuguese machine translation (MT) models that were built applying various strategies, e.g. data and sentence selection, monolingual MT for generating alternative translations, and combining multiple n-best translations. Our experiments show that adding the aforementioned techniques to the baseline yields an excellent performance in the English-to-Portuguese translation task.

2019

pdf
Building English-to-Serbian Machine Translation System for IMDb Movie Reviews
Pintu Lohar | Maja Popović | Andy Way
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

This paper reports the results of the first experiment dealing with the challenges of building a machine translation system for user-generated content involving a complex South Slavic language. We focus on translation of English IMDb user movie reviews into Serbian, in a low-resource scenario. We explore the potential and limits of (i) phrase-based and neural machine translation systems trained on out-of-domain clean parallel data from news articles, and (ii) creating an additional synthetic in-domain parallel corpus by machine-translating the English IMDb corpus into Serbian. Our main findings are that morphology and syntax are better handled by the neural approach than by the phrase-based approach even in this low-resource mismatched domain scenario; however, the situation is different for the lexical aspect, especially for person names. This finding also indicates that in general, machine translation of person names into Slavic languages (especially those which require/allow transcription) should be investigated more systematically.

pdf bib
Proceedings of Machine Translation Summit XVII: Research Track
Mikel Forcada | Andy Way | Barry Haddow | Rico Sennrich
Proceedings of Machine Translation Summit XVII: Research Track

pdf
Lost in Translation: Loss and Decay of Linguistic Richness in Machine Translation
Eva Vanmassenhove | Dimitar Shterionov | Andy Way
Proceedings of Machine Translation Summit XVII: Research Track

pdf bib
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
Mikel Forcada | Andy Way | John Tinsley | Dimitar Shterionov | Celia Rico | Federico Gaspari
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
PRINCIPLE: Providing Resources in Irish, Norwegian, Croatian and Icelandic for the Purposes of Language Engineering
Andy Way | Federico Gaspari
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
Pivot Machine Translation in INTERACT Project
Chao-Hong Liu | Andy Way | Catarina Silva | André Martins
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
Large-scale Machine Translation Evaluation of the iADAATPA Project
Sheila Castilho | Natália Resende | Federico Gaspari | Andy Way | Tony O’Dowd | Marek Mazur | Manuel Herranz | Alex Helle | Gema Ramírez-Sánchez | Víctor Sánchez-Cartagena | Mārcis Pinnis | Valters Šics
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
When less is more in Neural Quality Estimation of Machine Translation. An industry case study
Dimitar Shterionov | Félix Do Carmo | Joss Moorkens | Eric Paquin | Dag Schmidtke | Declan Groves | Andy Way
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
Leveraging backtranslation to improve machine translation for Gaelic languages
Meghan Dowling | Teresa Lynn | Andy Way
Proceedings of the Celtic Language Technology Workshop

pdf bib
Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation

pdf bib
Proceedings of the Qualities of Literary Machine Translation
James Hadley | Maja Popović | Haithem Afli | Andy Way
Proceedings of the Qualities of Literary Machine Translation

pdf
Selecting Artificially-Generated Sentences for Fine-Tuning Neural Machine Translation
Alberto Poncelas | Andy Way
Proceedings of the 12th International Conference on Natural Language Generation

Neural Machine Translation (NMT) models tend to achieve the best performance when larger sets of parallel sentences are provided for training. For this reason, augmenting the training set with artificially-generated sentence pairs can boost the performance. Nonetheless, the performance can also be improved with a small number of sentences if they are in the same domain as the test set. Accordingly, we want to explore the use of artificially-generated sentences along with data-selection algorithms to improve NMT models trained solely with authentic data. In this work, we show how artificially-generated sentences can be more beneficial than authentic pairs and what their advantages are when used in combination with data-selection algorithms.

pdf
Investigating Terminology Translation in Statistical and Neural Machine Translation: A Case Study on English-to-Hindi and Hindi-to-English
Rejwanul Haque | Md Hasanuzzaman | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Terminology translation plays a critical role in domain-specific machine translation (MT). In this paper, we conduct a comparative qualitative evaluation on terminology translation in phrase-based statistical MT (PB-SMT) and neural MT (NMT) in two translation directions: English-to-Hindi and Hindi-to-English. For this, we select a test set from a legal domain corpus and create a gold standard for evaluating terminology translation in MT. We also propose an error typology taking the terminology translation errors into consideration. We evaluate the MT systems’ performance on terminology translation, and demonstrate our findings, unraveling strengths, weaknesses, and similarities of PB-SMT and NMT in the area of term translation.

pdf
Combining PBSMT and NMT Back-translated Data for Efficient NMT
Alberto Poncelas | Maja Popović | Dimitar Shterionov | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation, which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performance when the training set is augmented with back-translated data created by merging different MT approaches.

2018

pdf
Balancing Translation Quality and Sentiment Preservation (Non-archival Extended Abstract)
Pintu Lohar | Haithem Afli | Andy Way
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
SMT versus NMT: Preliminary comparisons for Irish
Meghan Dowling | Teresa Lynn | Alberto Poncelas | Andy Way
Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018)

pdf
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation
Antonio Toral | Sheila Castilho | Ke Hu | Andy Way
Proceedings of the Third Conference on Machine Translation: Research Papers

We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.

pdf
Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods
Catarina Cruz Silva | Chao-Hong Liu | Alberto Poncelas | Andy Way
Proceedings of the Third Conference on Machine Translation: Research Papers

Data selection is a process used in selecting a subset of parallel data for the training of machine translation (MT) systems, so that 1) resources for training might be reduced, 2) trained models could perform better than those trained with the whole corpus, and/or 3) trained models are more tailored to specific domains. It has been shown that for statistical MT (SMT), the use of data selection helps improve the MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) with the selected data using the three approaches. The results showed that for NMT systems, using data selection also improved the performance, though the gain is not as much as for SMT systems.

pdf
Improving Character-Based Decoding Using Target-Side Morphological Information for Neural Machine Translation
Peyman Passban | Qun Liu | Andy Way
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recently, neural machine translation (NMT) has emerged as a powerful alternative to conventional statistical approaches. However, its performance drops considerably in the presence of morphologically rich languages (MRLs). Neural engines usually fail to tackle the large vocabulary and high out-of-vocabulary (OOV) word rate of MRLs. Therefore, it is not suitable to exploit existing word-based models to translate this set of languages. In this paper, we propose an extension to the state-of-the-art model of Chung et al. (2016), which works at the character level and boosts the decoder with target-side morphological information. In our architecture, an additional morphology table is plugged into the model. Each time the decoder samples from a target vocabulary, the table sends auxiliary signals from the most relevant affixes in order to enrich the decoder’s current state and constrain it to provide better predictions. We evaluated our model to translate English into German, Russian, and Turkish as three MRLs and observed significant improvements.

pdf
Fine-Grained Temporal Orientation and its Relationship with Psycho-Demographic Correlates
Sabyasachi Kamila | Mohammed Hasanuzzaman | Asif Ekbal | Pushpak Bhattacharyya | Andy Way
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Temporal orientation refers to an individual’s tendency to connect to the psychological concepts of past, present or future, and it affects personality, motivation, emotion, decision making and stress coping processes. The study of social media users’ psycho-demographic attributes from the perspective of human temporal orientation can be of utmost interest and importance to business and administrative decision makers, as it can provide extra valuable information for them to make informed decisions. In this paper, we present a first study to demonstrate the association between the sentiment view of the temporal orientation of the users and their different psycho-demographic attributes by analyzing their tweets. We first create a temporal orientation classifier in a minimally supervised way which classifies each tweet of the users into one of three temporal categories, namely past, present, and future. A deep Bi-directional Long Short Term Memory (BLSTM) is used for the tweet classification task. Our tweet classifier achieves an accuracy of 78.27% when tested on a manually created test set. We then determine the users’ overall temporal orientation based on their tweets on social media. The sentiment is added to the tweets at the fine-grained level, where each temporal tweet is assigned a positive, negative or neutral sentiment. Our experiment reveals that depending upon the sentiment view of temporal orientation, a user’s attributes vary. We finally measure the correlation between the users’ sentiment view of temporal orientation and their different psycho-demographic factors using regression.

pdf
Tailoring Neural Architectures for Translating from Morphologically Rich Languages
Peyman Passban | Andy Way | Qun Liu
Proceedings of the 27th International Conference on Computational Linguistics

A morphologically complex word (MCW) is a hierarchical constituent with meaning-preserving subunits, so word-based models which rely on surface forms might not be powerful enough to translate such structures. When translating from morphologically rich languages (MRLs), a source word could be mapped to several words or even a full sentence on the target side, which means an MCW should not be treated as an atomic unit. In order to provide better translations for MRLs, we boost the existing neural machine translation (NMT) architecture with a double-channel encoder and a double-attentive decoder. The main goal targeted in this research is to provide richer information on the encoder side and redesign the decoder accordingly to benefit from such information. Our experimental results demonstrate that we could achieve our goal, as the proposed model outperforms existing subword- and character-based architectures and shows significant improvements on translating from German, Russian, and Turkish into English.

pdf
Incorporating Deep Visual Features into Multiobjective based Multi-view Search Results Clustering
Sayantan Mitra | Mohammed Hasanuzzaman | Sriparna Saha | Andy Way
Proceedings of the 27th International Conference on Computational Linguistics

This paper explores the use of multi-view learning for search result clustering. A web-snippet can be represented using multiple views. Apart from a textual view cued by both semantic and syntactic information, a complementary view extracted from images contained in the web-snippets is also utilized in the current framework. A single consensus partitioning is finally obtained after consulting these two individual views by the deployment of a multiobjective based clustering technique. Several objective functions, including the values of a cluster quality measure measuring the goodness of partitionings obtained using different views and an agreement-disagreement index quantifying the amount of oneness among multiple views in generating partitionings, are optimized simultaneously using AMOSA. In order to detect the number of clusters automatically, concepts of variable length solutions and a vast range of permutation operators are introduced in the clustering process. Finally, a set of alternative partitionings is obtained on the final Pareto front by the proposed multi-view based multiobjective technique. Experimental results of the proposed approach on several benchmark test datasets of SRC with respect to different performance metrics evidently establish the power of visual and text-based views in achieving better search result clustering.

pdf
FooTweets: A Bilingual Parallel Corpus of World Cup Tweets
Henny Sluyter-Gäthje | Pintu Lohar | Haithem Afli | Andy Way
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
The ADAPT System Description for the IWSLT 2018 Basque to English Translation Task
Alberto Poncelas | Andy Way | Kepa Sarasola
Proceedings of the 15th International Conference on Spoken Language Translation

In this paper we present the ADAPT system built for the Basque to English Low Resource MT Evaluation Campaign. Basque is a low-resourced, morphologically-rich language. This poses a challenge for Neural Machine Translation models which usually achieve better performance when trained with large sets of data. Accordingly, we used synthetic data to improve the translation quality produced by a model built using only authentic data. Our proposal uses back-translated data to: (a) create new sentences, so the system can be trained with more data; and (b) translate sentences that are close to the test set, so the model can be fine-tuned to the document to be translated.

pdf
Data Selection with Feature Decay Algorithms Using an Approximated Target Side
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the 15th International Conference on Spoken Language Translation

Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One such family of techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. In order to try to fix this problem, in this paper we propose to use an approximated target-side in addition to the source-side when selecting suitable sentence-pairs for training a model. This approximated target-side is built by pre-translating the source-side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.

pdf
Feature Decay Algorithms for Neural Machine Translation
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Neural Machine Translation (NMT) systems require a lot of data to be competitive. For this reason, data selection techniques are used only for fine-tuning systems that have been trained with larger amounts of data. In this work we aim to use Feature Decay Algorithms (FDA) data selection techniques not only to fine-tune a system but also to build a complete system with less data. Our findings reveal that it is possible to find a subset of sentence pairs that, when used for training a German-English NMT system, outperforms the full training corpus by 1.11 BLEU points.

pdf
Investigating Backtranslation in Neural Machine Translation
Alberto Poncelas | Dimitar Shterionov | Andy Way | Gideon Maillette de Buy Wenniger | Peyman Passban
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

A prerequisite for training corpus-based machine translation (MT) systems – either Statistical MT (SMT) or Neural MT (NMT) – is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus – both as a separate standalone dataset as well as combined with human-generated parallel data – affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.

pdf
Perception vs. Acceptability of TM and SMT Output: What do translators prefer?
Pilar Sánchez-Gijón | Joss Moorkens | Andy Way
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

This paper reports the results of two studies carried out with two different groups of professional translators to find out how professionals perceive and accept SMT in comparison with TM. The first group translated and post-edited segments from English into German, and the second group from English into Spanish. Both studies had equivalent settings in order to guarantee the comparability of the results. The results will also help to shed light upon the real benefit of SMT from which translators may take advantage.

pdf
ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | Paulo Vale | José Luis Fonseca | Teresa Lynn | Jane Dunne | Federico Gaspari | Andy Way | Victoria Arranz | Khalid Choukri | Vladimir Popescu | Pedro Neiva | Rui Neto | Maite Melero | David Perez Fernandez | Antonio Branco | Ruben Branco | Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.

pdf
Project PiPeNovel: Pilot on Post-editing Novels
Antonio Toral | Martijn Wieling | Sheila Castilho | Joss Moorkens | Andy Way
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Given (i) the rise of a new paradigm to machine translation based on neural networks that results in more fluent and less literal output than previous models and (ii) the maturity of machine-assisted translation via post-editing in industry, project PiPeNovel studies the feasibility of the post-editing workflow for literary text conducting experiments with professional literary translators.

pdf
SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags
Eva Vanmassenhove | Andy Way
Proceedings of ACL 2018, Student Research Workshop

In this paper we incorporate semantic supersense tags and syntactic supertag features into EN–FR and EN–DE factored NMT systems. In experiments on various test sets, we observe that such features (particularly when combined) help the NMT model training to converge faster and improve the model quality according to the BLEU scores.

pdf
Multi-Level Structured Self-Attentions for Distantly Supervised Relation Extraction
Jinhua Du | Jingguang Han | Andy Way | Dadong Wan
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Attention mechanisms are often used in deep neural networks for distantly supervised relation extraction (DS-RE) to distinguish valid from noisy instances. However, the traditional 1-D vector attention model is insufficient for learning different contexts in the selection of valid instances to predict the relationship for an entity pair. To alleviate this issue, we propose a novel multi-level structured (2-D matrix) self-attention mechanism for DS-RE in a multi-instance learning (MIL) framework using bidirectional recurrent neural networks (BiRNN). In the proposed method, a structured word-level self-attention learns a 2-D matrix where each row vector represents a weight distribution for different aspects of an instance regarding two entities. Targeting the MIL issue, the structured sentence-level attention learns a 2-D matrix where each row vector represents a weight distribution on selection of different valid instances. Experiments conducted on two publicly available DS-RE datasets show that the proposed framework with the multi-level structured self-attention mechanism significantly outperforms baselines in terms of PR curves, P@N and F1 measures.

pdf
Learning to Jointly Translate and Predict Dropped Pronouns with a Shared Reconstruction Mechanism
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models. In this work, we improve the original model from two perspectives. First, we employ a shared reconstructor to better exploit encoder and decoder representations. Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model. Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy.

pdf
Getting Gender Right in Neural Machine Translation
Eva Vanmassenhove | Christian Hardmeier | Andy Way
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Speakers of different languages must attend to and encode strikingly different aspects of the world in order to use their language correctly (Sapir, 1921; Slobin, 1996). One such difference is related to the way gender is expressed in a language. Saying “I am happy” in English does not encode any additional knowledge of the speaker that uttered the sentence. However, many other languages do have grammatical gender systems and so such knowledge would be encoded. In order to correctly translate such a sentence into, say, French, the inherent gender information needs to be retained/recovered. The same sentence would become either “Je suis heureux” for a male speaker or “Je suis heureuse” for a female one. Apart from morphological agreement, demographic factors (gender, age, etc.) also influence our use of language in terms of word choices or syntactic constructions (Tannen, 1991; Pennebaker et al., 2003). We integrate gender information into NMT systems. Our contribution is two-fold: (1) the compilation of large datasets with speaker information for 20 language pairs, and (2) a simple set of experiments that incorporate gender information into NMT for multiple language pairs. Our experiments show that adding a gender feature to an NMT system significantly improves the translation quality for some language pairs.

2017

pdf
Neural Pre-Translation for Hybrid Machine Translation
Jinhua Du | Andy Way
Proceedings of Machine Translation Summit XVI: Research Track

pdf
A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators
Sheila Castilho | Joss Moorkens | Federico Gaspari | Rico Sennrich | Vilelmini Sosoni | Panayota Georgakopoulou | Pintu Lohar | Andy Way | Antonio Valerio Miceli-Barone | Maria Gialama
Proceedings of Machine Translation Summit XVI: Research Track

pdf
Elastic-substitution decoding for Hierarchical SMT: efficiency, richer search and double labels
Gideon Maillette de Buy Wenniger | Khalil Sima’an | Andy Way
Proceedings of Machine Translation Summit XVI: Research Track

pdf
Temporality as Seen through Translation: A Case Study on Hindi Texts
Sabyasachi Kamila | Sukanta Sen | Mohammad Hasanuzzaman | Asif Ekbal | Andy Way | Pushpak Bhattacharyya
Proceedings of Machine Translation Summit XVI: Research Track

pdf
The INTERACT Project and Crisis MT
Sharon O’Brien | Chao-Hong Liu | Andy Way | João Graça | André Martins | Helena Moniz | Ellie Kemp | Rebecca Petras
Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track

pdf
Context-Aware Graph Segmentation for Graph-Based Translation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present an improved graph-based translation model which segments an input graph into node-induced subgraphs by taking source context into consideration. Translations are generated by combining subgraph translations left-to-right using beam search. Experiments on Chinese–English and German–English demonstrate that the context-aware segmentation significantly improves the baseline graph-based model.

pdf
Using Images to Improve Machine-Translating E-Commerce Product Listings.
Iacer Calixto | Daniel Stein | Evgeny Matusov | Pintu Lohar | Sheila Castilho | Andy Way
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper we study the impact of using images to machine-translate user-generated e-commerce product listings. We study how a multi-modal Neural Machine Translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attentional NMT and a Statistical Machine Translation (SMT) model. User-generated product listings often do not constitute grammatical or well-formed sentences. More often than not, they consist of the juxtaposition of short phrases or keywords. We train our models end-to-end as well as use text-only and multi-modal NMT models for re-ranking n-best lists generated by an SMT model. We qualitatively evaluate our user-generated training data and also analyse how adding synthetic data impacts the results. We evaluate our models quantitatively using BLEU and TER and find that (i) additional synthetic data has a general positive impact on text-only and multi-modal NMT models, and that (ii) using a multi-modal NMT model for re-ranking n-best lists improves TER significantly across different n-best list sizes.

pdf
Exploiting Cross-Sentence Context for Neural Machine Translation
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In translation, considering the document as a whole can help to resolve ambiguities and inconsistencies. In this paper, we propose a cross-sentence context-aware approach and investigate the influence of historical contextual information on the performance of neural machine translation (NMT). First, this history is summarized in a hierarchical way. We then integrate the historical representation into NMT in two strategies: 1) a warm-start of encoder and decoder states, and 2) an auxiliary context source for updating decoder states. Experimental results on a large Chinese-English translation task show that our approach significantly improves upon a strong attention-based NMT system by up to +2.1 BLEU points.

pdf
Demographic Word Embeddings for Racism Detection on Twitter
Mohammed Hasanuzzaman | Gaël Dias | Andy Way
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Most social media platforms grant users freedom of speech by allowing them to freely express their thoughts, beliefs, and opinions. Although this represents incredible and unique communication opportunities, it also presents important challenges. Online racism is one such example. In this study, we present a supervised learning strategy to detect racist language on Twitter based on word embeddings that incorporate demographic (Age, Gender, and Location) information. Our methodology achieves reasonable classification accuracy over a gold standard dataset (F1=76.3%) and significantly improves over the classification performance of demographic-agnostic models.

pdf
Semantics-Enhanced Task-Oriented Dialogue Translation: A Case Study on Hotel Booking
Longyue Wang | Jinhua Du | Liangyou Li | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the IJCNLP 2017, System Demonstrations

We showcase TODAY, a semantics-enhanced task-oriented dialogue translation system, whose novelties are: (i) task-oriented named entity (NE) definition and a hybrid strategy for NE recognition and translation; and (ii) a novel grounded semantic method for dialogue understanding and task-order management. TODAY is a case-study demo which can efficiently and accurately assist customers and agents in different languages to reach an agreement in a hotel booking dialogue.

pdf
ADAPT at IJCNLP-2017 Task 4: A Multinomial Naive Bayes Classification Approach for Customer Feedback Analysis task
Pintu Lohar | Koel Dutta Chowdhury | Haithem Afli | Mohammed Hasanuzzaman | Andy Way
Proceedings of the IJCNLP 2017, Shared Tasks

In this age of the digital economy, promoting organisations do their best to engage customers in the feedback provisioning process. With the assistance of customer insights, an organisation can develop a better product and provide a better service to its customers. In this paper, we analyse real-world samples of customer feedback from Microsoft Office customers in four languages, i.e., English, French, Spanish and Japanese, and arrive at a five-plus-one-class categorisation (comment, request, bug, complaint, meaningless and undetermined) for meaning classification. The task is to access multilingual corpora annotated by the proposed meaning categorization scheme and develop a system to determine what class(es) the customer feedback sentences should be annotated as in four languages. We propose the following approaches to accomplish this task: (i) a multinomial naive Bayes (MNB) approach for multi-label classification, (ii) MNB with a one-vs-rest classifier approach, and (iii) the combination of the multilabel classification-based and the sentiment classification-based approach. Our best system produces F-scores of 0.67, 0.83, 0.72 and 0.7 for English, Spanish, French and Japanese, respectively. The results are competitive with the best ones for all languages and secure 3rd and 5th position for Japanese and French, respectively, among all submitted systems.

pdf
Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search
Ahmad Khwileh | Haithem Afli | Gareth Jones | Andy Way
Proceedings of the Third Arabic Natural Language Processing Workshop

Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance.

pdf
Ethical Considerations in NLP Shared Tasks
Carla Parra Escartín | Wessel Reijers | Teresa Lynn | Joss Moorkens | Andy Way | Chao-Hong Liu
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Shared tasks are increasingly common in our field, and new challenges are suggested at almost every conference and workshop. However, as this has become an established way of pushing research forward, it is important to discuss how we researchers organise and participate in shared tasks, and make that information available to the community to allow further research improvements. In this paper, we present a number of ethical issues along with other areas of concern that are related to the competitive nature of shared tasks. As such issues could potentially impact on research ethics in the Natural Language Processing community, we also propose the development of a framework for the organisation of and participation in shared tasks that can help mitigate against these issues arising.

pdf
Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles
Iacer Calixto | Daniel Stein | Evgeny Matusov | Sheila Castilho | Andy Way
Proceedings of the Sixth Workshop on Vision and Language

In this paper, we study how humans perceive the use of images as an additional knowledge source to machine-translate user-generated product listings in an e-commerce company. We conduct a human evaluation where we assess how a multi-modal neural machine translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attention-based NMT and a phrase-based statistical machine translation (PBSMT) model. We evaluate translations obtained with different systems and also discuss the data set of user-generated product listings, which in our case comprises both product listings and associated images. We found that humans preferred translations obtained with a PBSMT system to both text-only and multi-modal NMT over 56% of the time. Nonetheless, human evaluators ranked translations from a multi-modal NMT model as better than those of a text-only NMT over 88% of the time, which suggests that images do help NMT in this use-case.

pdf bib
MultiNews: A Web collection of an Aligned Multimodal and Multilingual Corpus
Haithem Afli | Pintu Lohar | Andy Way
Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora

Integrating Natural Language Processing (NLP) and computer vision is a promising effort. However, the applicability of these methods directly depends on the availability of specific multimodal data that includes images and texts. In this paper, we present a multimodal corpus of comparable texts and their images in 9 languages, collected from the web news articles of the Euronews website. This corpus has found widespread use in the NLP community in multilingual and multimodal tasks. Here, we focus on the acquisition of the image and text data and their multilingual alignment.

2016

pdf
Identifying Temporal Orientation of Word Senses
Mohammed Hasanuzzaman | Gaël Dias | Stéphane Ferrari | Yann Mathet | Andy Way
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf
TraMOOC (Translation for Massive Open Online Courses): providing reliable MT for MOOCs
Valia Kordoni | Lexi Birch | Ioana Buliga | Kostadin Cholakov | Markus Egg | Federico Gaspari | Yota Georgakopolou | Maria Gialama | Iris Hendrickx | Mitja Jermol | Katia Kermanidis | Joss Moorkens | Davor Orlic | Michael Papadopoulos | Maja Popović | Rico Sennrich | Vilelmini Sosoni | Dimitrios Tsoumakos | Antal van den Bosch | Menno van Zaanen | Andy Way
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf
A Novel Approach to Dropped Pronoun Translation
Longyue Wang | Zhaopeng Tu | Xiaojun Zhang | Hang Li | Andy Way | Qun Liu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Using BabelNet to Improve OOV Coverage in SMT
Jinhua Du | Andy Way | Andrzej Zydron
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Out-of-vocabulary words (OOVs) are a ubiquitous and difficult problem in statistical machine translation (SMT). This paper studies different strategies of using BabelNet to alleviate the negative impact brought about by OOVs. BabelNet is a multilingual encyclopedic dictionary and a semantic network, which not only includes lexicographic and encyclopedic terms, but connects concepts and named entities in a very large network of semantic relations. By taking advantage of the knowledge in BabelNet, three different methods (using direct training data, domain-adaptation techniques and the BabelNet API) are proposed in this paper to obtain translations for OOVs to improve system performance. Experimental results on English–Polish and English–Chinese language pairs show that domain adaptation can better utilize BabelNet knowledge and performs better than other methods. The results also demonstrate that BabelNet is a really useful tool for improving translation performance of SMT systems.

pdf
Enhancing Access to Online Education: Quality Machine Translation of MOOC Content
Valia Kordoni | Antal van den Bosch | Katia Lida Kermanidis | Vilelmini Sosoni | Kostadin Cholakov | Iris Hendrickx | Matthias Huck | Andy Way
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, video lectures, assignments, and MOOC forum text are automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students’ forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal.

pdf
Using SMT for OCR Error Correction of Historical Texts
Haithem Afli | Zhengwei Qiu | Andy Way | Páraic Sheridan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

A trend to digitize historical paper-based archives has emerged in recent years, with the advent of digital optical scanners. A lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, different kinds of errors can be found in the OCR system output text, and Automatic Error Correction tools can help to improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement compared to the initial baseline.

pdf
ProphetMT: A Tree-based SMT-driven Controlled Language Authoring/Post-Editing Tool
Xiaofeng Wu | Jinhua Du | Qun Liu | Andy Way
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents ProphetMT, a tree-based SMT-driven Controlled Language (CL) authoring and post-editing tool. ProphetMT employs the source-side rules in a translation model and provides them as auto-suggestions to users. Accordingly, one might say that users are writing in a Controlled Language that is understood by the computer. ProphetMT also allows users to easily attach structural information as they compose content. When a specific rule is selected, a partial translation is promptly generated on-the-fly with the help of the structural information. Our experiments conducted on English-to-Chinese show that our proposed ProphetMT system can not only better regularise an author’s writing behaviour, but also significantly improve translation fluency which is vital to reduce the post-editing time. Additionally, when the writing and translation process is over, ProphetMT can provide an effective colour scheme to further improve the productivity of post-editors by explicitly featuring the relations between the source and target rules.

pdf
Automatic Construction of Discourse Corpora for Dialogue Translation
Longyue Wang | Xiaojun Zhang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, a novel approach is proposed to automatically construct a parallel discourse corpus for dialogue machine translation. Firstly, the parallel subtitle data and its corresponding monolingual movie script data are crawled and collected from the Internet. Then tags such as speaker and discourse boundary from the script data are projected to its subtitle data via an information retrieval approach in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results, but also integrate speaker information into the translation. Experiments show our proposed method can achieve 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, and speaker-based language model adaptation can obtain around 0.5 BLEU points improvement in translation quality. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.

pdf
Graph-Based Translation Via Graph Segmentation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Phrase-Level Combination of SMT and TM Using Constrained Word Lattice
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Extending Phrase-Based Translation with Dependencies by Using Graphs
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

pdf
The ADAPT Bilingual Document Alignment system at WMT16
Pintu Lohar | Haithem Afli | Chao-Hong Liu | Andy Way
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf
Improving Phrase-Based SMT Using Cross-Granularity Embedding Similarity
Peyman Passban | Chris Hokamp | Andy Way | Qun Liu
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf
Comparing Translator Acceptability of TM and SMT Outputs
Joss Moorkens | Andy Way
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf
Integrating Optical Character Recognition and Machine Translation of Historical Documents
Haithem Afli | Andy Way
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible to machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new correction system, based on the combination of Language Modeling and Translation methods, outperforms the baseline system by nearly 30% relative improvement.

pdf
Using Wordnet to Improve Reordering in Hierarchical Phrase-Based Statistical Machine Translation
Arefeh Kazemi | Antonio Toral | Andy Way
Proceedings of the 8th Global WordNet Conference (GWC)

We propose the use of WordNet synsets in a syntax-based reordering model for hierarchical statistical machine translation (HPB-SMT) to enable the model to generalize to phrases not seen in the training data but that have equivalent meaning. We detail our methodology to incorporate synsets’ knowledge in the reordering model and evaluate the resulting WordNet-enhanced SMT systems on the English-to-Farsi language direction. The inclusion of synsets leads to the best BLEU score, outperforming the baseline (standard HPB-SMT) by 0.6 points absolute.

pdf
Improving KantanMT Training Efficiency with fast_align
Dimitar Shterionov | Jinhua Du | Marc Anthony Palminteri | Laura Casanellas | Tony O’Dowd | Andy Way
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

pdf
Fast Gated Neural Domain Adaptation: Language Model as a Case Study
Jian Zhang | Xiaofeng Wu | Andy Way | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Neural network training has been shown to be advantageous in many natural language processing applications, such as language modelling or machine translation. In this paper, we describe in detail a novel domain adaptation mechanism in neural network training. Instead of learning and adapting the neural network on millions of training sentences – which can be very time-consuming or even infeasible in some cases – we design a domain adaptation gating mechanism which can be used in recurrent neural networks and quickly learn the out-of-domain knowledge directly from the word vector representations with little speed overhead. In our experiments, we use the recurrent neural network language model (LM) as a case study. We show that the neural LM perplexity can be reduced by 7.395 and 12.011 using the proposed domain adaptation mechanism on the Penn Treebank and News data, respectively. Furthermore, we show that using the domain-adapted neural LM to re-rank the statistical machine translation n-best list on the French-to-English language pair can significantly improve translation quality.

pdf
Topic-Informed Neural Machine Translation
Jian Zhang | Liangyou Li | Andy Way | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In recent years, neural machine translation (NMT) has demonstrated state-of-the-art machine translation (MT) performance. It is a new approach to MT, which tries to learn a set of parameters to maximize the conditional probability of target sentences given source sentences. In this paper, we present a novel approach to improve the translation performance in NMT by conveying topic knowledge during translation. The proposed topic-informed NMT can increase the likelihood of selecting words from the same topic and domain for translation. Experimentally, we demonstrate that topic-informed NMT can achieve a 1.15 (3.3% relative) and 1.67 (5.4% relative) absolute improvement in BLEU score on the Chinese-to-English language pair using NIST 2004 and 2005 test sets, respectively, compared to NMT without topic information.

pdf
Enriching Phrase Tables for Statistical Machine Translation Using Mixed Embeddings
Peyman Passban | Qun Liu | Andy Way
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The phrase table is considered to be the main bilingual resource for the phrase-based statistical machine translation (PBSMT) model. During translation, a source sentence is decomposed into several phrases. The best match of each source phrase is selected among several target-side counterparts within the phrase table, and processed by the decoder to generate a sentence-level translation. The best match is chosen according to several factors, including a set of bilingual features. PBSMT engines by default provide four probability scores in phrase tables which are considered as the main set of bilingual features. Our goal is to enrich that set of features, as a better feature set should yield better translations. We propose new scores generated by a Convolutional Neural Network (CNN) which indicate the semantic relatedness of phrase pairs. We evaluate our model in different experimental settings with different language pairs. We observe significant improvements when the proposed features are incorporated into the PBSMT pipeline.

2015

pdf
Translating Literary Text between Related Languages using SMT
Antonio Toral | Andy Way
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

pdf
ParFDA for Fast Deployment of Accurate Statistical Machine Translation Systems, Benchmarks, and Statistics
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf
Referential Translation Machines for Predicting Translation Quality and Related Statistics
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the 18th Annual Conference of the European Association for Machine Translation
İlknur Durgar El-Kahlout | Mehmed Özkan | Felipe Sánchez-Martínez | Gema Ramírez-Sánchez | Fred Hollowood | Andy Way
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Dependency-based Reordering Model for Constituent Pairs in Hierarchical SMT
Arefeh Kazemi | Antonio Toral | Andy Way | Amirhassan Monadjemi | Mohammadali Nematbakhsh
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Benchmarking SMT Performance for Farsi Using the TEP++ Corpus
Peyman Passban | Andy Way | Qun Liu
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
TraMOOC: Translation for Massive Open Online Courses
Valia Kordoni | Kostadin Cholakov | Markus Egg | Andy Way | Lexi Birch | Katia Kermanidis | Vilelmini Sosoni | Dimitrios Tsoumakos | Antal van den Bosch | Iris Hendrickx | Michael Papadopoulos | Panayota Georgakopoulou | Maria Gialama | Menno van Zaanen | Ioana Buliga | Mitja Jermol | Davor Orlic
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A. Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel L. Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Dependency Graph-to-String Translation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
An empirical study of segment prioritization for incrementally retrained post-editing-based SMT
Jinhua Du | Ankit Srivastava | Andy Way | Alfredo Maldonado-Guerra | David Lewis
Proceedings of Machine Translation Summit XV: Papers

pdf
Domain adaptation for social localisation-based SMT: a case study using the Trommons platform
Jinhua Du | Andy Way | Zhengwei Qiu | Asanka Wasala | Reinhard Schaler
Proceedings of the 4th Workshop on Post-editing Technology and Practice

2014

pdf
A probabilistic feature-based fill-up for SMT
Jian Zhang | Liangyou Li | Andy Way | Qun Liu
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

In this paper, we describe an effective translation model combination approach based on the estimation of a probabilistic Support Vector Machine (SVM). We collect domain knowledge from both in-domain and general-domain corpora inspired by a commonly used data selection algorithm, which we then use as features for the SVM training. Drawing on previous work on binary-featured phrase table fill-up (Nakov, 2008; Bisazza et al., 2011), we substitute the binary feature in the original work with our probabilistic domain-likeness feature. Later, we design two experiments to evaluate the proposed probabilistic feature-based approach on the French-to-English language pair using data provided at WMT07, WMT13 and IWSLT11 translation tasks. Our experiments demonstrate that translation performance can gain significant improvements of up to +0.36 and +0.82 BLEU points by using our probabilistic feature-based translation model fill-up approach compared with the binary featured fill-up approach in both experiments.

pdf
A discriminative framework of integrating translation memory features into SMT
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Combining Translation Memory (TM) with Statistical Machine Translation (SMT) has been demonstrated to be beneficial. In this paper, we present a discriminative framework which can integrate TM into SMT by incorporating TM-related feature functions. Experiments on English–Chinese and English–French tasks show that our system using TM feature functions only from the best fuzzy match performs significantly better than the baseline phrase-based system on both tasks, and our discriminative model achieves comparable results to those of an effective generative model which uses similar features. Furthermore, with the capacity of handling a large number of features in the discriminative framework, we propose a method to efficiently use multiple fuzzy matches which brings more feature functions and further significantly improves our system.

pdf
Perception vs. reality: measuring machine translation post-editing productivity
Federico Gaspari | Antonio Toral | Sudip Kumar Naskar | Declan Groves | Andy Way
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

This paper presents a study of user-perceived vs real machine translation (MT) post-editing effort and productivity gains, focusing on two bidirectional language pairs: English–German and English–Dutch. Twenty experienced media professionals post-edited statistical MT output and also manually translated comparative texts within a production environment. The paper compares the actual post-editing time against the users’ perception of the effort and time required to post-edit the MT output to achieve publishable quality, thus measuring real (vs perceived) productivity gains. Although for all the language pairs users perceived MT post-editing to be slower, in fact it proved to be a faster option than manual translation for two translation directions out of four, i.e. for Dutch to English, and (marginally) for English to German. For further objective scrutiny, the paper also checks the correlation of three state-of-the-art automatic MT evaluation metrics (BLEU, METEOR and TER) with the actual post-editing time.

pdf
Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
The DCU-ICTCAS MT system at WMT 2014 on German-English Translation Task
Liangyou Li | Xiaofeng Wu | Santiago Cortés Vaíllo | Jun Xie | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic Rules
Raphael Rubino | Antonio Toral | Victor M. Sánchez-Cartagena | Jorge Ferrández-Tordera | Sergio Ortiz-Rojas | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez | Andy Way
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation task
Xiaofeng Wu | Rejwanul Haque | Tsuyoshi Okita | Piyush Arora | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
DCU Terminology Translation System for Medical Query Subtask at WMT14
Tsuyoshi Okita | Ali Vahid | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Referential Translation Machines for Predicting Translation Quality
Ergun Biçici | Andy Way
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Transformation and Decomposition for Efficiently Implementing and Improving Dependency-to-String Model In Moses
Liangyou Li | Jun Xie | Andy Way | Qun Liu
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Bilingual Termbank Creation via Log-Likelihood Comparison and Phrase-Based Statistical Machine Translation
Rejwanul Haque | Sergio Penkale | Andy Way
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

pdf bib
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Mauro Cettolo | Marcello Federico | Lucia Specia | Andy Way
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf
Standard language variety conversion for content localisation via SMT
Federico Fancellu | Andy Way | Morgan O’Brien
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf
Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain
Antonio Toral | Raphael Rubino | Miquel Esplà-Gomis | Tommi Pirinen | Andy Way | Gema Ramírez-Sánchez
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf
RTM-DCU: Referential Translation Machines for Semantic Similarity
Ergun Biçici | Andy Way
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf
Is machine translation ready for literature?
Antonio Toral | Andy Way
Proceedings of Translating and the Computer 36

2013

pdf
Emerging use-cases for machine translation
Andy Way
Proceedings of Translating and the Computer 35

pdf bib
COACH: Designing a new CAT tool with Translator Interaction
Laura Bota | Christoph Schneider | Andy Way
Proceedings of Machine Translation Summit XIV: User track

2012

pdf
SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss in detail each data pre-processing step using various approaches. Apart from the quantity (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First of all, high quality has been achieved both in terms of translation and in terms of high-precision alignment of parallel documents and their contents. Secondly, the contents are provided in one consistent format and encoding. Finally, additional information such as type of content in terms of genres and domain is available.

pdf
Translation Quality-Based Supplementary Data Selection by Incremental Update of Translation Models
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of COLING 2012

pdf bib
Hierarchical Phrase-Based MT for Phonetic Representation-Based Speech Translation
Zeeshan Ahmed | Jie Jiang | Julie Carson-Berndsen | Peter Cahill | Andy Way
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

The paper presents a novel technique for speech translation using hierarchical phrase-based statistical machine translation (HPB-SMT). The system is based on translation of speech from phone sequences, as opposed to the conventional approach of speech translation from word sequences. The technique facilitates speech translation by allowing a machine translation (MT) system to access phonetic information. This enables the MT system to act as both a word recognition and a translation component. This results in better performance than conventional speech translation approaches by recovering from recognition errors with the help of a source language model, translation model and target language model. For this purpose, the MT translation models are adapted to work on source language phones using a grapheme-to-phoneme component. The source-side phonetic confusions are handled using a confusion network. The result on the IWSLT'10 English-Chinese translation task shows a significant improvement in translation quality. In this paper, results for HPB-SMT are compared with previously published results of a phrase-based statistical machine translation (PB-SMT) system (Baseline). The HPB-SMT system outperforms PB-SMT in this regard.

pdf bib
Taking Statistical Machine Translation to the Student Translator
Stephen Doherty | Dorothy Kenny | Andy Way
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program

Despite the growth of statistical machine translation (SMT) research and development in recent years, it remains somewhat out of reach for the translation community where programming expertise and knowledge of statistics tend not to be commonplace. While the concept of SMT is relatively straightforward, its implementation in functioning systems remains difficult for most, regardless of expertise. More recently, however, developments such as SmartMATE have emerged which aim to assist users in creating their own customized SMT systems and thus reduce the learning curve associated with SMT. In addition to commercial uses, translator training stands to benefit from such increased levels of inclusion and access to state-of-the-art approaches to MT. In this paper we draw on experience in developing and evaluating a new syllabus in SMT for a cohort of post-graduate student translators: we identify several issues encountered in the introduction of student translators to SMT, and report on data derived from repeated measures questionnaires that aim to capture data on students’ self-efficacy in the use of SMT. Overall, results show that participants report significant increases in their levels of confidence and knowledge of MT in general, and of SMT in particular. Additional benefits – such as increased technical competence and confidence – and future refinements are also discussed.

pdf
Translating User-Generated Content in the Social Networking Space
Jie Jiang | Andy Way | Rejwanul Haque
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program

This paper presents a case-study of work done by Applied Language Solutions (ALS) for a large social networking provider who claim to have built the world’s first multi-language social network, where Internet users from all over the world can communicate in languages that are available in the system. In an initial phase, the social networking provider contracted ALS to build Machine Translation (MT) engines for twelve language-pairs: Russian⇔English, Russian⇔Turkish, Russian⇔Arabic, Turkish⇔English, Turkish⇔Arabic and Arabic⇔English. All of the input data is user-generated content, so we faced a number of problems in building large-scale, robust, high-quality engines. Primarily, much of the source-language data is of ‘poor’ or at least ‘non-standard’ quality. This comes in many forms: (i) content produced by non-native speakers, (ii) content produced by native speakers containing non-deliberate typos, or (iii) content produced by native speakers which deliberately departs from spelling norms to bring about some linguistic effect. Accordingly, in addition to the ‘regular’ pre-processing techniques used in the building of our statistical MT systems, we needed to develop routines to deal with all these scenarios. In this paper, we describe how we handle shortforms, acronyms, typos, punctuation errors, non-dictionary slang, wordplay, censor avoidance and emoticons. We demonstrate automatic evaluation scores on the social network data, together with insights from the social networking provider regarding some of the typical errors made by the MT engines, and how we managed to correct these in the engines.

pdf
SmartMATE: An Online End-To-End MT Post-Editing Framework
Sergio Penkale | Andy Way
Workshop on Post-Editing Technology and Practice

It is a well-known fact that the amount of content which is available to be translated and localized far exceeds the current amount of translation resources. Automation in general and Machine Translation (MT) in particular are among the key technologies which can help improve this situation. However, a tool that integrates all of the components needed for the localization process is still missing, and MT is still out of reach for most localisation professionals. In this paper we present an online translation environment which empowers users with MT by enabling engines to be created from their data, without a need for technical knowledge or special hardware requirements and at low cost. Documents in a variety of formats can then be post-edited after being processed with their Translation Memories, MT engines and glossaries. We give an overview of the tool and present a case study of a project for a large games company, showing the applicability of our tool.

pdf bib
Monolingual Data Optimisation for Bootstrapping SMT Engines
Jie Jiang | Andy Way | Nelson Ng | Rejwanul Haque | Mike Dillinger | Jun Lu
Workshop on Monolingual Machine Translation

Content localisation via machine translation (MT) is a sine qua non, especially for international online business. While most applications utilise rule-based solutions due to the lack of suitable in-domain parallel corpora for statistical MT (SMT) training, in this paper we investigate the possibility of applying SMT where huge amounts of monolingual content only are available. We describe a case study where an analysis of a very large amount of monolingual online trading data from eBay is conducted by ALS with a view to reducing this corpus to the most representative sample in order to ensure the widest possible coverage of the total data set. Furthermore, minimal yet optimal sets of sentences/words/terms are selected for generation of initial translation units for future SMT system-building.

pdf
Combining EBMT, SMT, TM and IR Technologies for Quality and Scale
Sandipan Dandapat | Sara Morrissey | Andy Way | Josef van Genabith
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Proceedings of the 16th Annual Conference of the European Association for Machine Translation
Mauro Cettolo | Marcello Federico | Lucia Specia | Andy Way
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf bib
From Subtitles to Parallel Corpora
Mark Fishel | Yota Georgakopoulou | Sergio Penkale | Volha Petukhova | Matej Rojc | Martin Volk | Andy Way
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf
Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf
Extending CCG-based Syntactic Constraints in Hierarchical Phrase-Based SMT
Hala Almaghout | Jie Jiang | Andy Way
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

pdf
Incorporating Source-Language Paraphrases into Phrase-Based SMT with Confusion Networks
Jie Jiang | Jinhua Du | Andy Way
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
Yanjun Ma | Yifan He | Andy Way | Josef van Genabith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Andy Way | Patrick Pantel
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf
The DCU machine translation systems for IWSLT 2011
Pratyush Banerjee | Hala Almaghout | Sudip Naskar | Johann Roturier | Jie Jiang | Andy Way | Josef van Genabith
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we provide a description of the Dublin City University’s (DCU) submissions in the IWSLT 2011 evaluation campaign. We participated in the Arabic-English and Chinese-English Machine Translation (MT) track translation tasks. We use phrase-based statistical machine translation (PBSMT) models to create the baseline system. Due to the open-domain nature of the data to be translated, we use domain adaptation techniques to improve the quality of translation. Furthermore, we explore target-side syntactic augmentation for a Hierarchical Phrase-Based (HPB) SMT model. Combinatory Categorial Grammar (CCG) is used to extract labels for target-side phrases and non-terminals in the HPB system. Combining the domain adapted language models with the CCG-augmented HPB system gave us the best translations for both language pairs, providing statistically significant improvements of 6.09 absolute BLEU points (25.94% relative) and 1.69 absolute BLEU points (15.89% relative) over the unadapted PBSMT baselines for the Arabic-English and Chinese-English language pairs, respectively.

pdf
Phonetic Representation-Based Speech Translation
Jie Jiang | Zeeshan Ahmed | Julie Carson-Berndsen | Peter Cahill | Andy Way
Proceedings of Machine Translation Summit XIII: Papers

pdf
Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component Level Mixture Modelling
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
Rich Linguistic Features for Translation Memory-Inspired Consistent Translation
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
A Framework for Diagnostic Evaluation of MT Based on Linguistic Checkpoints
Sudip Kumar Naskar | Antonio Toral | Federico Gaspari | Andy Way
Proceedings of Machine Translation Summit XIII: Papers

pdf
Automatic acquisition of named entities for rule-based machine translation
Antonio Toral | Andy Way
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper proposes to enrich RBMT dictionaries with Named Entities (NEs) automatically acquired from Wikipedia. The method is applied to the Apertium English–Spanish system and its performance compared to that of Apertium with and without handtagged NEs. The system with automatic NEs outperforms the one without NEs, while results vary when compared to a system with handtagged NEs (results are comparable for Spanish→English but slightly worse for English→Spanish). Apart from that, adding automatic NEs contributes to decreasing the amount of unknown terms by more than 10%.

pdf
A Comparative Evaluation of Research vs. Online MT Systems
Antonio Toral | Federico Gaspari | Sudip Kumar Naskar | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Experiments on Domain Adaptation for Patent Machine Translation in the PLuTO project
Alexandru Ceauşu | John Tinsley | Jian Zhang | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Towards a User-Friendly Webservice Architecture for Statistical Machine Translation in the PANACEA project
Antonio Toral | Pavel Pecina | Marc Poch | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Oracle-based Training for Phrase-based Statistical Machine Translation
Ankit Srivastava | Yanjun Ma | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Using Example-Based MT to Support Statistical MT when Translating Homogeneous Data in a Resource-Poor Setting
Sandipan Dandapat | Sara Morrissey | Andy Way | Mikel L. Forcada
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Combining Semantic and Syntactic Generalization in Example-Based Machine Translation
Sarah Ebling | Andy Way | Martin Volk | Sudip Kumar Naskar
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
CCG Contextual labels in Hierarchical Phrase-Based SMT
Hala Almaghout | Jie Jiang | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation
Pavel Pecina | Antonio Toral | Andy Way | Vassilis Papavassiliou | Prokopis Prokopidis | Maria Giagkou
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

pdf
Statistical Analysis of Alignment Characteristics for Phrase-based Machine Translation
Patrik Lambert | Simon Petitrenaud | Yanjun Ma | Andy Way
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
TMX Markup: A Challenge When Adapting SMT to the Localisation Environment
Jinhua Du | Johann Roturier | Andy Way
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
Lattice Score Based Data Cleaning for Phrase-Based Statistical Machine Translation
Jie Jiang | Julie Carson-Berndsen | Andy Way
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
The Impact of Source–Side Syntactic Reordering on Hierarchical Phrase-based SMT
Jinhua Du | Andy Way
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
An Augmented Three-Pass System Combination Framework: DCU Combination System for WMT 2010
Jinhua Du | Pavel Pecina | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
The DCU Dependency-Based Metric in WMT-MetricsMATR 2010
Yifan He | Jinhua Du | Andy Way | Josef van Genabith
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
Handling Named Entities and Compound Verbs in Phrase-Based Statistical Machine Translation
Santanu Pal | Sudip Kumar Naskar | Pavel Pecina | Sivaji Bandyopadhyay | Andy Way
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf
Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT
Jie Jiang | Jinhua Du | Andy Way
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
HMM Word-to-Phrase Alignment with Dependency Constraints
Yanjun Ma | Andy Way
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
Multi-Word Expression-Sensitive Word Alignment
Tsuyoshi Okita | Alfredo Maldonado Guerra | Yvette Graham | Andy Way
Proceedings of the 4th Workshop on Cross Lingual Information Access

pdf
The DCU machine translation systems for IWSLT 2010
Hala Almaghout | Jie Jiang | Andy Way
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
CCG augmented hierarchical phrase-based machine translation
Hala Almaghout | Jie Jiang | Andy Way
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

pdf
Using TERp to Augment the System Combination for SMT
Jinhua Du | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases. There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Paraphrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties of the TERp, we also propose a two-pass decoding strategy for the lattice-based phrase-level confusion network (CN) to generate the final result. The experiments conducted on the NIST2008 Chinese-to-English test set show that our TERp-based augmented system combination framework achieves significant improvements in terms of BLEU and TERp scores compared to the state-of-the-art word-level system combination framework and a TER-based combination strategy.

pdf
Improved Phrase-based SMT with Syntactic Reordering Patterns Learned from Lattice Scoring
Jie Jiang | Jinhua Du | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we present a novel approach to incorporate source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit and utilize reordering information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned and tested on the generated word lattices to show the benefits of adding potential source-side reorderings in the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for a Chinese-English machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected testset and 8.56% relative on the NIST 2008 testset in terms of BLEU score.

pdf
Combining Multi-Domain Statistical Machine Translation Models using Automatic Classifiers
Pratyush Banerjee | Jinhua Du | Baoli Li | Sudip Naskar | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper presents a set of experiments on Domain Adaptation of Statistical Machine Translation systems. The experiments focus on Chinese-English and two domain-specific corpora. The paper presents a novel approach for combining multiple domain-trained translation models to achieve improved translation quality for both domain-specific as well as combined sets of sentences. We train a statistical classifier to classify sentences according to the appropriate domain and utilize the corresponding domain-specific MT models to translate them. Experimental results show that the method achieves a statistically significant absolute improvement of 1.58 BLEU points (2.86% relative improvement) over a translation model trained on combined data, and considerable improvements over a model using multiple decoding paths of the Moses decoder, for the combined domain test set. Furthermore, even for domain-specific test sets, our approach works almost as well as dedicated domain-specific models and perfect classification.

pdf
Supertags as Source Language Context in Hierarchical Phrase-Based SMT
Rejwanul Haque | Sudip Naskar | Antal van den Bosch | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% BLEU scores in translation quality when adding CCG and LTAG supertags, respectively, as context-informed features.

pdf
Improving the Post-Editing Experience using Translation Recommendation: A User Study
Yifan He | Yanjun Ma | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We analyze the effectiveness of the model as well as the reaction of potential users. Based on the performance statistics and the users’ comments, we find that translation recommendation can reduce the workload of professional post-editors and improve the acceptance of MT in the localization industry.

pdf
Accuracy-Based Scoring for Phrase-Based Statistical Machine Translation
Sergio Penkale | Yanjun Ma | Daniel Galron | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring translation quality, the estimation of the features themselves does not have any relation to such quality metrics. In this paper, we introduce a translation quality-based feature to PB-SMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging the edit-distance between phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and phrase pairs occurring in the N-best list. Using our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also report that using our method we can achieve statistically significant improvements over the baseline using many other MT evaluation metrics, and a substantial increase in speed and reduction in memory use (due to a reduction in phrase-table size of 87%) while maintaining significant gains in translation quality.

pdf
PLuTO: MT for On-Line Patent Translation
John Tinsley | Andy Way | Páraic Sheridan
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program

PLuTO – Patent Language Translation Online – is a partially EU-funded commercialization project which specializes in the automatic retrieval and translation of patent documents. At the core of the PLuTO framework is a machine translation (MT) engine through which web-based translation services are offered. The fully integrated PLuTO architecture includes a translation engine coupling MT with translation memories (TM), and a patent search and retrieval engine. In this paper, we first describe the motivating factors behind the provision of such a service. Following this, we give an overview of the PLuTO framework as a whole, with particular emphasis on the MT components, and provide a real world use case scenario in which PLuTO MT services are exploited.

pdf
A Discriminative Latent Variable-Based “DE” Classifier for Chinese-English SMT
Jinhua Du | Andy Way
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
Integrating N-best SMT Outputs into a TM System
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Coling 2010: Posters

pdf
Facilitating Translation Using Source Language Paraphrase Lattices
Jinhua Du | Jie Jiang | Andy Way
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf
Bridging SMT and TM with Translation Recommendation
Yifan He | Yanjun Ma | Josef van Genabith | Andy Way
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

pdf
Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation
Yanjun Ma | Andy Way
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf
Lexicalized Semi-incremental Dependency Parsing
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the International Conference RANLP-2009

pdf
Learning Labelled Dependencies in Machine Translation Evaluation
Yifan He | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Optimal Bilingual Data for French-English PB-SMT
Sylwia Ozdowska | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Marker-Based Filtering of Bilingual Phrase Pairs for SMT
Felipe Sánchez-Martínez | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Using Supertags as Source Language Context in SMT
Rejwanul Haque | Sudip Kumar Naskar | Yanjun Ma | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Tuning Syntactically Enhanced Word Alignment for Statistical Machine Translation
Yanjun Ma | Patrik Lambert | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Low-resource machine translation using MaTrEx
Yanjun Ma | Tsuyoshi Okita | Özlem Çetinoğlu | Jinhua Du | Andy Way
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we give a description of the Machine Translation (MT) system developed at DCU that was used for our fourth participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2009). Two techniques are deployed in our system in order to improve the translation quality in a low-resource scenario. The first technique is to use multiple segmentations in MT training and to utilise word lattices in the decoding stage. The second technique is used to select the optimal training data that can be used to build MT systems. In this year’s participation, we use three different prototype SMT systems, and the outputs of the systems are combined using a standard system combination method. Our system is the top system for the Chinese–English CHALLENGE task in terms of BLEU score.

pdf
Capturing Lexical Variation in MT Evaluation Using Automatically Built Sense-Cluster Inventories
Marianna Apidianaki | Yifan He | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Dependency Relations as Source Context in Phrase-Based SMT
Rejwanul Haque | Sudip Kumar Naskar | Antal van den Bosch | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Experiments on Domain Adaptation for English–Hindi SMT
Rejwanul Haque | Sudip Kumar Naskar | Josef van Genabith | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf
MATREX: The DCU MT System for WMT 2009
Jinhua Du | Yifan He | Sergio Penkale | Andy Way
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf
Web Service Integration for Next Generation Localisation
David Lewis | Stephen Curran | Kevin Feeney | Zohar Etzioni | John Keeney | Andy Way | Reinhard Schäler
Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009)

pdf
English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

pdf
Accuracy-Based Scoring for DOT: Towards Direct Error Minimization for Data-Oriented Translation
Daniel Galron | Sergio Penkale | Andy Way | I. Dan Melamed
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
A Syntactified Direct Translation Model with Linear-time Decoding
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
Source-Side Context-Informed Hypothesis Alignment for Combining Outputs from Machine Translation Systems
Jinhua Du | Yanjun Ma | Andy Way
Proceedings of Machine Translation Summit XII: Posters

pdf
Improving the Objective Function in Minimum Error Rate Training
Yifan He | Andy Way
Proceedings of Machine Translation Summit XII: Posters

pdf
Tracking Relevant Alignment Characteristics for Machine Translation
Patrik Lambert | Yanjun Ma | Sylwia Ozdowska | Andy Way
Proceedings of Machine Translation Summit XII: Posters

pdf
Using Percolated Dependencies for Phrase Extraction in SMT
Ankit Srivastava | Andy Way
Proceedings of Machine Translation Summit XII: Posters

2008

pdf
Wide-Coverage Deep Statistical Parsing Using Automatic Dependency Structure Annotation
Aoife Cahill | Michael Burke | Ruth O’Donovan | Stefan Riezler | Josef van Genabith | Andy Way
Computational Linguistics, Volume 34, Number 1, March 2008

pdf
MaTrEx: The DCU MT System for WMT 2008
John Tinsley | Yanjun Ma | Sylwia Ozdowska | Andy Way
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Improving Word Alignment Using Syntactic Dependencies
Yanjun Ma | Sylwia Ozdowska | Yanli Sun | Andy Way
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)

pdf
The ATIS Sign Language Corpus
Jan Bungeroth | Daniel Stein | Philippe Dreuw | Hermann Ney | Sara Morrissey | Andy Way | Lynette van Zijl
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Systems that automatically process sign language rely on appropriate data. We therefore present the ATIS sign language corpus that is based on the domain of air travel information. It is available for five languages, English, German, Irish sign language, German sign language and South African sign language. The corpus can be used for different tasks like automatic statistical translation and automatic sign language recognition and it allows the specific modeling of spatial references in signing space.

pdf bib
Exploiting alignment techniques in MATREX: the DCU machine translation system for IWSLT 2008.
Yanjun Ma | John Tinsley | Hany Hassan | Jinhua Du | Andy Way
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we try out our word packing and syntax-enhanced word alignment techniques for the Chinese–English task and for the English–Chinese task for the first time. For all translation tasks except Arabic–English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to solve the problem of the high number of out of vocabulary items. We also carried out experiments combining both in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language model-based method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and the data combination method we proposed can significantly improve system performance.

pdf
Automatic Generation of Parallel Treebanks
Ventsislav Zhechev | Andy Way
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers
Andy Way | Barbara Gawronska
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Capturing translational divergences with a statistical tree-to-tree aligner
Mary Hearne | John Tinsley | Ventsislav Zhechev | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Alignment-guided chunking
Yanjun Ma | Nicolas Stroppa | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Hand in hand: automatic sign language to English translation
Daniel Stein | Philippe Dreuw | Hermann Ney | Sara Morrissey | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Exploiting source similarity for SMT using context-informed features
Nicolas Stroppa | Antal van den Bosch | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
MaTrEx: the DCU machine translation system for IWSLT 2007
Hany Hassan | Yanjun Ma | Andy Way
Proceedings of the Fourth International Workshop on Spoken Language Translation

In this paper, we give a description of the machine translation system developed at DCU that was used for our second participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2007). In this participation, we focus on some new methods to improve system quality. Specifically, we try our word packing technique for different language pairs, we smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to address the high number of out-of-vocabulary items, and finally we deploy a translation-based model for case and punctuation restoration. We participated in both the classical and challenge tasks for the following translation directions: Chinese–English, Japanese–English and Arabic–English. For the last two tasks, we translated both the single-best ASR hypotheses and the correct recognition results; for Chinese–English, we just translated the correct recognition results. We report the results of the system for the provided evaluation sets, together with some additional experiments carried out following identification of some simple tokenisation errors in the official runs.

pdf
Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation
Gorka Labaka | Nicolas Stroppa | Andy Way | Kepa Sarasola
Proceedings of Machine Translation Summit XI: Papers

pdf
Combining data-driven MT systems for improved sign language translation
Sara Morrissey | Andy Way | Daniel Stein | Jan Bungeroth | Hermann Ney
Proceedings of Machine Translation Summit XI: Papers

pdf
Robust language pair-independent sub-tree alignment
John Tinsley | Ventsislav Zhechev | Mary Hearne | Andy Way
Proceedings of Machine Translation Summit XI: Papers

pdf
Dependency-Based Automatic Evaluation for Machine Translation
Karolina Owczarzak | Josef van Genabith | Andy Way
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf
Labelled Dependencies in Machine Translation Evaluation
Karolina Owczarzak | Josef van Genabith | Andy Way
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
Supertagged Phrase-Based Statistical Machine Translation
Hany Hassan | Khalil Sima’an | Andy Way
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf
Bootstrapping Word Alignment via Word Packing
Yanjun Ma | Nicolas Stroppa | Andy Way
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf
MATREX: DCU machine translation system for IWSLT 2006.
Nicolas Stroppa | Andy Way
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf
Multi-Engine Machine Translation by Recursive Sentence Decomposition
Bart Mellebeek | Karolina Owczarzak | Josef Van Genabith | Andy Way
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

In this paper, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English→Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.

pdf
Wrapper Syntax for Example-based Machine Translation
Karolina Owczarzak | Bart Mellebeek | Declan Groves | Josef Van Genabith | Andy Way
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn II Treebank we show that our method can raise the BLEU score up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces a better output in terms of fluency than the baseline EBMT in 55% of the cases and in terms of accuracy in 53% of the cases.

pdf
Example-Based Machine Translation of the Basque Language
Nicolas Stroppa | Declan Groves | Andy Way | Kepa Sarasola
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.

pdf
Disambiguation Strategies for Data-Oriented Translation
Mary Hearne | Andy Way
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

pdf
Hybridity in MT. Experiments on the Europarl Corpus
Declan Groves | Andy Way
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

pdf
A Syntactic Skeleton for Statistical Machine Translation
Bart Mellebeek | Karolina Owczarzak | Declan Groves | Josef Van Genabith | Andy Way
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

pdf
Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation
Karolina Owczarzak | Declan Groves | Josef Van Genabith | Andy Way
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf
Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks
Ruth O’Donovan | Michael Burke | Aoife Cahill | Josef van Genabith | Andy Way
Computational Linguistics, Volume 31, Number 3, September 2005

pdf
TransBooster: boosting the performance of wide-coverage machine translation systems
Bart Mellebeek | Anna Khasin | Josef Van Genabith | Andy Way
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
Improving Online Machine Translation Systems
Bart Mellebeek | Anna Khasin | Karolina Owczarzak | Josef Van Genabith | Andy Way
Proceedings of Machine Translation Summit X: Papers

In (Mellebeek et al., 2005), we proposed the design, implementation and evaluation of a novel and modular approach to boost the translation performance of existing, wide-coverage, freely available machine translation systems, based on reliable and fast automatic decomposition of the translation input and corresponding composition of translation output. Despite showing some initial promise, our method did not improve on the baseline Logomedia and Systran MT systems. In this paper, we improve on the algorithm presented in (Mellebeek et al., 2005), and on the same test data, show increased scores for a range of automatic evaluation metrics. Our algorithm now outperforms Logomedia, obtains similar results to SDL and falls tantalisingly short of the performance achieved by Systran.

pdf
An Example-Based Approach to Translating Sign Language
Sara Morrissey | Andy Way
Workshop on example-based machine translation

Users of sign languages are often forced to use a language in which they have reduced competence simply because documentation in their preferred format is not available. While some research exists on translating between natural and sign languages, we present here what we believe to be the first attempt to tackle this problem using an example-based (EBMT) approach. Having obtained a set of English–Dutch Sign Language examples, we employ an approach to EBMT using the ‘Marker Hypothesis’ (Green, 1979), analogous to the successful system of (Way & Gough, 2003), (Gough & Way, 2004a) and (Gough & Way, 2004b). In a set of experiments, we show that encouragingly good translation quality may be obtained using such an approach.

pdf
Hybrid Example-Based SMT: the Best of Both Worlds?
Declan Groves | Andy Way
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf
Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar
Michael Burke | Olivia Lam | Aoife Cahill | Rowena Chan | Ruth O’Donovan | Adams Bodomo | Josef van Genabith | Andy Way
Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation

pdf
Example-based controlled translation
Nano Gough | Andy Way
Proceedings of the 9th EAMT Workshop: Broadening horizons of machine translation and its applications

pdf
Robust large-scale EBMT with marker-based segmentation
Nano Gough | Andy Way
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

pdf
Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations
Aoife Cahill | Michael Burke | Ruth O’Donovan | Josef van Genabith | Andy Way
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf
Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank
Ruth O’Donovan | Michael Burke | Aoife Cahill | Josef van Genabith | Andy Way
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf
Robust Sub-Sentential Alignment of Phrase-Structure Trees
Declan Groves | Mary Hearne | Andy Way
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

pdf
Controlled generation in example-based machine translation
Nano Gough | Andy Way
Proceedings of Machine Translation Summit IX: Papers

The theme of controlled translation is currently in vogue in the area of MT. Recent research (Schäler et al., 2003; Carl, 2003) hypothesises that EBMT systems are perhaps best suited to this challenging task. In this paper, we present an EBMT system where the generation of the target string is filtered by data written according to controlled language specifications. As far as we are aware, this is the only research available on this topic. In the field of controlled language applications, it is more usual to constrain the source language in this way rather than the target. We translate a small corpus of controlled English into French using the on-line MT system Logomedia, and seed the memories of our EBMT system with a set of automatically induced lexical resources using the Marker Hypothesis as a segmentation tool. We test our system on a large set of sentences extracted from a Sun Translation Memory, and provide both an automatic and a human evaluation. For comparative purposes, we also provide results for Logomedia itself.

pdf
Seeing the wood for the trees: data-oriented translation
Mary Hearne | Andy Way
Proceedings of Machine Translation Summit IX: Papers

Data-Oriented Translation (DOT), which is based on Data-Oriented Parsing (DOP), comprises an experience-based approach to translation, where new translations are derived with reference to grammatical analyses of previous translations. Previous DOT experiments [Poutsma, 1998, Poutsma, 2000a, Poutsma, 2000b] were small in scale because important advances in DOP technology were not incorporated into the translation model. Despite this, related work [Way, 1999, Way, 2003a, Way, 2003b] reports that DOT models are viable in that solutions to ‘hard’ translation cases are readily available. However, it has not been shown to date that DOT models scale to larger datasets. In this work, we describe a novel DOT system, inspired by recent advances in DOP parsing technology. We test our system on larger, more complex corpora than have been used heretofore, and present both automatic and human evaluations which show that high quality translations can be achieved at reasonable speeds.

pdf
Teaching and assessing empirical approaches to machine translation
Andy Way | Nano Gough
Workshop on Teaching Translation Technologies and Tools

Empirical methods in Natural Language Processing (NLP) and Machine Translation (MT) have become mainstream in the research field. Accordingly, it is important that the tools and techniques in these paradigms be taught to potential future researchers and developers in University courses. While many dedicated courses on Statistical NLP can be found, there are few, if any, courses on Empirical Approaches to MT. This paper presents the development and assessment of one such course as taught to final year undergraduates taking a degree in NLP.

pdf
wEBMT: Developing and Validating an Example-Based Machine Translation System using the World Wide Web
Andy Way | Nano Gough
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2002

pdf
Testing students’ understanding of complex transfer
Andy Way
Proceedings of the 6th EAMT Workshop: Teaching Machine Translation

pdf bib
Toward a hybrid integrated translation environment
Michael Carl | Andy Way | Reinhard Schäler
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

In this paper we present a model for the future use of Machine Translation (MT) and Computer Assisted Translation. In order to accommodate the future needs in middle value translations, we discuss a number of MT techniques and architectures. We anticipate a hybrid environment that integrates data- and rule-driven approaches where translations will be routed through the available translation options and consumers will receive accurate information on the quality, pricing and time implications of their translation choice.

pdf
Example-based machine translation via the Web
Nano Gough | Andy Way | Mary Hearne
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

One of the limitations of translation memory systems is that the smallest translation units currently accessible are aligned sentential pairs. We propose an example-based machine translation system which uses a ‘phrasal lexicon’ in addition to the aligned sentences in its database. These phrases are extracted from the Penn Treebank using the Marker Hypothesis as a constraint on segmentation. They are then translated by three on-line machine translation (MT) systems, and a number of linguistic resources are automatically constructed which are used in the translation of new input. We perform two experiments on testsets of sentences and noun phrases to demonstrate the effectiveness of our system. In so doing, we obtain insights into the strengths and weaknesses of the selected on-line MT systems. Finally, like many example-based machine translation systems, our approach also suffers from the problem of ‘boundary friction’. Where the quality of resulting translations is compromised as a result, we use a novel, post hoc validation procedure via the World Wide Web to correct imperfect translations prior to their being output to the user.

2001

pdf bib
Workshop on Example-Based machine Translation
Michael Carl | Andy Way
Workshop on Example-Based machine Translation

pdf
Translating with examples
Andy Way
Workshop on Example-Based machine Translation

pdf
Teaching machine translation & translation technology: a contrastive study
Dorothy Kenny | Andy Way
Workshop on Teaching Machine Translation

The Machine Translation course at Dublin City University is taught to undergraduate students in Applied Computational Linguistics, while Computer-Assisted Translation is taught on two translator-training programmes, one undergraduate and one postgraduate. Given the differing backgrounds of these sets of students, the course material, methods of teaching and assessment all differ. We report here on our experiences of teaching these courses over a number of years, which we hope will be of interest to lecturers of similar existing courses, as well as providing a reference point for others who may be considering the introduction of such material.

2000

pdf
LFG-DOT: Combining Constraint-Based and Empirical Methodologies for Robust MT
Andy Way
Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999)

pdf
LFG-DOT: a probabilistic, constraint-based model for machine translation
Andy Way
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)
