David Vilar

2024

pdf bib abs
Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model
Christian Tomani | David Vilar | Markus Freitag | Colin Cherry | Subhajit Naskar | Mara Finkelstein | Xavier Garcia | Daniel Cremers
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations getting assigned a higher score by the model. However, research has shown that this assumption does not always hold, and generation quality can be improved by decoding to optimize a utility function backed by a metric or quality-estimation signal, as is done by Minimum Bayes Risk (MBR) or Quality-Aware decoding. The main disadvantage of these approaches is that they require an additional model to calculate the utility function during decoding, significantly increasing the computational cost. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. Using this approach for MBR decoding we can drastically reduce the size of the candidate list, resulting in a speed-up of two-orders of magnitude. When applying our method to MAP decoding we obtain quality gains similar or even superior to quality reranking approaches, but with the efficiency of single pass decoding.

pdf bib abs
Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
Mara Finkelstein | David Vilar | Markus Freitag
Proceedings of the Ninth Conference on Machine Translation

Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of a LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT’23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT’23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM’s strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving performance of NMT models.

2023

pdf bib abs
Prompting PaLM for Translation: Assessing Strategies and Performance
David Vilar | Markus Freitag | Colin Cherry | Jiaming Luo | Viresh Ratnakar | George Foster
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM’s MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM’s MT output which reveals some interesting properties and prospects for future work.

Quality Estimation (QE), the evaluation of machine translation output without the need of explicit references, has seen big improvements in the last years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad quality sentence pairs in the training data of neural machine translation systems (NMT). While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.

2022

pdf bib abs
A Natural Diet: Towards Improving Naturalness of Machine Translation Output
Markus Freitag | David Vilar | David Grangier | Colin Cherry | George Foster
Findings of the Association for Computational Linguistics: ACL 2022

Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to translation style. This means that, even when considered accurate and fluent, MT output can still sound less natural than high quality human translations or text originally written in the target language. Machine translation output notably exhibits lower lexical diversity, and employs constructs that mirror those in the source sentence. In this work we propose a method for training MT systems to achieve a more natural style, i.e. mirroring the style of text originally written in the target language. Our method tags parallel training data according to the naturalness of the target side by contrasting language models trained on natural and translated data. Tagging data allows us to put greater emphasis on target sentences originally written in the target language. Automatic metrics show that the resulting models achieve lexical richness on par with human translations, mimicking a style much closer to sentences originally written in the target language. Furthermore, we find that their output is preferred by human experts when compared to the baseline translations.

2021

pdf bib abs
Controlling Machine Translation for Multiple Attributes with Additive Interventions
Andrea Schioppa | David Vilar | Artem Sokolov | Katja Filippova
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Fine-grained control of machine translation (MT) outputs along multiple attributes is critical for many modern MT applications and is a requirement for gaining users’ trust. A standard approach for exerting control in MT is to prepend the input with a special tag to signal the desired output attribute. Despite its simplicity, attribute tagging has several drawbacks: continuous values must be binned into discrete categories, which is unnatural for certain applications; interference between multiple tags is poorly understood. We address these problems by introducing vector-valued interventions which allow for fine-grained control over multiple attributes simultaneously via a weighted linear combination of the corresponding vectors. For some attributes, our approach even allows for fine-tuning a model trained without annotations to support such interventions. In experiments with three attributes (length, politeness and monotonicity) and two language pairs (English to German and Japanese) our models achieve better control over a wider range of tasks compared to tagging, and translation quality does not degrade when no control is requested. Finally, we demonstrate how to enable control in an already trained model after a relatively cheap fine-tuning stage.

pdf bib abs
Bandits Don’t Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits
Julia Kreutzer | David Vilar | Artem Sokolov
Findings of the Association for Computational Linguistics: EMNLP 2021

Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with MT model parameters to relieve system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in a way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, or data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying data sets.

pdf bib abs
A Statistical Extension of Byte-Pair Encoding
David Vilar | Marcello Federico
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

Sub-word segmentation is currently a standard tool for training neural machine translation (MT) systems and other NLP tasks. The goal is to split words (both in the source and target languages) into smaller units which then constitute the input and output vocabularies of the MT system. The aim of reducing the size of the input and output vocabularies is to increase the generalization capabilities of the translation model, enabling the system to translate and generate infrequent and new (unseen) words at inference time by combining previously seen sub-word units. Ideally, we would expect the created units to have some linguistic meaning, so that words are created in a compositional way. However, the most popular word-splitting method, Byte-Pair Encoding (BPE), which originates from the data compression literature, does not include explicit criteria to favor linguistic splittings nor to find the optimal sub-word granularity for the given training data. In this paper, we propose a statistically motivated extension of the BPE algorithm and an effective convergence criterion that avoids the costly experimentation cycle needed to select the best sub-word vocabulary size. Experimental results with morphologically rich languages show that our model achieves nearly-optimal BLEU scores and produces morphologically better word segmentations, which allows to outperform BPE’s generalization in the translation of sentences containing new words, as shown via human evaluation.

2020

pdf bib
The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020
Tobias Domhan | Michael Denkowski | David Vilar | Xing Niu | Felix Hieber | Kenneth Heafield
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib abs
Sockeye 2: A Toolkit for Neural Machine Translation
Felix Hieber | Tobias Domhan | Michael Denkowski | David Vilar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet’s Gluon API, a focus on state of the art model architectures, and distributed mixed precision training. These improvements result in faster training and inference, higher automatic metric scores, and a shorter path from research to production.

2019

pdf bib
Automatic error classification with multiple error labels
Maja Popovic | David Vilar
Proceedings of Machine Translation Summit XVII: Research Track

2018

pdf bib abs
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
Matt Post | David Vilar
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The end-to-end nature of neural machine translation (NMT) removes many ways of manually guiding the translation process that were available in older paradigms. Recent work, however, has introduced a new capability: lexically constrained or guided decoding, a modification to beam search that forces the inclusion of pre-specified words and phrases in the output. However, while theoretically sound, existing approaches have computational complexities that are either linear (Hokamp and Liu, 2017) or exponential (Anderson et al., 2017) in the number of constraints. We present a algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints. We demonstrate the algorithm’s remarkable ability to properly place these constraints, and use it to explore the shaky relationship between model and BLEU scores. Our implementation is available as part of Sockeye.

pdf bib abs
Learning Hidden Unit Contribution for Adapting Neural Machine Translation Models
David Vilar
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

In this paper we explore the use of Learning Hidden Unit Contribution for the task of neural machine translation. The method was initially proposed in the context of speech recognition for adapting a general system to the specific acoustic characteristics of each speaker. Similar in spirit, in a machine translation framework we want to adapt a general system to a specific domain. We show that the proposed method achieves improvements of up to 2.6 BLEU points over a general system, and up to 6 BLEU points if the initial system has only been trained on out-of-domain data, a situation which may easily happen in practice. The good performance together with its short training time and small memory footprint make it a very attractive solution for domain adaptation.

2014

pdf bib
Simple and Effective Approach for Consistent Training of Hierarchical Phrase-based Translation Models
Stephan Peitz | David Vilar | Hermann Ney
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Human translators are the key to evaluating machine translation (MT) quality and also to addressing the so far unanswered question when and how to use MT in professional translation workflows. This paper describes the corpus developed as a result of a detailed large scale human evaluation consisting of three tightly connected tasks: ranking, error classification and post-editing.

2013

pdf bib
A CCG-based Quality Estimation Metric for Statistical Machine Translation Learning from Human Judgments of Machine Translation Output
Maja Popovic | Eleftherios Avramidis | Aljoscha Burchardt | Sabine Hunsicker | Sven Schmeier | Cindy Tscherwinka | David Vilar
Proceedings of Machine Translation Summit XIV: Posters

pdf bib
What can we learn about the selection mechanism for post-editing?
Maja Popović | Eleftherios Avramidis | Aljoscha Burchardt | David Vilar | Hans Uszkoreit
Proceedings of the 2nd Workshop on Post-editing Technology and Practice

pdf bib
A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation
Matthias Huck | David Vilar | Markus Freitag | Hermann Ney
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

2012

pdf bib
Towards the Integration of MT into a LSP Translation Workflow
David Vilar | Michael Schneider | Aljoscha Burchardt | Thomas Wedde
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf bib abs
Involving Language Professionals in the Evaluation of Machine Translation
Eleftherios Avramidis | Aljoscha Burchardt | Christian Federmann | Maja Popović | Cindy Tscherwinka | David Vilar
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Significant breakthroughs in machine translation only seem possible if human translators are taken into the loop. While automatic evaluation and scoring mechanisms such as BLEU have enabled the fast development of systems, it is not clear how systems can meet real-world (quality) requirements in industrial translation scenarios today. The taraXÜ project paves the way for wide usage of hybrid machine translation outputs through various feedback loops in system development. In a consortium of research and industry partners, the project integrates human translators into the development process for rating and post-editing of machine translation outputs thus collecting feedback for possible improvements.

pdf bib
DFKI’s SMT System for WMT 2012
David Vilar
Proceedings of the Seventh Workshop on Statistical Machine Translation

2011

pdf bib
Advancements in Arabic-to-English Hierarchical Machine Translation
Matthias Huck | David Vilar | Daniel Stein | Hermann Ney
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf bib abs
DFKI’s SC and MT submissions to IWSLT 2011
David Vilar | Eleftherios Avramidis | Maja Popović | Sabine Hunsicker
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

We describe DFKI’s submission to the System Combination and Machine Translation tracks of the 2011 IWSLT Evaluation Campaign. We focus on a sentence selection mechanism which chooses the (hopefully) best sentence among a set of candidates. The rationale behind it is to take advantage of the strengths of each system, especially given an heterogeneous dataset like the one in this evaluation campaign, composed of TED Talks of very different topics. We focus on using features that correlate well with human judgement and, while our primary system still focus on optimizing the BLEU score on the development set, our goal is to move towards optimizing directly the correlation with human judgement. This kind of system is still under development and was used as a secondary submission.

pdf bib
Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features
Eleftherios Avramidis | Maja Popovic | David Vilar | Aljoscha Burchardt
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Evaluation without references: IBM1 scores as evaluation metrics
Maja Popović | David Vilar | Eleftherios Avramidis | Aljoscha Burchardt
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
Jia Xu | Hans Uszkoreit | Casey Kennington | David Vilar | Xiaojun Zhang
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation
Matthias Huck | David Vilar | Daniel Stein | Hermann Ney
Proceedings of the First workshop on Unsupervised Learning in NLP

2010

pdf bib abs
A Cocktail of Deep Syntactic Features for Hierarchical Machine Translation
Daniel Stein | Stephan Peitz | David Vilar | Hermann Ney
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

In this work we review and compare three additional syntactic enhancements for the hierarchical phrase-based translation model, which have been presented in the last few years. We compare their performance when applied separately and study whether the combination may yield additional improvements. Our findings show that the models are complementary, and their combination achieve an increase of 1% in BLEU and a reduction of nearly 2% in TER. The models presented in this work are made available as part of the Jane open source machine translation toolkit.

pdf bib abs
The RWTH Aachen machine translation system for IWSLT 2010
Saab Mansour | Stephan Peitz | David Vilar | Joern Wuebker | Hermann Ney
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper we describe the statistical machine translation system of the RWTH Aachen University developed for the translation task of the IWSLT 2010. This year, we participated in the BTEC translation task for the Arabic to English language direction. We experimented with two state-of-theart decoders: phrase-based and hierarchical-based decoders. Extensions to the decoders included phrase training (as opposed to heuristic phrase extraction) for the phrase-based decoder, and soft syntactic features for the hierarchical decoder. Additionally, we experimented with various rule-based and statistical-based segmenters for Arabic. Due to the different decoders and the different methodologies that we apply for segmentation, we expect that there will be complimentary variation in the results achieved by each system. The next step would be to exploit these variations and achieve better results by combining the systems. We try different strategies for system combination and report significant improvements over the best single system.

pdf bib abs
A combination of hierarchical systems with forced alignments from phrase-based systems
Carmen Heger | Joern Wuebker | David Vilar | Hermann Ney
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

Currently most state-of-the-art statistical machine translation systems present a mismatch between training and generation conditions. Word alignments are computed using the well known IBM models for single-word based translation. Afterwards phrases are extracted using extraction heuristics, unrelated to the stochastic models applied for finding the word alignment. In the last years, several research groups have tried to overcome this mismatch, but only with limited success. Recently, the technique of forced alignments has shown to improve translation quality for a phrase-based system, applying a more statistically sound approach to phrase extraction. In this work we investigate the first steps to combine forced alignment with a hierarchical model. Experimental results on IWSLT and WMT data show improvements in translation quality of up to 0.7% BLEU and 1.0% TER.

pdf bib
If I only had a parser: poor man’s syntax for hierarchical machine translation
David Vilar | Daniel Stein | Stephan Peitz | Hermann Ney
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

pdf bib
Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models
David Vilar | Daniel Stein | Matthias Huck | Hermann Ney
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf bib
On LM Heuristics for the Cube Growing Algorithm
David Vilar | Hermann Ney
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf bib
The RWTH Machine Translation System for WMT 2009
Maja Popović | David Vilar | Daniel Stein | Evgeny Matusov | Hermann Ney
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

RWTH’s system for the 2008 IWSLT evaluation consists of a combination of different phrase-based and hierarchical statistical machine translation systems. We participated in the translation tasks for the Chinese-to-English and Arabic-to-English language pairs. We investigated different preprocessing techniques, reordering methods for the phrase-based system, including reordering of speech lattices, and syntax-based enhancements for the hierarchical systems. We also tried the combination of the Arabic-to-English and Chinese-to-English outputs as an additional submission.

pdf bib abs
Analysing soft syntax features and heuristics for hierarchical phrase based machine translation.
David Vilar | Daniel Stein | Hermann Ney
Proceedings of the 5th International Workshop on Spoken Language Translation: Papers

Similar to phrase-based machine translation, hierarchical systems produce a large proportion of phrases, most of which are supposedly junk and useless for the actual translation. For the hierarchical case, however, the amount of extracted rules is an order of magnitude bigger. In this paper, we investigate several soft constraints in the extraction of hierarchical phrases and whether these help as additional scores in the decoding to prune unneeded phrases. We show the methods that help best.

2007

pdf bib abs
The RWTH machine translation system for IWSLT 2007
Arne Mauser | David Vilar | Gregor Leusch | Yuqi Zhang | Hermann Ney
Proceedings of the Fourth International Workshop on Spoken Language Translation

The RWTH system for the IWSLT 2007 evaluation is a combination of several statistical machine translation systems. The combination includes Phrase-Based models, a n-gram translation model and a hierarchical phrase model. We describe the individual systems and the method that was used for combining the system outputs. Compared to our 2006 system, we newly introduce a hierarchical phrase-based translation model and show improvements in system combination for Machine Translation. RWTH participated in the Italian-to-English and Chinese-to-English translation directions.

pdf bib
Analysis and System Combination of Phrase- and N-Gram-Based Statistical Machine Translation Systems
Marta R. Costa-jussà | Josep M. Crego | David Vilar | José A. R. Fonollosa | José B. Mariño | Hermann Ney
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Can We Translate Letters?
David Vilar | Jan-Thorsten Peter | Hermann Ney
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
Human Evaluation of Machine Translation Through Binary System Comparisons
David Vilar | Gregor Leusch | Hermann Ney | Rafael E. Banchs
Proceedings of the Second Workshop on Statistical Machine Translation

2006

pdf bib
AER: do we need to “improve” our alignments?
David Vilar | Maja Popovic | Hermann Ney
Proceedings of the Third International Workshop on Spoken Language Translation: Papers

pdf bib abs
Error Analysis of Statistical Machine Translation Output
David Vilar | Jia Xu | Luis Fernando D’Haro | Hermann Ney
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Evaluation of automatic translation output is a difficult task. Several performance measures like Word Error Rate, Position Independent Word Error Rate and the BLEU and NIST scores are widely use and provide a useful tool for comparing different systems and to evaluate improvements within a system. However the interpretation of all of these measures is not at all clear, and the identification of the most prominent source of errors in a given system using these measures alone is not possible. Therefore some analysis of the generated translations is needed in order to identify the main problems and to focus the research efforts. This area is however mostly unexplored and few works have dealt with it until now. In this paper we will present a framework for classification of the errors of a machine translation system and we will carry out an error analysis of the system used by the RWTH in the first TC-STAR evaluation.

2005

pdf bib
Comparison of generation strategies for interactive machine translation
Oliver Bender | Saša Hasan | David Vilar | Richard Zens | Hermann Ney
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib abs
Statistical Machine Translation of European Parliamentary Speeches
David Vilar | Evgeny Matusov | Sasa Hasan | Richard Zens | Hermann Ney
Proceedings of Machine Translation Summit X: Papers

In this paper we present the ongoing work at RWTH Aachen University for building a speech-to-speech translation system within the TC-Star project. The corpus we work on consists of parliamentary speeches held in the European Plenary Sessions. To our knowledge, this is the first project that focuses on speech-to-speech translation applied to a real-life task. We describe the statistical approach used in the development of our system and analyze its performance under different conditions: dealing with syntactically correct input, dealing with the exact transcription of speech and dealing with the (noisy) output of an automatic speech recognition system. Experimental results show that our system is able to perform adequately in each of these conditions.

pdf bib
Augmenting a Small Parallel Text with Morpho-Syntactic Language
Maja Popović | David Vilar | Hermann Ney | Slobodan Jovičić | Zoran Šarić
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf bib
Novel Reordering Approaches in Phrase-Based Statistical Machine Translation
Stephan Kanthak | David Vilar | Evgeny Matusov | Richard Zens | Hermann Ney
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf bib
Preprocessing and Normalization for Automatic Evaluation of Machine Translation
Gregor Leusch | Nicola Ueffing | David Vilar | Hermann Ney
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization