Christof Monz

2024

pdf abs
Disentangling the Roles of Target-side Transfer and Regularization in Multilingual Machine Translation
Yan Meng | Christof Monz
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multilingual Machine Translation (MMT) benefits from knowledge transfer across different language pairs. However, improvements in one-to-many translation compared to many-to-one translation are only marginal and sometimes even negligible. This performance discrepancy raises the question of to what extent positive transfer plays a role on the target-side for one-to-many MT. In this paper, we conduct a large-scale study that varies the auxiliary target-side languages along two dimensions, i.e., linguistic similarity and corpus size, to show the dynamic impact of knowledge transfer on the main language pairs. We show that linguistically similar auxiliary target languages exhibit strong ability to transfer positive knowledge. With an increasing size of similar target languages, the positive transfer is further enhanced to benefit the main language pairs. Meanwhile, we find distant auxiliary target languages can also unexpectedly benefit main language pairs, even with minimal positive transfer ability. Apart from transfer, we show distant auxiliary target languages can act as a regularizer to benefit translation performance by enhancing the generalization and model inference calibration.

pdf abs
Analyzing the Evaluation of Cross-Lingual Knowledge Transfer in Multilingual Language Models
Sara Rajaee | Christof Monz
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in training multilingual language models on large datasets seem to have shown promising results in knowledge transfer across languages and achieve high performance on downstream tasks. However, we question to what extent the current evaluation benchmarks and setups accurately measure zero-shot cross-lingual knowledge transfer. In this work, we challenge the assumption that high zero-shot performance on target tasks reflects high cross-lingual ability by introducing more challenging setups involving instances with multiple languages. Through extensive experiments and analysis, we show that the observed high performance of multilingual models can be largely attributed to factors not requiring the transfer of actual linguistic knowledge, such as task- and surface-level knowledge. More specifically, we observe what has been transferred across languages is mostly data artifacts and biases, especially for low-resource languages. Our findings highlight the overlooked drawbacks of existing cross-lingual test data and evaluation setups, calling for a more nuanced understanding of the cross-lingual capabilities of multilingual models.

2023

pdf abs
NonFactS: NonFactual Summary Generation for Factuality Evaluation in Document Summarization
Amir Soleimani | Christof Monz | Marcel Worring
Findings of the Association for Computational Linguistics: ACL 2023

Pre-trained abstractive summarization models can generate fluent summaries and achieve high ROUGE scores. Previous research has found that these models often generate summaries that are inconsistent with their context document and contain nonfactual information. To evaluate factuality in document summarization, a document-level Natural Language Inference (NLI) classifier can be used. However, training such a classifier requires large-scale high-quality factual and nonfactual samples. To that end, we introduce NonFactS, a data generation model, to synthesize nonfactual summaries given a context document and a human-annotated (reference) factual summary. Compared to previous methods, our nonfactual samples are more abstractive and more similar to their corresponding factual samples, resulting in state-of-the-art performance on two factuality evaluation benchmarks, FALSESUM and SUMMAC. Our experiments demonstrate that even without human-annotated summaries, NonFactS can use random sentences to generate nonfactual summaries and a classifier trained on these samples generalizes to out-of-domain documents.

pdf abs
Ask Language Model to Clean Your Noisy Translation Data
Quinten Bolding | Baohao Liao | Brandon Denis | Jun Luo | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2023

TTransformer models have demonstrated remarkable performance in neural machine translation (NMT). However, their vulnerability to noisy input poses a significant challenge in practical implementation, where generating clean output from noisy input is crucial. The MTNT dataset is widely used as a benchmark for evaluating the robustness of NMT models against noisy input. Nevertheless, its utility is limited due to the presence of noise in both the source and target sentences. To address this limitation, we focus on cleaning the noise from the target sentences in MTNT, making it more suitable as a benchmark for noise evaluation. Leveraging the capabilities of large language models (LLMs), we observe their impressive abilities in noise removal. For example, they can remove emojis while considering their semantic meaning. Additionally, we show that LLM can effectively rephrase slang, jargon, and profanities. The resulting datasets, called C-MTNT, exhibit significantly less noise in the target sentences while preserving the semantic integrity of the original sentences. Our human and GPT-4 evaluations also lead to a consistent conclusion that LLM performs well on this task. Lastly, experiments on C-MTNT showcased its effectiveness in evaluating the robustness of NMT models, highlighting the potential of advanced language models for data cleaning and emphasizing C-MTNT as a valuable resource.

pdf abs
Viewing Knowledge Transfer in Multilingual Machine Translation Through a Representational Lens
David Stap | Vlad Niculae | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2023

We argue that translation quality alone is not a sufficient metric for measuring knowledge transfer in multilingual neural machine translation. To support this claim, we introduce Representational Transfer Potential (RTP), which measures representational similarities between languages. We show that RTP can measure both positive and negative transfer (interference), and find that RTP is strongly correlated with changes in translation quality, indicating that transfer does occur. Furthermore, we investigate data and language characteristics that are relevant for transfer, and find that multi-parallel overlap is an important yet under-explored feature. Based on this, we develop a novel training scheme, which uses an auxiliary similarity loss that encourages representations to be more invariant across languages by taking advantage of multi-parallel data. We show that our method yields increased translation quality for low- and mid-resource languages across multiple data and model setups.

pdf abs
Aligning Predictive Uncertainty with Clarification Questions in Grounded Dialog
Kata Naszadi | Putra Manggala | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2023

Asking for clarification is fundamental to effective collaboration. An interactive artificial agent must know when to ask a human instructor for more information in order to ascertain their goals. Previous work bases the timing of questions on supervised models learned from interactions between humans. Instead of a supervised classification task, we wish to ground the need for questions in the acting agent’s predictive uncertainty. In this work, we investigate if ambiguous linguistic instructions can be aligned with uncertainty in neural models. We train an agent using the T5 encoder-decoder architecture to solve the Minecraft Collaborative Building Task and identify uncertainty metrics that achieve better distributional separation between clear and ambiguous instructions. We further show that well-calibrated prediction probabilities benefit the detection of ambiguous instructions. Lastly, we provide a novel empirical analysis on the relationship between uncertainty and dialog history length and highlight an important property that poses a difficulty for detection.

pdf bib
Proceedings of the Eighth Conference on Machine Translation
Philipp Koehn | Barry Haddow | Tom Kocmi | Christof Monz
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

pdf abs
UvA-MT’s Participation in the WMT 2023 General Translation Shared Task
Di Wu | Shaomu Tan | David Stap | Ali Araabi | Christof Monz
Proceedings of the Eighth Conference on Machine Translation

This paper describes the UvA-MT’s submission to the WMT 2023 shared task on general machine translation. We participate in the constrained track in two directions: English ↔ Hebrew. In this competition, we show that by using one model to handle bidirectional tasks, as a minimal setting of Multilingual Machine Translation (MMT), it is possible to achieve comparable results with that of traditional bilingual translation for both directions. By including effective strategies, like back-translation, re-parameterized embedding table, and task-oriented fine-tuning, we obtained competitive final results in the automatic evaluation for both English → Hebrew and Hebrew → English directions.

pdf abs
Parameter-Efficient Fine-Tuning without Introducing New Latency
Baohao Liao | Yan Meng | Christof Monz
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Parameter-efficient fine-tuning (PEFT) of pre-trained language models has recently demonstrated remarkable achievements, effectively matching the performance of full fine-tuning while utilizing significantly fewer trainable parameters, and consequently addressing the storage and communication constraints. Nonetheless, various PEFT methods are limited by their inherent characteristics. In the case of sparse fine-tuning, which involves modifying only a small subset of the existing parameters, the selection of fine-tuned parameters is task- and domain-specific, making it unsuitable for federated learning. On the other hand, PEFT methods with adding new parameters typically introduce additional inference latency. In this paper, we demonstrate the feasibility of generating a sparse mask in a task-agnostic manner, wherein all downstream tasks share a common mask. Our approach, which relies solely on the magnitude information of pre-trained parameters, surpasses existing methodologies by a significant margin when evaluated on the GLUE benchmark. Additionally, we introduce a novel adapter technique that directly applies the adapter to pre-trained parameters instead of the hidden representation, thereby achieving identical inference speed to that of full fine-tuning. Through extensive experiments, our proposed method attains a new state-of-the-art outcome in terms of both performance and storage efficiency, storing only 0.03% parameters of full fine-tuning.

pdf abs
Multilingual k-Nearest-Neighbor Machine Translation
David Stap | Christof Monz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

k-nearest-neighbor machine translation has demonstrated remarkable improvements in machine translation quality by creating a datastore of cached examples. However, these improvements have been limited to high-resource language pairs, with large datastores, and remain a challenge for low-resource languages. In this paper, we address this issue by combining representations from multiple languages into a single datastore. Our results consistently demonstrate substantial improvements not only in low-resource translation quality (up to +3.6 BLEU), but also for high-resource translation quality (up to +0.5 BLEU). Our experiments show that it is possible to create multilingual datastores that are a quarter of the size, achieving a 5.3x speed improvement, by using linguistic similarities for datastore creation.

pdf abs
Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation
Di Wu | Christof Monz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Using a shared vocabulary is common practice in Multilingual Neural Machine Translation (MNMT). In addition to its simple design, shared tokens play an important role in positive knowledge transfer, which manifests naturally when the shared tokens refer to similar meanings across languages. However, when words overlap is small, e.g., using different writing systems, transfer is inhibited. In this paper, we propose a re-parameterized method for building embeddings to alleviate this problem. More specifically, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) the semantics of embeddings are better aligned across languages, 2) our method achieves evident BLEU improvements on high- and low-resource MNMT, and 3) only less than 1.0% additional trainable parameters are required with a limited increase in computational costs, while the inference time is identical to baselines.

pdf abs
Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance
Shaomu Tan | Christof Monz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Multilingual Neural Machine Translation (MNMT) facilitates knowledge sharing but often suffers from poor zero-shot (ZS) translation qualities. While prior work has explored the causes of overall low zero-shot translation qualities, our work introduces a fresh perspective: the presence of significant variations in zero-shot performance. This suggests that MNMT does not uniformly exhibit poor zero-shot capability; instead, certain translation directions yield reasonable results. Through systematic experimentation, spanning 1,560 language directions across 40 languages, we identify three key factors contributing to high variations in ZS NMT performance: 1) target-side translation quality, 2) vocabulary overlap, and 3) linguistic properties. Our findings highlight that the target side translation quality is the most influential factor, with vocabulary overlap consistently impacting zero-shot capabilities. Additionally, linguistic properties, such as language family and writing system, play a role, particularly with smaller models. Furthermore, we suggest that the off-target issue is a symptom of inadequate performance, emphasizing that zero-shot translation challenges extend beyond addressing the off-target problem. To support future research, we release the data and models as a benchmark for the study of ZS NMT.

pdf bib abs
Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables
Ali Araabi | Vlad Niculae | Christof Monz
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Despite the tremendous success of Neural Machine Translation (NMT), its performance on low- resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout, that addresses the challenge of low-resource neural machine translation by substituting phrases with variables, resulting in significant enhancement of compositionality, which is a key aspect of generalization. We observe a substantial improvement in translation quality for language pairs with minimal resources, as seen in BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis, and find Joint Dropout to also enhance generalizability of low-resource NMT in terms of robustness and adaptability across different domains.

2022

pdf abs
Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
Baohao Liao | David Thulke | Sanjika Hewavitharana | Hermann Ney | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2022

The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a later layer, to reduce the sequence length for earlier layers and make the pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) The gathering of contextualized information from unmasked tokens can be conducted with a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.

pdf abs
How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Ali Araabi | Christof Monz | Vlad Niculae
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) have long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE) which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word-level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named-entities and when the languages involved are linguistically close to each other.

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment and (DA) and a combination of DA and scalar quality metric (DA+SQM).

2021

pdf abs
NLQuAD: A Non-Factoid Long Question Answering Data Set
Amir Soleimani | Christof Monz | Marcel Worring
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce NLQuAD, the first data set with baseline methods for non-factoid long question answering, a task requiring document-level language understanding. In contrast to existing span detection question answering data sets, NLQuAD has non-factoid questions that are not answerable by a short span of text and demanding multiple-sentence descriptive answers and opinions. We show the limitation of the F1 score for evaluation of long answers and introduce Intersection over Union (IoU), which measures position-sensitive overlap between the predicted and the target answer spans. To establish baseline performances, we compare BERT, RoBERTa, and Longformer models. Experimental results and human evaluations show that Longformer outperforms the other architectures, but results are still far behind a human upper bound, leaving substantial room for improvements. NLQuAD’s samples exceed the input limitation of most pre-trained Transformer-based models, encouraging future research on long sequence language models.

This paper presents the results of the newstranslation task, the multilingual low-resourcetranslation for Indo-European languages, thetriangular translation task, and the automaticpost-editing task organised as part of the Con-ference on Machine Translation (WMT) 2021.In the news task, participants were asked tobuild machine translation systems for any of10 language pairs, to be evaluated on test setsconsisting mainly of news stories. The taskwas also opened up to additional test suites toprobe specific aspects of translation.

2020

pdf abs
Optimizing Transformer for Low-Resource Neural Machine Translation
Ali Araabi | Christof Monz
Proceedings of the 28th International Conference on Computational Linguistics

Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has become the de facto mainstream architecture, its capability under low-resource conditions has not been fully investigated yet. Our experiments on different subsets of the IWSLT14 training data show that the effectiveness of Transformer under low-resource conditions is highly dependent on the hyper-parameter settings. Our experiments show that using an optimized Transformer for low-resource conditions improves the translation quality up to 7.3 BLEU points compared to using the Transformer default settings.

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

pdf abs
The Unreasonable Volatility of Neural Machine Translation Models
Marzieh Fadaee | Christof Monz
Proceedings of the Fourth Workshop on Neural Generation and Translation

Recent works have shown that Neural Machine Translation (NMT) models achieve impressive performance, however, questions about understanding the behavior of these models remain unanswered. We investigate the unexpected volatility of NMT models where the input is semantically and syntactically correct. We discover that with trivial modifications of source sentences, we can identify cases where unexpected changes happen in the translation and in the worst case lead to mistranslations. This volatile behavior of translating extremely similar sentences in surprisingly different ways highlights the underlying generalization problem of current NMT models. We find that both RNN and Transformer models display volatile behavior in 26% and 19% of sentence variations, respectively.

2019

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

pdf
An Intrinsic Nearest Neighbor Analysis of Neural Machine Translation Architectures
Hamidreza Ghader | Christof Monz
Proceedings of Machine Translation Summit XVII: Research Track

pdf
Improving Neural Machine Translation Using Noisy Parallel Data through Distillation
Praveen Dakwale | Christof Monz
Proceedings of Machine Translation Summit XVII: Research Track

2018

pdf abs
Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation
Marzieh Fadaee | Christof Monz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data using a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this work, we explore different aspects of back-translation, and show that words with high prediction loss during training benefit most from the addition of synthetic data. We introduce several variations of sampling strategies targeting difficult-to-predict words using prediction losses and frequencies of words. In addition, we also target the contexts of difficult words and sample sentences that are similar in context. Experimental results for the WMT news translation task show that our method improves translation quality by up to 1.7 and 1.2 Bleu points over back-translation using random sampling for German-English and English-German, respectively.

pdf abs
The Importance of Being Recurrent for Modeling Hierarchical Structure
Ke Tran | Arianna Bisazza | Christof Monz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent work has shown that recurrent neural networks (RNNs) can implicitly capture and exploit hierarchical information when trained to solve common natural language processing tasks (Blevins et al., 2018) such as language modeling (Linzen et al., 2016; Gulordava et al., 2018) and neural machine translation (Shi et al., 2016). In contrast, the ability to model structured data with non-recurrent neural networks has received little attention despite their success in many NLP tasks (Gehring et al., 2017; Vaswani et al., 2017). In this work, we compare the two architectures—recurrent versus non-recurrent—with respect to their ability to model hierarchical structure and find that recurrency is indeed important for this purpose. The code and data used in our experiments is available at https://github.com/ketranm/fan_vs_rnn

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any of 7 language pairs in both directions, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. This year, we also opened up the task to additional test sets to probe specific aspects of translation.

pdf
Examining the Tip of the Iceberg: A Data Set for Idiom Translation
Marzieh Fadaee | Arianna Bisazza | Christof Monz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Evaluation of Machine Translation Performance Across Multiple Genres and Languages
Marlies van der Wees | Arianna Bisazza | Christof Monz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Learning Topic-Sensitive Word Representations
Marzieh Fadaee | Arianna Bisazza | Christof Monz
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Distributed word representations are widely used for modeling words in NLP tasks. Most of the existing models generate one representation per word and do not consider different meanings of a word. We present two approaches to learn multiple topic-sensitive representations per word by using Hierarchical Dirichlet Process. We observe that by modeling topics and integrating topic distributions for each document we obtain representations that are able to distinguish between different meanings of a given word. Our models yield statistically significant improvements for the lexical substitution task indicating that commonly used single word representations, even when combined with contextual information, are insufficient for this task.

pdf abs
Data Augmentation for Low-Resource Neural Machine Translation
Marzieh Fadaee | Arianna Bisazza | Christof Monz
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.

pdf abs
Dynamic Data Selection for Neural Machine Translation
Marlies van der Wees | Arianna Bisazza | Christof Monz
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Intelligent selection of training data has proven a successful technique to simultaneously increase training efficiency and translation performance for phrase-based machine translation (PBMT). With the recent increase in popularity of neural machine translation (NMT), we explore in this paper to what extent and how NMT can also benefit from data selection. While state-of-the-art data selection (Axelrod et al., 2011) consistently performs well for PBMT, we show that gains are substantially lower for NMT. Next, we introduce ‘dynamic data selection’ for NMT, a method in which we vary the selected subset of training data between different training epochs. Our experiments show that the best results are achieved when applying a technique we call ‘gradual fine-tuning’, with improvements up to +2.6 BLEU over the original data selection approach and up to +3.1 BLEU over a general baseline.

pdf abs
What does Attention in Neural Machine Translation Pay Attention to?
Hamidreza Ghader | Christof Monz
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Attention in neural machine translation provides the possibility to encode relevant parts of the source sentence at each translation step. As a result, attention is considered to be an alignment model as well. However, there is no work that specifically studies attention and provides analysis of what is being learned by attention models. Thus, the question still remains that how attention is similar or different from the traditional alignment. In this paper, we provide detailed analysis of attention and compare it to traditional alignment. We answer the question of whether attention is only capable of modelling translational equivalent or it captures more information. We show that attention is different from alignment in some cases and is capturing useful information other than alignments.

pdf
Fine-Tuning for Neural Machine Translation with Limited Degradation across In- and Out-of-Domain Data
Praveen Dakwale | Christof Monz
Proceedings of Machine Translation Summit XVI: Research Track

2016

pdf
Improving Statistical Machine Translation Performance by Oracle-BLEU Model Re-estimation
Praveen Dakwale | Christof Monz
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Recurrent Memory Networks for Language Modeling
Ke Tran | Arianna Bisazza | Christof Monz
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf abs
Which Words Matter in Defining Phrase Reordering Behavior in Statistical Machine Translation?
Hamidreza Ghader | Christof Monz
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

Lexicalized and hierarchical reordering models use relative frequencies of fully lexicalized phrase pairs to learn phrase reordering distributions. This results in unreliable estimation for infrequent phrase pairs which also tend to be longer phrases. There are some smoothing techniques used to smooth the distributions in these models. But these techniques are unable to address the similarities between phrase pairs and their reordering distributions. We propose two models to use shorter sub-phrase pairs of an original phrase pair to smooth the phrase reordering distributions. In the first model we follow the classic idea of backing off to shorter histories commonly used in language model smoothing. In the second model, we use syntactic dependencies to identify the most relevant words in a phrase to back off to. We show how these models can be easily applied to existing lexicalized and hierarchical reordering models. Our models achieve improvements of up to 0.40 BLEU points in Chinese-English translation compared to a baseline which uses a regular lexicalized reordering model and a hierarchical reordering model. The results show that not all the words inside a phrase pair are equally important in defining phrase reordering behavior and shortening towards important words will decrease the sparsity problem for long phrase pairs.

pdf abs
A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation
Marlies van der Wees | Arianna Bisazza | Christof Monz
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

A major challenge for statistical machine translation (SMT) of Arabic-to-English user-generated text is the prevalence of text written in Arabizi, or Romanized Arabic. When facing such texts, a translation system trained on conventional Arabic-English data will suffer from extremely low model coverage. In addition, Arabizi is not regulated by any official standardization and therefore highly ambiguous, which prevents rule-based approaches from achieving good translation results. In this paper, we improve Arabizi-to-English machine translation by presenting a simple but effective Arabizi-to-Arabic transliteration pipeline that does not require knowledge by experts or native Arabic speakers. We incorporate this pipeline into a phrase-based SMT system, and show that translation quality after automatically transliterating Arabizi to Arabic yields results that are comparable to those achieved after human transliteration.

pdf abs
Ensemble Learning for Multi-Source Neural Machine Translation
Ekaterina Garmash | Christof Monz
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we describe and evaluate methods to perform ensemble prediction in neural machine translation (NMT). We compare two methods of ensemble set induction: sampling parameter initializations for an NMT system, which is a relatively established method in NMT (Sutskever et al., 2014), and NMT systems translating from different source languages into the same target language, i.e., multi-source ensembles, a method recently introduced by Firat et al. (2016). We are motivated by the observation that for different language pairs systems make different types of mistakes. We propose several methods with different degrees of parameterization to combine individual predictions of NMT systems so that they mutually compensate for each other’s mistakes and improve overall performance. We find that the biggest improvements can be obtained from a context-dependent weighting scheme for multi-source ensembles. This result offers stronger support for the linguistic motivation of using multi-source ensembles than previous approaches. Evaluation is carried out for German and French into English translation. The best multi-source ensemble method achieves an improvement of up to 2.2 BLEU points over the strongest single-source ensemble baseline, and a 2 BLEU improvement over a multi-source ensemble baseline.

pdf abs
Measuring the Effect of Conversational Aspects on Machine Translation Quality
Marlies van der Wees | Arianna Bisazza | Christof Monz
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Research in statistical machine translation (SMT) is largely driven by formal translation tasks, while translating informal text is much more challenging. In this paper we focus on SMT for the informal genre of dialogues, which has rarely been addressed to date. Concretely, we investigate the effect of dialogue acts, speakers, gender, and text register on SMT quality when translating fictional dialogues. We first create and release a corpus of multilingual movie dialogues annotated with these four dialogue-specific aspects. When measuring translation performance for each of these variables, we find that BLEU fluctuations between their categories are often significantly larger than randomly expected. Following this finding, we hypothesize and show that SMT of fictional dialogues benefits from adaptation towards dialogue acts and registers. Finally, we find that male speakers are harder to translate and use more vulgar language than female speakers, and that vulgarity is often not preserved during translation.

This paper describes a method that successfully exploits simple syntactic features for n-best translation candidate reranking using perceptrons. Our approach uses discriminative language modelling to rerank the n-best translations generated by a statistical machine translation system. The performance is evaluated for Arabic-to-English translation using NIST’s MT-Eval benchmarks. Whilst parse trees do not consistently help, we show how features extracted from a simple Part-of-Speech annotation layer outperform two competitive baselines, leading to significant BLEU improvements on three different test sets.

pdf
The QMUL system description for IWSLT 2010
Sirvan Yahyaei | Christof Monz
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

pdf
The UvA system description for IWSLT 2010
Spyros Martzoukos | Christof Monz
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

2009

pdf bib
Proceedings of the Fourth Workshop on Statistical Machine Translation
Chris Callison-Burch | Philipp Koehn | Christof Monz | Josh Schroeder
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Findings of the 2009 Workshop on Statistical Machine Translation
Chris Callison-Burch | Philipp Koehn | Christof Monz | Josh Schroeder
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf
Automatic Single-Document Key Fact Extraction from Newswire Articles
Itamar Kastner | Christof Monz
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf
Decoding by Dynamic Chunking for Statistical Machine Translation
Sirvan Yahyaei | Christof Monz
Proceedings of Machine Translation Summit XII: Papers

2008

pdf abs
TheQMUL system description for IWSLT 2008.
Simon Carter | Christof Monz | Sirvan Yahyaei
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

The QMUL system to the IWSLT 2008 evaluation campaign is a phrase-based statistical MT system implemented in C++. The decoder employs a multi-stack architecture, and uses a beam to manage the search space. We participated in both BTEC Arabic → English and Chinese → English tracks, as well as the PIVOT task. In our first submission to IWSLT, we are particularly interested in seeing how our SMT system performs with speech input, having so far only worked with and translated newswire data sets.

pdf bib
Proceedings of the Third Workshop on Statistical Machine Translation
Chris Callison-Burch | Philipp Koehn | Christof Monz | Josh Schroeder | Cameron Shaw Fordyce
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Further Meta-Evaluation of Machine Translation
Chris Callison-Burch | Cameron Fordyce | Philipp Koehn | Christof Monz | Josh Schroeder
Proceedings of the Third Workshop on Statistical Machine Translation

2007

pdf bib
Proceedings of the Second Workshop on Statistical Machine Translation
Chris Callison-Burch | Philipp Koehn | Cameron Shaw Fordyce | Christof Monz
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
(Meta-) Evaluation of Machine Translation
Chris Callison-Burch | Cameron Fordyce | Philipp Koehn | Christof Monz | Josh Schroeder
Proceedings of the Second Workshop on Statistical Machine Translation

2006

pdf bib
Proceedings on the Workshop on Statistical Machine Translation
Philipp Koehn | Christof Monz
Proceedings on the Workshop on Statistical Machine Translation

pdf
Manual and Automatic Evaluation of Machine Translation between European Languages
Philipp Koehn | Christof Monz
Proceedings on the Workshop on Statistical Machine Translation

pdf abs
Challenges in Building an Arabic-English GHMT System with SMT Components
Nizar Habash | Bonnie Dorr | Christof Monz
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

The research context of this paper is developing hybrid machine translation (MT) systems that exploit the advantages of linguistic rule-based and statistical MT systems. Arabic, as a morphologically rich language, is especially challenging even without addressing the hybridization question. In this paper, we describe the challenges in building an Arabic-English generation-heavy machine translation (GHMT) system and boosting it with statistical machine translation (SMT) components. We present an extensive evaluation of multiple system variants and report positive results on the advantages of hybridization.