Hieu Hoang

2025

Effects of automatic alignment on speech translation metrics
Matt Post | Hieu Hoang
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

Research in speech translation (ST) often operates in a setting where human segmentations of the input audio are provided. This simplifying assumption avoids the evaluation-time difficulty of aligning the translated outputs to their references for segment-level evaluation, but it also means that the systems are not evaluated as they will be used in production settings, where automatic audio segmentation is an unavoidable component. A tool, mwerSegmenter, exists for aligning ST output to references, but its behavior is noisy and not well understood. We address this with an investigation of the effects of automatic alignment on metric correlation with system-level human judgments; that is, as a metrics task. Using the eleven language tasks from the WMT24 data, we merge each system’s output at the domain level, align them to the references, compute metrics, and evaluate the correlation with the human system-level rankings. In addition to expanding analysis to many target languages, we also experiment with different subword models and with the generation of additional paraphrases. We find that automatic realignment has minimal effect on COMET-level system rankings, with accuracies still well above BLEU scores from manual segmentations. In the process, we also bring the community’s attention to the source code for the tool, which we have updated, modernized, and realized as a Python module, mweralign.

2024

On-the-Fly Fusion of Large Language Models and Machine Translation
Hieu Hoang | Huda Khayrallah | Marcin Junczys-Dowmunt
Findings of the Association for Computational Linguistics: NAACL 2024

We propose on-the-fly ensembling of a neural machine translation (NMT) model with a large language model (LLM), prompted on the same task and input. Through experiments on 4 language directions with varying data amounts, we find that a slightly weaker-at-translation LLM can improve translations of an NMT model, and such an ensemble can produce better translations than ensembling two stronger NMT models. We demonstrate that our ensemble method can be combined with various techniques from LLM prompting, such as in-context learning and translation context.

2022

Revisiting Locality Sensitive Hashing for Vocabulary Selection in Fast Neural Machine Translation
Hieu Hoang | Marcin Junczys-Dowmunt | Roman Grundkiewicz | Huda Khayrallah
Proceedings of the Seventh Conference on Machine Translation (WMT)

Neural machine translation models often contain large target vocabularies. The calculation of logits, softmax and beam search is computationally costly over so many classes. We investigate the use of locality sensitive hashing (LSH) to reduce the number of vocabulary items that must be evaluated and explore the relationship between the hashing algorithm, translation speed and quality. Compared to prior work, our LSH-based solution does not require additional augmentation via word-frequency lists or alignments. We propose a training procedure that produces models which, when combined with our LSH inference algorithm, increase translation speed by up to 87% over the baseline, while maintaining translation quality as measured by BLEU. Apart from just using BLEU, we focus on minimizing search errors compared to the full softmax, a much harsher quality criterion.

2020

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

2019

ParaCrawl: Web-scale parallel corpora for the languages of the EU
Miquel Esplà | Mikel Forcada | Gema Ramírez-Sánchez | Hieu Hoang
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Fast Neural Machine Translation Implementation
Hieu Hoang | Tomasz Dwojak | Rihards Krislauks | Daniel Torregrosa | Kenneth Heafield
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions to the efficiency track for GPUs at the Workshop for Neural Machine Translation and Generation by members of the University of Edinburgh, Adam Mickiewicz University, Tilde and University of Alicante. We focus on efficient implementation of the recurrent deep-learning model as implemented in Amun, the fast inference engine for neural machine translation. We improve the performance with an efficient mini-batching algorithm, and by fusing the softmax operation with the k-best extraction algorithm. Submissions using Amun were first, second and third fastest in the GPU efficiency track.

Marian: Cost-effective High-Quality Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Kenneth Heafield | Hieu Hoang | Roman Grundkiewicz | Anthony Aue
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions of the “Marian” team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.

2017

A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages
Nizar Habash | Nasser Zalmout | Dima Taji | Hieu Hoang | Maverick Alzate
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.

2016

Fast, Scalable Phrase-Based SMT Decoding
Hieu Hoang | Nikolay Bogoychev | Lane Schwartz | Marcin Junczys-Dowmunt
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

The utilization of statistical machine translation (SMT) has grown enormously over the last decade, with many users relying on open-source software developed by the NLP community. As commercial use has increased, there is a need for software that is optimized for commercial requirements, in particular, fast phrase-based decoding and more efficient utilization of modern multicore servers. In this paper we re-examine the major components of phrase-based decoding and decoder implementation with particular emphasis on speed and scalability on multicore machines. The result is a drop-in replacement for the Moses decoder which is up to fifteen times faster and scales monotonically with the number of cores.

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions
Marcin Junczys-Dowmunt | Tomasz Dwojak | Hieu Hoang
Proceedings of the 13th International Conference on Spoken Language Translation

In this paper we provide the largest published comparison of translation quality for phrase-based SMT and neural machine translation across 30 translation directions. For ten directions we also include hierarchical phrase-based MT. Experiments are performed for the recently published United Nations Parallel Corpus v1.0 and its large six-way sentence-aligned subcorpus. In the second part of the paper we investigate aspects of translation speed, introducing AmuNMT, our efficient neural machine translation decoder. We demonstrate that current neural machine translation could already be used for in-production systems when comparing words-per-second ratios.

Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

If you are interested in open-source machine translation but lack hands-on experience, this is the tutorial for you! We will start with background knowledge of statistical machine translation and then walk you through the process of installing and running an SMT system. We will show you how to prepare input data, and the most efficient way to train and use your translation systems. We shall also discuss solutions to some of the most common issues that face LSPs when using SMT, including how to tailor systems to specific clients, preserving document layout and formatting, and efficient ways of incorporating new translation memories. Previous years’ participants have included software engineers and managers who need to have a detailed understanding of the SMT process. This is a fast-paced, hands-on tutorial that will cover the skills you need to get you up and running with open-source SMT. The teaching will be based on the Moses toolkit, the most popular open-source machine translation software currently available. No prior knowledge of MT is necessary, only an interest in it. A laptop is required for this tutorial, and you should have rudimentary knowledge of using the command line on Windows or Linux.

2011

Left language model state for syntactic machine translation
Kenneth Heafield | Hieu Hoang | Philipp Koehn | Tetsuo Kiso | Marcello Federico
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

Many syntactic machine translation decoders, including Moses, cdec, and Joshua, implement bottom-up dynamic programming to integrate N-gram language model probabilities into hypothesis scoring. These decoders concatenate hypotheses according to grammar rules, yielding larger hypotheses and eventually complete translations. When hypotheses are concatenated, the language model score is adjusted to account for boundary-crossing n-grams. Words on the boundary of each hypothesis are encoded in state, consisting of left state (the first few words) and right state (the last few words). We speed concatenation by encoding left state using data structure pointers in lieu of vocabulary indices and by avoiding unnecessary queries. To increase the decoder’s opportunities to recombine hypotheses, we minimize the number of words encoded by left state. This has the effect of reducing search errors made by the decoder. The resulting gain in model score is smaller than for right state minimization, which we explain by observing a relationship between state minimization and language model probability. With a fixed cube pruning pop limit, we show a 3-6% reduction in CPU time and improved model scores. Reducing the pop limit to the point where model scores tie the baseline yields a net 11% reduction in CPU time.

2010

Machine Translation with Open source Software
Philipp Koehn | Hieu Hoang
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Tutorials

More Linguistic Annotation for Statistical Machine Translation
Philipp Koehn | Barry Haddow | Philip Williams | Hieu Hoang
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Improved Translation with Source Syntax Labels
Hieu Hoang | Philipp Koehn
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation
Hieu Hoang | Philipp Koehn | Adam Lopez
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers

Despite many differences between phrase-based, hierarchical, and syntax-based translation models, their training and testing pipelines are strikingly similar. Drawing on this fact, we extend the Moses toolkit to implement hierarchical and syntactic models, making it the first open source toolkit with end-to-end support for all three of these popular models in a single package. This extension substantially lowers the barrier to entry for machine translation research across multiple models.

Improving Mid-Range Re-Ordering Using Templates of Factors
Hieu Hoang | Philipp Koehn
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

A Systematic Analysis of Translation Model Search Spaces
Michael Auli | Adam Lopez | Hieu Hoang | Philipp Koehn
Proceedings of the Fourth Workshop on Statistical Machine Translation