Proyag Pal


2023

pdf
Cheating to Identify Hard Problems for Neural Machine Translation
Proyag Pal | Kenneth Heafield
Findings of the Association for Computational Linguistics: EACL 2023

We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models — we dub this the Fleetwood Mac problem.

2022

pdf
Cheat Codes to Quantify Missing Source Information in Neural Machine Translation
Proyag Pal | Kenneth Heafield
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper describes a method to quantify the amount of information H(t|s) added by the target sentence t that is not present in the source s in a neural machine translation system. We do this by providing the model the target sentence in a highly compressed form (a “cheat code”), and exploring the effect of the size of the cheat code. We find that the model is able to capture extra information from just a single float representation of the target and nearly reproduces the target with two 32-bit floats per target token.

2021

pdf
The University of Edinburgh’s Bengali-Hindi Submissions to the WMT21 News Translation Task
Proyag Pal | Alham Fikri Aji | Pinzhen Chen | Sukanta Sen
Proceedings of the Sixth Conference on Machine Translation

We describe the University of Edinburgh’s BengaliHindi constrained systems submitted to the WMT21 News Translation task. We submitted ensembles of Transformer models built with large-scale back-translation and fine-tuned on subsets of training data retrieved based on similarity to the target domain.