Michel Simard

2023

pdf abs
Terminology in Neural Machine Translation: A Case Study of the Canadian Hansard
Rebecca Knowles | Samuel Larkin | Marc Tessier | Michel Simard
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Incorporating terminology into a neural machine translation (NMT) system is a feature of interest for many users of machine translation. In this case study of English-French Canadian Parliamentary text, we examine the performance of standard NMT systems at handling terminology and consider the tradeoffs between potential performance improvements and the efforts required to maintain terminological resources specifically for NMT.

2022

pdf abs
Refining an Almost Clean Translation Memory Helps Machine Translation
Shivendra Bhardwa | David Alfonso-Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

While recent studies have been dedicated to cleaning very noisy parallel corpora to improve Machine Translation training, we focus in this work on filtering a large and mostly clean Translation Memory. This problem of practical interest has not received much consideration from the community, in contrast with, for example, filtering large web-mined parallel corpora. We experiment with an extensive, multi-domain proprietary Translation Memory and compare five approaches involving deep-, feature-, and heuristic-based solutions. We propose two ways of evaluating this task, manual annotation and resulting Machine Translation quality. We report significant gains over a state-of-the-art, off-the-shelf cleaning system, using two MT engines.

2021

pdf abs
Like Chalk and Cheese? On the Effects of Translationese in MT Training
Samuel Larkin | Michel Simard | Rebecca Knowles
Proceedings of Machine Translation Summit XVIII: Research Track

We revisit the topic of translation direction in the data used for training neural machine translation systems and focusing on a real-world scenario with known translation direction and imbalances in translation direction: the Canadian Hansard. According to automatic metrics and we observe that using parallel data that was produced in the “matching” translation direction (Authentic source and translationese target) improves translation quality. In cases of data imbalance in terms of translation direction and we find that tagging of translation direction can close the performance gap. We perform a human evaluation that differs slightly from the automatic metrics and but nevertheless confirms that for this French-English dataset that is known to contain high-quality translations and authentic or tagged mixed source improves over translationese source for training.

2020

Deep neural models tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain .

bib
Workshop on the Impact of Machine Translation (iMpacT 2020)
Sharon O'Brien | Michel Simard
Workshop on the Impact of Machine Translation (iMpacT 2020)

2019

pdf abs
Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data
Chi-kiu Lo | Michel Simard
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

2018

pdf abs
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

pdf abs
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our semantic textual similarity approach in filtering a noisy web crawled parallel corpus using YiSi—a novel semantic machine translation evaluation metric. The systems mainly based on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in 100-million-word evaluation, 8th place in 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best performing system—NRC-yisi-bicov is one of the only four submissions ranked top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt in automatically synthesizing a noisy parallel development corpus for tuning the weights to combine different parallelism and fluency features.

Users of Statistical Machine Translation (SMT) sometimes turn to the Web to obtain data to train their systems. One problem with this approach is the potential for “MT contamination”: when large amounts of parallel data are collected automatically, there is a risk that a non-negligible portion consists of machine-translated text. Theoretically, using this kind of data to train SMT systems is likely to reinforce the errors committed by other systems, or even by an earlier versions of the same system. In this paper, we study the effect of MT-contaminated training data on SMT quality, by performing controlled simulations under a wide range of conditions. Our experiments highlight situations in which MT contamination can be harmful, and assess the potential of decontamination techniques.

pdf
CNRC-TMT: Second Language Writing Assistant System Description
Cyril Goutte | Michel Simard | Marine Carpuat
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

pdf
PEPr: Post-Edit Propagation Using Phrase-based Statistical Machine Translation
Michel Simard | George Foster
Proceedings of Machine Translation Summit XIV: Papers

2012

pdf abs
A Poor Man’s Translation Memory Using Machine Translation Evaluation Metrics
Michel Simard | Atsushi Fujita
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

We propose straightforward implementations of translation memory (TM) functionality for research purposes, using machine translation evaluation metrics as similarity functions. Experiments under various conditions demonstrate the effectiveness of the approach, but also highlight problems in evaluating the results using an MT evaluation methodology.

pdf bib
Workshop on Post-Editing Technology and Practice
Sharon O'Brien | Michel Simard | Lucia Specia
Workshop on Post-Editing Technology and Practice

pdf
Book Review: Bitext Alignment by Jörg Tiedemann
Michel Simard
Computational Linguistics, Volume 38, Issue 2 - June 2012

pdf
The Trouble with SMT Consistency
Marine Carpuat | Michel Simard
Proceedings of the Seventh Workshop on Statistical Machine Translation

2009

pdf
Phrase-based Machine Translation in a Computer-assisted Translation Environment
Michel Simard | Pierre Isabelle
Proceedings of Machine Translation Summit XII: Papers

2007

pdf
NRC‘s PORTAGE System for WMT 2007
Nicola Ueffing | Michel Simard | Samuel Larkin | Howard Johnson
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
Rule-Based Translation with Statistical Phrase-Based Post-Editing
Michel Simard | Nicola Ueffing | Pierre Isabelle | Roland Kuhn
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
Statistical Phrase-Based Post-Editing
Michel Simard | Cyril Goutte | Pierre Isabelle
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf
Domain adaptation of MT systems through automatic post-editing
Pierre Isabelle | Cyril Goutte | Michel Simard
Proceedings of Machine Translation Summit XI: Papers

2006

2005

Cet article présente une méthode de traduction automatique statistique basée sur des segments non-continus, c’est-à-dire des segments formés de mots qui ne se présentent pas nécéssairement de façon contiguë dans le texte. On propose une méthode pour produire de tels segments à partir de corpus alignés au niveau des mots. On présente également un modèle de traduction statistique capable de tenir compte de tels segments, de même qu’une méthode d’apprentissage des paramètres du modèle visant à maximiser l’exactitude des traductions produites, telle que mesurée avec la métrique NIST. Les traductions optimales sont produites par le biais d’une recherche en faisceau. On présente finalement des résultats expérimentaux, qui démontrent comment la méthode proposée permet une meilleure généralisation à partir des données d’entraînement.

2003

We describe an experiment in rapid development of a statistical machine translation (SMT) system from scratch, using limited resources: under this heading we include not only training data, but also computing power, linguistic knowledge, programming effort, and absolute time.

pdf
Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
Wessel Kraaij | Jian-Yun Nie | Michel Simard
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

pdf abs
De la traduction probabiliste aux mémoires de traduction (ou l’inverse)
Philippe Langlais | Michel Simard
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

En dépit des travaux réalisés cette dernière décennie dans le cadre général de la traduction probabiliste, nous sommes toujours bien loin du jour où un engin de traduction automatique (probabiliste ou pas) sera capable de répondre pleinement aux besoins d’un traducteur professionnel. Dans une étude récente (Langlais, 2002), nous avons montré comment un engin de traduction probabiliste pouvait bénéficier de ressources terminologiques extérieures. Dans cette étude, nous montrons que les techniques de traduction probabiliste peuvent être utilisées pour extraire des informations sous-phrastiques d’une mémoire de traduction. Ces informations peuvent à leur tour s’avérer utiles à un engin de traduction probabiliste. Nous rapportons des résultats sur un corpus de test de taille importante en utilisant la mémoire de traduction d’un concordancier bilingue commercial.

pdf
Statistical Translation Alignment with Compositionality Constraints
Michel Simard | Philippe Langlais
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf
Translation Spotting for Translation Memories
Michel Simard
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

2002

pdf abs
Merging example-based and statistical machine translation: an experiment
Philippe Langlais | Michel Simard
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

Despite the exciting work accomplished over the past decade in the field of Statistical Machine Translation (SMT), we are still far from the point of being able to say that machine translation fully meets the needs of real-life users. In a previous study [6], we have shown how a SMT engine could benefit from terminological resources, especially when translating texts very different from those used to train the system. In the present paper, we discuss the opening of SMT to examples automatically extracted from a Translation Memory (TM). We report results on a fair-sized translation task using the database of a commercial bilingual concordancer.

2001

pdf abs
Sub-sentential exploitation of translation memories
Michel Simard | Philippe Langlais
Proceedings of Machine Translation Summit VIII

Translation memory systems (TMS) are a family of computer tools whose purpose is to facilitate and encourage the re-use of existing translations. By searching a database of past translations, these systems can retrieve the translation of whole segments of text and propose them to the translator for re-use. However, the usefulness of existing TMS’s is limited by the nature of the text segments that that they are able to put in correspondence, generally whole sentences. This article examines the potential of a type of system that is able to recuperate the translation of sub-sentential sequences of words.

pdf abs
Récupération de segments sous-phrastiques dans une mémoire de traduction
Philippe Langlais | Michel Simard
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

L’utilité des outils d’aide à la traduction reposant sur les mémoires de traduction est souvent limitée par la nature des segments que celles-ci mettent en correspondance, le plus souvent des phrases entières. Cet article examine le potentiel d’un type de système qui serait en mesure de récupérer la traduction de séquences de mots de longueur arbitraire.