Philippe Langlais

Also published as: Phillippe Langlais

2023

On the utility of enhancing BERT syntactic bias with Token Reordering Pretraining
Yassir El Mesbahi | Atif Mahmud | Abbas Ghaddar | Mehdi Rezagholizadeh | Phillippe Langlais | Prasanna Parthasarathi
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Self-supervised Language Modelling (LM) objectives, such as BERT's masked LM (MLM), have become the default choice for pretraining language models. TOken Reordering (TOR) pretraining objectives, beyond token prediction, have not been extensively studied yet. In this work, we explore challenges that underlie the development and usefulness of such objectives on downstream language tasks. In particular, we design a novel TOR pretraining objective which predicts whether two tokens are adjacent or not given a partial bag-of-tokens input. In addition, we investigate the usefulness of a Graph Isomorphism Network (GIN), when placed on top of the BERT encoder, in order to enhance the overall model's ability to leverage topological signal from the encoded representations. We compare the language understanding abilities of TOR to those of MLM on word-order sensitive (e.g. dependency parsing) and insensitive (e.g. text classification) tasks in both full training and few-shot settings. Our results indicate that TOR is competitive with MLM on the GLUE language understanding benchmark, and slightly superior on syntax-dependent datasets, especially in the few-shot setting.
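As a rough illustration of the adjacency objective described in this abstract, the sketch below builds binary labels for token pairs. It assumes undirected adjacency and a complete (not partial) token list; the function name `adjacency_pairs` is hypothetical and the paper's actual input construction may differ.

```python
from itertools import combinations


def adjacency_pairs(tokens):
    """Label each pair of positions (i, j), i < j, with 1 if the two
    tokens are neighbours in the original order, else 0.

    Assumption: adjacency is undirected and every pair is enumerated.
    """
    positions = range(len(tokens))
    return [((i, j), int(j - i == 1)) for i, j in combinations(positions, 2)]


# Pairs and labels for a three-token sentence.
labels = dict(adjacency_pairs(["the", "cat", "sat"]))
```

A model trained on such pairs would see the tokens without their order and be asked to recover which pairs were neighbours.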

LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization
Peng Lu | Ahmad Rashid | Ivan Kobyzev | Mehdi Rezagholizadeh | Phillippe Langlais
Findings of the Association for Computational Linguistics: ACL 2023

Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, and batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, assumes that each non-target class is equally likely, regardless of the training instance. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate that LABO consistently yields improvement over conventional label regularization in various fields, including seven machine translation and three image classification tasks across various neural network architectures, while maintaining training efficiency.
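For readers unfamiliar with conventional label smoothing, the following minimal sketch shows the uniform variant the abstract contrasts with. It assumes the smoothing mass `epsilon` is split equally over the non-target classes (some formulations spread it over all classes instead); `smooth_labels` is a hypothetical helper, not the LABO method itself.

```python
def smooth_labels(target: int, num_classes: int, epsilon: float = 0.1) -> list[float]:
    """Uniform label smoothing for one training instance.

    Assumption: the target keeps 1 - epsilon and the remaining
    epsilon is shared equally by the other classes.
    """
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_target = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == target else off_target
            for k in range(num_classes)]


# Smoothed distribution for class 2 out of 5 classes.
dist = smooth_labels(target=2, num_classes=5, epsilon=0.2)
```

Every instance with the same target receives the same distribution here, which is the instance-independence the abstract points out.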

Is ChatGPT the ultimate Data Augmentation Algorithm?
Frédéric Piedboeuf | Philippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2023

In the aftermath of GPT-3.5, commonly known as ChatGPT, researchers have attempted to assess its capacity for lowering annotation cost, either by doing zero-shot learning, generating new data, or replacing human annotators. Some studies have also investigated its use for data augmentation (DA), but only in limited contexts, which still leaves open the question of how ChatGPT performs compared to state-of-the-art algorithms. In this paper, we use ChatGPT to create new data both with paraphrasing and with zero-shot generation, and compare it to seven other algorithms. We show that while ChatGPT performs exceptionally well on some simpler data, it overall does not perform better than the other algorithms, yet demands much greater involvement from the practitioner because ChatGPT often refuses to answer owing to sensitive content in the datasets.

RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap
Phillippe Langlais | Tianjian Lucas Gao
Proceedings of the 15th International Conference on Computational Semantics

Taxonomies are an essential knowledge representation, yet most studies on automatic taxonomy construction (ATC) resort to manual evaluation to score proposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose RaTE, an automatic label-free taxonomy scoring procedure, which relies on a large pre-trained language model. We apply our evaluation procedure to three state-of-the-art ATC algorithms with which we built seven taxonomies from the Yelp domain, and show that 1) RaTE correlates well with human judgments and 2) artificially degrading a taxonomy leads to a decreasing RaTE score.

2022

Refining an Almost Clean Translation Memory Helps Machine Translation
Shivendra Bhardwa | David Alfonso-Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

While recent studies have been dedicated to cleaning very noisy parallel corpora to improve Machine Translation training, we focus in this work on filtering a large and mostly clean Translation Memory. This problem of practical interest has not received much consideration from the community, in contrast with, for example, filtering large web-mined parallel corpora. We experiment with an extensive, multi-domain proprietary Translation Memory and compare five approaches involving deep-, feature-, and heuristic-based solutions. We propose two ways of evaluating this task, manual annotation and resulting Machine Translation quality. We report significant gains over a state-of-the-art, off-the-shelf cleaning system, using two MT engines.

Unsupervised multiple-choice question generation for out-of-domain Q&A fine-tuning
Guillaume Le Berre | Christophe Cerisara | Philippe Langlais | Guy Lapalme
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Pre-trained models have shown very good performance on a number of question answering benchmarks, especially when fine-tuned on multiple question answering datasets at once. In this work, we propose an approach for generating a fine-tuning dataset using a rule-based algorithm that generates questions and answers from unannotated sentences. We show that the state-of-the-art model UnifiedQA can greatly benefit from such a system on a multiple-choice benchmark about physics, biology and chemistry it has never been trained on. We further show that improved performance may be obtained by selecting the most challenging distractors (wrong answers), with a dedicated ranker based on a pretrained RoBERTa model.

A Methodology for Building a Diachronic Dataset of Semantic Shifts and its Application to QC-FR-Diac-V1.0, a Free Reference for French
David Kletz | Philippe Langlais | François Lareau | Patrick Drouin
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Different algorithms have been proposed to detect semantic shifts (changes in a word's meaning over time) in a diachronic corpus. Yet, and somewhat surprisingly, no reference corpus has been designed so far to evaluate them, leaving researchers to fall back on troublesome evaluation strategies. In this work, we introduce a methodology for the construction of a reference dataset for the evaluation of semantic shift detection, that is, a list of words for which we know for sure whether they present a change in meaning over a period of interest. We leverage a state-of-the-art word-sense disambiguation model to associate a date of first appearance to all the senses of a word. Significant changes in sense distributions as well as clear stability are detected and the resulting words are inspected by experts using a dedicated interface before populating a reference dataset. As a proof of concept, we apply this methodology to a corpus of newspapers from Quebec covering the whole 20th century. We manually verified a subset of candidates, leading to QC-FR-Diac-V1.0, a corpus of 151 words allowing one to evaluate the identification of semantic shifts in French between 1910 and 1990.

Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models) especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and engineering efforts required for setting up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which, intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection enforces that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. Also, we show that it acts as a regularizer for improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training-time.
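The random layer selection described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: one teacher layer is drawn per student layer, without replacement, and the chosen indices are sorted to preserve depth order; `sample_teacher_layers` is a hypothetical name, and the paper's exact mapping may differ.

```python
import random


def sample_teacher_layers(num_teacher_layers, num_student_layers, rng):
    """Draw one teacher layer index per student layer, at random.

    Assumptions: sampling is without replacement and indices are
    sorted so that deeper student layers match deeper teacher layers.
    """
    chosen = rng.sample(range(num_teacher_layers), num_student_layers)
    return sorted(chosen)


# A fresh mapping would typically be drawn at each training step or epoch.
rng = random.Random(0)
mapping = sample_teacher_layers(num_teacher_layers=12, num_student_layers=6, rng=rng)
```

Because the draw is repeated during training, every teacher layer is eventually used, while each step only distills as many layers as the student has.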

Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
Peng Lu | Ivan Kobyzev | Mehdi Rezagholizadeh | Ahmad Rashid | Ali Ghodsi | Phillippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly work on the improvement of the optimization procedure of the compact model towards better generalization. Recent works observe that the flatness of the local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures and demonstrate that our adaptation improves the generalization without extra computation cost. Moreover, we observe that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.
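The core step of weight averaging can be illustrated with a running mean over checkpoints. The sketch below treats a model as a flat list of floats and uses a hypothetical `swa_update` helper; it shows only the generic averaging rule, not the specific adaptation to PLM fine-tuning that the paper proposes.

```python
def swa_update(avg: list[float], new: list[float], n_averaged: int) -> list[float]:
    """Fold one more checkpoint into the running average of weights.

    avg holds the mean of the first n_averaged checkpoints; the
    result is the mean of n_averaged + 1 checkpoints.
    """
    return [(a * n_averaged + w) / (n_averaged + 1) for a, w in zip(avg, new)]


# Three toy checkpoints with two weights each (illustrative values).
checkpoints = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
averaged = checkpoints[0]
for n, weights in enumerate(checkpoints[1:], start=1):
    averaged = swa_update(averaged, weights, n)
```

The averaged weights are used for prediction in place of the final checkpoint, so inference cost is unchanged.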

About Evaluating Bilingual Lexicon Induction
Martin Laville | Emmanuel Morin | Phillippe Langlais
Proceedings of the BUCC Workshop within LREC 2022

With numerous new methods proposed recently, the evaluation of Bilingual Lexicon Induction has been quite hazardous and inconsistent across works. Some studies have proposed guidance to sanitize this; yet, it is not necessarily followed by practitioners. In this study, we try to gather these different recommendations and add our own, with the aim of proposing a unified evaluation protocol. We further show that the easiness of a benchmark, while being correlated to the proximity of the language pairs being considered, is even more conditioned on the graphical similarities within the test word pairs.

There is a growing body of work in recent years to develop pre-trained language models (PLMs) for the Arabic language. This work addresses two major problems in existing Arabic PLMs that limit the progress of the Arabic NLU and NLG fields. First, existing Arabic PLMs are not well-explored and their pre-training can be improved significantly using a more methodical approach. Second, there is a lack of systematic and reproducible evaluation of these models in the literature. We revisit both the pre-training and evaluation of Arabic PLMs. In terms of pre-training, we explore the impact of the quality of the pretraining data, the size of the model, and the incorporation of character-level information on Arabic PLMs. As a result, we release three new Arabic BERT-style models (JABER, Char-JABER, and SABER), and two T5-style models (AT5S and AT5B). In terms of evaluation, we conduct a comprehensive empirical study to systematically evaluate the performance of existing state-of-the-art models on ALUE, a leaderboard-powered benchmark for Arabic NLU tasks, and on a subset of the Arabic generative tasks. We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks. Our models and source code to reproduce results will be made available upon acceptance.

Effective Data Augmentation for Sentence Classification Using One VAE per Class
Frédéric Piedboeuf | Philippe Langlais
Proceedings of the 29th International Conference on Computational Linguistics

In recent years, data augmentation has become an important field of machine learning. While images can use simple techniques such as cropping or rotating, textual data augmentation needs more complex manipulations to ensure that the generated examples are useful. Variational auto-encoders (VAE) and their conditional variant, the Conditional-VAE (CVAE), are often used to generate new textual data, both relying on a good enough training of the generator so that it doesn’t create examples of the wrong class. In this paper, we explore a simpler way to use VAE for data augmentation: the training of one VAE per class. We show on several dataset sizes, as well as on four different binary classification tasks, that it systematically outperforms other generative data augmentation techniques.

CILDA: Contrastive Data Augmentation Using Intermediate Layer Knowledge Distillation
Md Akmal Haidar | Mehdi Rezagholizadeh | Abbas Ghaddar | Khalil Bibi | Phillippe Langlais | Pascal Poupart
Proceedings of the 29th International Conference on Computational Linguistics

Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training. In this work, we propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used in improving the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching using contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.

2021

Exploiting Domain-Specific Knowledge for Judgment Prediction Is No Panacea
Olivier Salaün | Philippe Langlais | Karim Benyekhlef
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Legal judgment prediction (LJP) usually consists in a text classification task aimed at predicting the verdict on the basis of the fact description. The literature shows that the use of articles as input features helps improve the classification performance. In this work, we designed a verdict prediction task based on landlord-tenant disputes and we applied BERT-based models to which we fed different article-based features. Although the results obtained are consistent with the literature, the improvements with the articles are mostly obtained with the most frequent labels, suggesting that pre-trained and fine-tuned transformer-based models are not scalable as is for legal reasoning in real life scenarios as they would only excel in accurately predicting the most recurrent verdicts to the detriment of other legal outcomes.

Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise under KD.

End-to-End Self-Debiasing Framework for Robust NLU Training
Abbas Ghaddar | Phillippe Langlais | Mehdi Rezagholizadeh | Ahmad Rashid
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses a priori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
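The convex combination of losses mentioned in this abstract can be sketched for a single sample as below. This assumes cross-entropy for the label term and KL divergence on temperature-softened distributions for the teacher term, with a fixed weight `alpha`; all names are hypothetical, and RW-KD itself replaces the fixed `alpha` with a per-sample weight produced by a meta-learner, which is not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """alpha * CE(student, label) + (1 - alpha) * KL(teacher || student).

    Assumption: alpha is a fixed scalar in [0, 1] for this sketch.
    """
    ce = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1 - alpha) * kl


# Loss for one sample with illustrative logits.
loss = kd_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], label=0, alpha=0.7, temperature=2.0)
```

Setting `alpha` to 1 recovers plain supervised training, and setting it to 0 trains the student on the teacher output alone.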

Context-aware Adversarial Training for Name Regularity Bias in Named Entity Recognition
Abbas Ghaddar | Philippe Langlais | Ahmad Rashid | Mehdi Rezagholizadeh
Transactions of the Association for Computational Linguistics, Volume 9

In this work, we examine the ability of NER models to use contextual information when predicting the type of an ambiguous entity. We introduce NRB, a new testbed carefully designed to diagnose Name Regularity Bias of NER models. Our results indicate that all state-of-the-art models we tested show such a bias; BERT fine-tuned models significantly outperforming feature-based (LSTM-CRF) ones on NRB, despite having comparable (sometimes lower) performance on standard benchmarks. To mitigate this bias, we propose a novel model-agnostic training method that adds learnable adversarial noise to some entity mentions, thus enforcing models to focus more strongly on the contextual signal, leading to significant gains on NRB. Combining it with two other training strategies, data augmentation and parameter freezing, leads to further gains.

2020

HardEval: Focusing on Challenging Tokens to Assess Robustness of NER
Gabriel Bernier-Colborne | Phillippe Langlais
Proceedings of the Twelfth Language Resources and Evaluation Conference

To assess the robustness of NER systems, we propose an evaluation method that focuses on subsets of tokens that represent specific sources of errors: unknown words and label shift or ambiguity. These subsets provide a system-agnostic basis for evaluating specific sources of NER errors and assessing room for improvement in terms of robustness. We analyze these subsets of challenging tokens in two widely-used NER benchmarks, then exploit them to evaluate NER systems in both in-domain and out-of-domain settings. Results show that these challenging tokens explain the majority of errors made by modern NER systems, although they represent only a small fraction of test tokens. They also indicate that label shift is harder to deal with than unknown words, and that there is much more room for improvement than the standard NER evaluation procedure would suggest. We hope this work will encourage NLP researchers to adopt rigorous and meaningful evaluation methods, and will help them develop more robust models.

SEDAR: a Large Scale French-English Financial Domain Parallel Corpus
Abbas Ghaddar | Phillippe Langlais
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable to study domain adaptation for neural machine translation. The first release of the corpus comprises 8.6 million high quality sentence pairs that are publicly available for research at https://github.com/autorite/sedar-bitext.

Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora
Martin Laville | Amir Hazem | Emmanuel Morin | Phillippe Langlais
Proceedings of the 28th International Conference on Computational Linguistics

Narrow specialized comparable corpora are often small in size. This particularity makes it difficult to build efficient models to acquire translation equivalents, especially for less frequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment method was roughly conducted by adding out-of-domain data with no particular attention given to how to enrich words and how to do it optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely TF-IDF and cross entropy. Then, we propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best performing model is cross entropy, obtaining a gain of about 4 points in MAP while decreasing computation time by a factor of 10.

Deep neural models tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.

2019

WiRe57 : A Fine-Grained Benchmark for Open Information Extraction
William Lechelle | Fabrizio Gotti | Phillippe Langlais
Proceedings of the 13th Linguistic Annotation Workshop

We build a reference for the task of Open Information Extraction, on five documents. We tentatively resolve a number of issues that arise, including coreference and granularity, and we take steps toward addressing inference, a significant problem. We seek to better pinpoint the requirements for the task. We produce our annotation guidelines specifying what is correct to extract and what is not. In turn, we use this reference to score existing Open IE systems. We address the non-trivial problem of evaluating the extractions produced by systems against the reference tuples, and share our evaluation script. Among seven compared extractors, we find the MinIE system to perform best.

Contextualized Word Representations from Distant Supervision with and for NER
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We describe a special type of deep contextualized word representation that is learned from distant supervision annotations and dedicated to named entity recognition. Our extensive experiments on 7 datasets show systematic gains across all domains over strong baselines, and demonstrate that our representation is complementary to previously proposed embeddings. We report new state-of-the-art results on CONLL and ONTONOTES datasets.

SC-LSTM: Learning Task-Specific Representations in Multi-Task Learning for Sequence Labeling
Peng Lu | Ting Bai | Philippe Langlais
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Multi-task learning (MTL) has been studied recently for sequence labeling. Typically, auxiliary tasks are selected specifically in order to improve the performance of a target task. Jointly learning multiple tasks in a way that benefits all of them simultaneously can increase the utility of MTL. In order to do so, we propose a new LSTM cell which contains both shared parameters that can learn from all tasks, and task-specific parameters that can learn task-specific information. We name it the Shared-Cell Long-Short Term Memory (SC-LSTM). Experimental results on three sequence labeling benchmarks (named-entity recognition, text chunking, and part-of-speech tagging) demonstrate the effectiveness of our SC-LSTM cell.

2018

Extracting Parallel Sentences with Bidirectional Recurrent Neural Networks to Improve Machine Translation
Francis Grégoire | Philippe Langlais
Proceedings of the 27th International Conference on Computational Linguistics

Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose a bidirectional recurrent neural network based approach to extract parallel sentences from collections of multilingual texts. Our experiments with noisy parallel corpora show that we can achieve promising results against a competitive baseline by removing the need for specific feature engineering or additional external resources. To justify the utility of our approach, we extract sentence pairs from Wikipedia articles to train machine translation systems and show significant improvements in translation performance.

Robust Lexical Features for Improved Neural Network Named-Entity Recognition
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 27th International Conference on Computational Linguistics

Neural network approaches to Named-Entity Recognition reduce the need for carefully hand-crafted features. While some features do remain in state-of-the-art systems, lexical features have been mostly discarded, with the exception of gazetteers. In this work, we show that this is unfair: lexical features are actually quite useful. We propose to embed words and entity types into a low-dimensional vector space we train from annotated data produced by distant supervision thanks to Wikipedia. From this, we compute, offline, a feature vector representing each word. When used with a vanilla recurrent neural network model, this representation yields substantial improvements. We establish a new state-of-the-art F1 score of 87.95 on ONTONOTES 5.0, while matching state-of-the-art performance with an F1 score of 91.73 on the over-studied CONLL-2003 dataset.

Revisiting the Task of Scoring Open IE Relations
William Léchelle | Philippe Langlais
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Transforming Wikipedia into a Large-Scale Fine-Grained Entity Type Corpus
Abbas Ghaddar | Philippe Langlais
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Reranking Translation Candidates Produced by Several Bilingual Word Similarity Sources
Laurent Jakubina | Phillippe Langlais
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We investigate the reranking of the output of several distributional approaches on the Bilingual Lexicon Induction task. We show that reranking an n-best list produced by any of those approaches leads to very substantial improvements. We further demonstrate that combining several n-best lists by reranking is an effective way of further boosting performance.

Users and Data: The Two Neglected Children of Bilingual Natural Language Processing Research
Phillippe Langlais
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

Despite numerous studies devoted to mining parallel material from bilingual data, we have yet to see the resulting technologies wholeheartedly adopted by professional translators and terminologists alike. I argue that this state of affairs is mainly due to two factors: the emphasis published authors put on models (even though data is as important), and the conspicuous lack of concern for actual end-users.

BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora
Francis Grégoire | Philippe Langlais
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper describes our participation in the BUCC 2017 shared task: identifying parallel sentences in comparable corpora. Our goal is to leverage continuous vector representations and distributional semantics with a minimal use of external preprocessing and postprocessing tools. We report experiments that were conducted after transmitting our results.

Translating Implicit Discourse Connectives Based on Cross-lingual Annotation and Alignment
Hongzheng Li | Philippe Langlais | Yaohong Jin
Proceedings of the Third Workshop on Discourse in Machine Translation

Implicit discourse connectives and relations are more widely distributed in Chinese texts; when translating into English, such connectives are usually translated explicitly. Towards Chinese-English MT, in this paper we describe cross-lingual annotation and alignment of discourse connectives in a parallel corpus, describing related surveys and findings. We then conduct some evaluation experiments to test the translation of implicit connectives and whether representing implicit connectives explicitly in the source language can improve the final translation performance significantly. Preliminary results show that merely inserting explicit connectives for implicit relations brings little improvement.

WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition
Abbas Ghaddar | Phillippe Langlais
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We revisit the idea of mining Wikipedia in order to generate named-entity annotations. We propose a new methodology that we applied to English Wikipedia to build WiNER, a large, high quality, annotated corpus. We evaluate its usefulness on 6 NER tasks, comparing 4 popular state-of-the-art approaches. We show that LSTM-CRF is the approach that benefits the most from our corpus. We report impressive gains with this model when using a small portion of WiNER on top of the CONLL training material. Last, we propose a simple but efficient method for exploiting the full range of WiNER, leading to further improvements.

2016

BAD LUC@WMT 2016: a Bilingual Document Alignment Platform Based on Lucene
Laurent Jakubina | Phillippe Langlais
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Coreference in Wikipedia: Main Concept Resolution
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf abs
WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
Abbas Ghaddar | Phillippe Langlais
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents WikiCoref, an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Our annotation scheme follows that of OntoNotes, with a few differences. We annotated each markable with coreference type, mention type and the equivalent Freebase topic. Since most similar annotation efforts concentrate on very specific types of written text, mainly newswire, there is a lack of resources for otherwise over-used Wikipedia texts. The corpus described in this paper addresses this issue. We present a freely available resource we initially devised for improving coreference resolution algorithms dedicated to Wikipedia texts. Our corpus has no restriction on the topics of the documents being annotated, and documents of various sizes have been considered for annotation.

2015

pdf
Projective methods for mining missing translations in DBpedia
Laurent Jakubina | Phillippe Langlais
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

2014

pdf
Using distributed word representations for robust semantic role labeling (Utilisation de représentations de mots pour l’étiquetage de rôles sémantiques suivant FrameNet) [in French]
William Léchelle | Philippe Langlais
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf abs
An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus
Lise Rebout | Phillippe Langlais
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe an approach for mining parallel sentences in a collection of documents in two languages. While several approaches have been proposed for doing so, our proposal differs in several respects. First, we use a document-level classifier in order to focus on potentially fruitful document pairs, an understudied approach. We show that mining fewer, but more parallel, documents can lead to better gains in machine translation. Second, we compare different strategies for post-processing the output of a classifier trained to recognize parallel sentences. Last, we report a simple bootstrapping experiment which shows that promising sentence pairs extracted in a first stage can help to mine new sentence pairs in a second stage. We applied our approach to the English-French Wikipedia. Gains of a statistical machine translation (SMT) engine are analyzed across different test sets.

pdf abs
Hashtag Occurrences, Layout and Translation: A Corpus-driven Analysis of Tweets Published by the Canadian Government
Fabrizio Gotti | Phillippe Langlais | Atefeh Farzindar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present an aligned bilingual corpus of 8758 tweet pairs in French and English, derived from Canadian government agencies. Hashtags appear in a tweet’s prologue, announcing its topic, or in the tweet’s text in lieu of traditional words, or in an epilogue. Hashtags are words prefixed with a pound sign in 80% of the cases. The rest is mostly multiword hashtags, for which we describe a segmentation algorithm. A manual analysis of the bilingual alignment of 5000 hashtags shows that 5% (French) to 18% (English) of them don’t have a counterpart in their containing tweet’s translation. This analysis shows that 80% of multiword hashtags are correctly translated by humans, and that the mistranslation of the rest may be due to incomplete translation directives regarding social media. We show how these resources and their analysis can guide the design of a machine translation pipeline, and its evaluation. A baseline system implementing a tweet-specific tokenizer yields promising results. The system is improved by translating epilogues, prologues, and text separately. We attempt to feed the SMT engine with the original hashtag and some alternatives (“dehashed” version or a segmented version of multiword hashtags), but translation quality improves at the cost of hashtag recall.

pdf
Fourteen Light Tasks for comparing Analogical and Phrase-based Machine Translation
Rafik Rhouma | Phillippe Langlais
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf
Translating Government Agencies’ Tweet Feeds: Specificities, Problems and (a few) Solutions
Fabrizio Gotti | Philippe Langlais | Atefeh Farzindar
Proceedings of the Workshop on Language Analysis in Social Media

pdf
Yet Another Fast, Robust and Open Source Sentence Aligner. Time to Reconsider Sentence Alignment?
Fethi Lamraoui | Philippe Langlais
Proceedings of Machine Translation Summit XIV: Papers

pdf
Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
Phillippe Langlais
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf abs
Texto4Science: a Quebec French Database of Annotated Short Text Messages
Philippe Langlais | Patrick Drouin | Amélie Paulus | Eugénie Rompré Brodeur | Florent Cottin
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Quebec French part of the international sms4science project, called texto4science, was launched in October 2009. Over a period of 10 months, we collected slightly more than 7000 SMSs that we carefully annotated. This database is now ready to be used by the community. The purpose of this article is to relate the efforts put into designing this database and provide some data analysis of the main linguistic phenomena that we have annotated. We also report on a socio-linguistic survey we conducted within the project.

pdf bib abs
Identifying Infrequent Translations by Aligning Non Parallel Sentences
Julien Bourdaillet | Philippe Langlais
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

Aligning a sequence of words to one of its infrequent translations is a difficult task. We propose a simple and original solution to this problem that yields significant gains over a state-of-the-art transpotting task. Our approach consists of aligning non-parallel sentences from the training data in order to reinforce the alignment models online. We show that using only a few pairs of non-parallel sentences significantly improves the alignment of infrequent translations.

pdf bib
Atténuation des surdétections d’un correcteur grammatical de qualité commerciale [Reducing overdetections in a commercial grade grammar checker]
Fabrizio Gotti | Philippe Langlais | Guy Lapalme | Simon Charest | Eric Brunelle
Traitement Automatique des Langues, Volume 53, Numéro 3 : Du bruit dans le signal : gestion des erreurs en traitement automatique des langues [Managing noise in the signal: Error handling in natural language processing]

2011

pdf
Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia.
Alexandre Patry | Philippe Langlais
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf abs
Comparaison d’une approche miroir et d’une approche distributionnelle pour l’extraction de mots sémantiquement reliés (Comparing a mirror approach and a distributional approach for extracting semantically related words)
Philippe Muller | Philippe Langlais
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In (Muller & Langlais, 2010), we compared a distributional approach and a variant of the mirror approach proposed by Dyvik (2002) on a synonym extraction task from a French corpus. Here we present a finer-grained analysis of the automatically extracted relations, this time focusing on English, for which more extensive resources are available. Different ways of evaluating our approach corroborate the fact that the mirror approach performs better overall than the distributional approach described in (Lin, 1998), a reference approach in the field.

pdf bib
Moranapho: un système multilingue d’analyse morphologique basé sur l’analogie formelle [Moranapho: a multilingual system for morphological analysis based on formal analogy]
Jean-François Lavallée | Philippe Langlais
Traitement Automatique des Langues, Volume 52, Numéro 2 : Vers la morphologie et au-delà [Toward Morphology and beyond]

pdf
Going Beyond Word Cooccurrences in Global Lexical Selection for Statistical Machine Translation using a Multilayer Perceptron
Alexandre Patry | Philippe Langlais
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées
Philippe Langlais | Michel Gagnon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées

pdf bib
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Philippe Langlais | Michel Gagnon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

pdf bib abs
Alignement de traductions rares à l’aide de paires de phrases non alignées
Julien Bourdaillet | Stéphane Huet | Philippe Langlais
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Very often, the meaning of a word or expression can be rendered in another language by several translations. Among these, some turn out to be very frequent while others are much less so, in accordance with a Zipfian law. The googlization of our world does not spare translation memories, which often mishandle or simply ignore these rare translations, even though they are often of good quality. In this article, we look at these rare translations from the perspective of translation spotting. We argue that they are harder to identify than more frequent translations. We describe an original approach that identifies them better by taking advantage of the word-level alignment of sentence pairs that are not aligned. We show that this approach improves the identification of these rare translations.

pdf abs
Apprentissage non supervisé de la morphologie d’une langue par généralisation de relations analogiques
Jean-François Lavallée | Philippe Langlais
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Although approaches based on information theory are predominant in the field of unsupervised morphological analysis, other approaches have gained popularity in recent years, including those based on formal analogy. The latter nevertheless remains marginal, notably because of its high computational cost. In this article, we propose an algorithm based on formal analogy that can handle large lexicons. To this end, we introduce the concept of a cofactor rule, which generalizes the information captured by an analogy while keeping processing times under control. We compare our system to two systems: Morfessor (Creutz & Lagus, 2005), a reference system in many studies on morphological analysis, and the analogical system described by Langlais (2009). We show that it is superior for 3 of the 5 languages studied here: Finnish, Turkish, and German.

pdf bib
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
Philippe Langlais | Michel Gagnon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

pdf
Comparaison de ressources lexicales pour l’extraction de synonymes
Philippe Muller | Philippe Langlais
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

pdf bib
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Philippe Langlais | Michel Gagnon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

pdf abs
TransSearch : un moteur de recherche de traductions
Julien Bourdaillet | Fabrizio Gotti | Stéphane Huet | Philippe Langlais | Guy Lapalme
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Despite the many studies aimed at improving machine translation, computer-assisted translation remains the preferred solution of translators when high-quality output is required. This demonstration presents the translation search engine TransSearch. This commercial application, accessible on the Web, relies on the one hand on a sentence-aligned bitext, and on the other hand on statistical word alignment models.

pdf bib
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues
Alexandre Patry | Philippe Langlais | Aurélien Max
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

pdf
The RALI Machine Translation System for WMT 2010
Stéphane Huet | Julien Bourdaillet | Alexandre Patry | Philippe Langlais
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora
Audrey Laroche | Philippe Langlais
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
PARADOCS : l’entremetteur de documents parallèles indépendant de la langue [PARADOCS: A Language Independant Go-Between for Mating Parallel Documents]
Alexandre Patry | Philippe Langlais
Traitement Automatique des Langues, Volume 51, Numéro 2 : Multilinguisme et traitement automatique des langues [Multilingualism and Natural Language Processing]

2009

pdf
Improvements in Analogical Learning: Application to Translating Multi-Terms of the Medical Domain
Philippe Langlais | François Yvon | Pierre Zweigenbaum
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf
TS3: an Improved Version of the Bilingual Concordancer TransSearch
Stéphane Huet | Julien Bourdaillet | Philippe Langlais
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Prediction of Words in Statistical Machine Translation using a Multilayer Perceptron
Alexandre Patry | Philippe Langlais
Proceedings of Machine Translation Summit XII: Papers

pdf
Harnessing the Redundant Results of Translation Spotting
Stéphane Huet | Julien Bourdaillet | Philippe Langlais | Guy Lapalme
Proceedings of Machine Translation Summit XII: Posters

pdf abs
Étude quantitative de liens entre l’analogie formelle et la morphologie constructionnelle
Philippe Langlais
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Several studies have recently investigated the contribution of analogical learning to natural language processing applications such as machine translation or information retrieval. It is often assumed that formal analogical relations between words capture information of a morphological nature. The aim of this study is to present an analysis of the points of contact between morphological analysis and formal analogies. To our knowledge, it is the first study of this kind carried out on large corpora and on several languages. Although our study is not dedicated to a particular language processing task, we nevertheless show that the analogy principle makes it possible to segment words into morphemes with good precision.

pdf abs
Intégration de l’alignement de mots dans le concordancier bilingue TransSearch
Stéphane Huet | Julien Bourdaillet | Philippe Langlais
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Despite the many studies aimed at improving machine translation, computer-assisted translation remains the preferred solution of translators when high-quality output is required. In this article, we present our work aimed at improving the bilingual concordancer TransSearch. This service, accessible on the Web, relies mainly on sentence-level alignment. In this study, we discuss and evaluate the integration of statistical word-level alignment. We present two new issues that are essential to the success of our new prototype: the detection of erroneous translations and the grouping of similar translation variants.

pdf abs
Prise en compte de dépendances syntaxiques pour la traduction contextuelle de segments
Aurélien Max | Rafik Makhloufi | Philippe Langlais
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In a standard phrase-based statistical translation system, the score assigned to the different translations of a phrase does not depend on the context in which it appears. Several recent studies tend to show the value of taking source context into account during translation, but these studies deal with systems translating into English, a weakly inflected language. In this article, we describe our experiments on taking source context into account in a statistical system translating from English into French, based on the approach proposed by Stroppa et al. (2007). We study the impact of different types of features capturing contextual information, including typed syntactic dependencies. Although automatic translation quality metrics do not reveal significant gains for our system over a state-of-the-art system that makes no use of context, a manual evaluation conducted on 100 randomly chosen sentences favors our system. This evaluation also shows that taking certain syntactic dependencies into account benefits our system.

2008

pdf abs
MISTRAL: a Statistical Machine Translation Decoder for Speech Recognition Lattices
Alexandre Patry | Philippe Langlais
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents MISTRAL, an open source statistical machine translation decoder dedicated to spoken language translation. While typical machine translation systems take a written text as input, MISTRAL translates word lattices produced by automatic speech recognition systems. The lattices are translated in two passes using a phrase-based model. Our experiments reveal an improvement in BLEU when translating lattices instead of sentences returned by a speech recognition system.

pdf
Scaling up Analogical Learning
Philippe Langlais | François Yvon
Coling 2008: Companion volume: Posters

pdf
Explorations in using grammatical dependencies for contextual phrase translation disambiguation
Aurélien Max | Rafik Makhloufi | Philippe Langlais
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf bib
Enrichissement d’un lexique bilingue par apprentissage analogique [Enrichment of a Bilingual Lexicon by Analogical Learning]
Philippe Langlais | Alexandre Patry
Traitement Automatique des Langues, Volume 49, Numéro 1 : Varia [Varia]

pdf abs
Recherche locale pour la traduction statistique à base de segments
Philippe Langlais | Alexandre Patry | Fabrizio Gotti
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this study, we investigate local search algorithms for phrase-based statistical machine translation. The algorithms we study rely on a complete formulation of a state in the search space, unlike commonly used decoders, which explore the space of prefixes of possible translations. We show that local search alone can produce translations close in quality to those provided by the usual decoders, in considerably less time and at constant memory cost. We also show, on several translation directions, that it significantly improves the translations produced by the state-of-the-art system Pharaoh (Koehn, 2004).

2007

pdf
Translating Unknown Words by Analogical Learning
Philippe Langlais | Alexandre Patry
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
A greedy decoder for phrase-based statistical machine translation
Philippe Langlais | Alexandre Patry | Fabrizio Gotti
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf abs
MISTRAL: a lattice translation system for IWSLT 2007
Alexandre Patry | Philippe Langlais | Frédéric Béchet
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes MISTRAL, the lattice translation system that we developed for the Italian-English track of the International Workshop on Spoken Language Translation 2007. MISTRAL is a discriminative phrase-based system that translates a source word lattice in two passes. The first pass extracts a list of top ranked sentence pairs from the lattice and the second pass rescores this list with more complex features. Our experiments show that our system, when translating pruned lattices, is at least as good as a fair baseline that translates the first ranked sentences returned by a speech recognition system.

pdf abs
Enrichissement d’un lexique bilingue par analogie
Philippe Langlais | Alexandre Patry
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The presence of unknown words in language applications is a well-known major challenge from which machine translation is not exempt. To this end, professional translation systems offer their users the ability to enrich a base lexicon with new entries. Recently, Stroppa and Yvon (2005) demonstrated the value of analogical reasoning for the morphological analysis of a language. In this study, we show that analogical reasoning also offers a suitable answer to the problem of translating unknown lexical entries.

2006

pdf abs
De la Chambre des communes à la chambre d’isolement : adaptabilité d’un système de traduction basé sur les segments de phrases
Philippe Langlais | Fabrizio Gotti | Alexandre Patry
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present our participation in the second evaluation campaign of CESTA, an EVALDA project of the Technolangue action. The goal of this campaign was to test the ability of translation systems to adapt quickly to a specific task. We analyze the fragility of a probabilistic translation system trained on an out-of-domain corpus and list the experiments we carried out to adapt our system to the medical domain.

pdf abs
Vers l’intégration du contexte dans une mémoire de traduction sous-phrastique : détection du domaine de traduction
Fabrizio Gotti | Philippe Langlais | Claude Coulombe
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this article, we present a sub-sentential translation memory that is sensitive to the translation domain, a first step toward integrating context. This system is able to recycle translations already “seen” by the memory, not only for complete sentences but also for contiguous subsequences of these sentences, via a word aligner. The sequences deemed interesting are proposed to the translator. We also explain the creation of an artificial user, which is essential for testing the system’s performance in the absence of human intervention. We test it on the translation of a disparate set of corpora. Performance is expressed through a set of metrics that we define. Finally, we show that automatic detection of the translation context can prove beneficial and promising for improving the operation of such a memory, by acting as a filter on the suggested target material.

pdf abs
MOOD: A Modular Object-Oriented Decoder for Statistical Machine Translation
Alexandre Patry | Fabrizio Gotti | Philippe Langlais
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present an Open Source framework called MOOD developed in order to facilitate the development of a Statistical Machine Translation Decoder. MOOD has been modularized using an object-oriented approach which makes it especially suitable for the fast development of state-of-the-art decoders. As a proof of concept, a clone of the pharaoh decoder has been implemented and evaluated. This clone named ramses is part of the current distribution of MOOD.

pdf
Phrase-Based SMT with Shallow Tree-Phrases
Philippe Langlais | Fabrizio Gotti
Proceedings on the Workshop on Statistical Machine Translation

pdf
Mood at work: Ramses versus Pharaoh
Alexandre Patry | Fabrizio Gotti | Philippe Langlais
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf abs
EBMT by Tree-Phrasing: a Pilot Study
Philippe Langlais | Fabrizio Gotti | Didier Bourigault | Claude Coulombe
Workshop on example-based machine translation

We present a study we conducted to build a repository storing associations between simple dependency treelets in a source language and their corresponding phrases in a target language. To assess the impact of this resource in EBMT, we used the repository to compute coverage statistics on a test bitext and on an n-best list of translation candidates produced by a standard phrase-based decoder.

pdf abs
Paradocs: un système d’identification automatique de documents parallèles
Alexandre Patry | Philippe Langlais
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Parallel corpora are of crucial importance for multilingual natural language processing applications. Unfortunately, their scarcity is the weak link of several applications of interest. Extracting such corpora from the Web is a viable solution, but it introduces a new problem: it is not always trivial to identify the parallel documents among all those that have been extracted. In this article, we address the automatic identification of the pairs of parallel documents contained in a bilingual corpus. We show that this task can be accomplished accurately using a small set of lexical invariants. We also evaluate our approach on a machine translation task and show that it obtains better results than a reference system that makes use of a bilingual lexicon.

This article presents a statistical machine translation method based on non-contiguous phrases, that is, phrases made up of words that do not necessarily appear contiguously in the text. We propose a method for producing such phrases from word-aligned corpora. We also present a statistical translation model capable of handling such phrases, as well as a method for learning the model parameters that aims to maximize the accuracy of the translations produced, as measured by the NIST metric. Optimal translations are produced by means of a beam search. Finally, we present experimental results showing how the proposed method allows better generalization from the training data.

pdf abs
Approches en corpus pour la traduction : le cas MÉTÉO
Philippe Langlais | Thomas Leplus | Simona Gandrabur | Guy Lapalme
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Machine translation (MT) has for several years attracted the interest of a growing number of researchers. Many approaches are proposed, and several evaluation campaigns punctuate the progress made. The translation task undertaken by participants in these campaigns almost invariably consists of translating news articles from a foreign language into English, a task that may seem artificial. In this study, we ask what different corpus-based approaches can do on a real task. To this end, we rebuilt one of the greatest successes of MT: the MÉTÉO system. We show that a combination of translation memory and statistical approaches achieves results comparable to those of the MÉTÉO system, while offering a shorter development cycle and greater possibilities for adjustment.

pdf
From the real world to real words: the METEO case
Philippe Langlais | Thomas Leplus | Simona Gandrabur | Guy Lapalme
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
NUKTI: English-Inuktitut Word Alignment System Description
Philippe Langlais | Fabrizio Gotti | Guihong Cao
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf
RALI: SMT Shared Task System Description
Philippe Langlais | Guihong Cao | Fabrizio Gotti
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf abs
Désambiguïsation de corpus monolingues par des approches de type Lesk
Florentina Vasilescu | Philippe Langlais
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a detailed analysis of the factors that determine the performance of disambiguation approaches derived from the method of Lesk (1986). Our study covers a series of experiments on Lesk’s original method and on variants that we adapted to the characteristics of WORDNET. The implemented variants were evaluated on the SENSEVAL2 English All Words test corpus, as well as on excerpts from the SEMCOR corpus. Our evaluation is based, on the one hand, on the computation of precision and recall, following the SENSEVAL model, and, on the other hand, on a taxonomy of responses that makes it possible to measure the risk taken by a decision maker relative to a reference system.

pdf bib abs
Mots composés dans les modèles de langue pour la recherche d’information
Carmen Alvarez | Philippe Langlais | Jian-Yun Nie
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

A classic approach in information retrieval (IR) consists of building a representation of documents and queries based on the single words they contain. The use of bigram models has been studied, but the constraints on word order and adjacency in this work are not always justified for information retrieval. We propose a new approach based on language models that incorporate lexical affinities (LAs), that is, unordered pairs of words that occur close together in a text. We describe this model and compare it to the more traditional unigram and bigram models, as well as to the vector space model.

pdf
Experimenting with phrase-based statistical translation within the IWSLT Chinese-to-English shared translation task
Philippe Langlais | Michael Carl | Oliver Streiter
Proceedings of the First International Workshop on Spoken Language Translation: Evaluation Campaign

pdf abs
Weather report translation using a translation memory
Thomas Leplus | Philippe Langlais | Guy Lapalme
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

We describe the use of a translation memory in the context of a reconstruction of a landmark application of machine translation, the Canadian English to French weather report translation system. This system, which has been in operation for more than 20 years, was developed using a classical symbolic approach. We describe our experiment in developing an alternative approach based on the analysis of hundreds of thousands of weather reports. We show that it is possible to obtain excellent translations using translation memory techniques and we analyze the kinds of translation errors that are induced by this approach.

pdf
Evaluating Variants of the Lesk Approach for Disambiguating Words
Florentina Vasilescu | Philippe Langlais | Guy Lapalme
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
CESTA: Machine Translation Evaluation Campaign [Work-in-Progress Project Report]
Widad Mustafa El Hadi | Marianne Dabbadie | Ismaïl Timimi | Martin Rajman | Philippe Langlais | Antony Hartley | Andrei Popescu Belis
Proceedings of the Second International Workshop on Language Resources for Translation Work, Research and Training

pdf
Adaptive Language and Translation Models for Interactive Machine Translation
Laurent Nepveu | Guy Lapalme | Philippe Langlais | George Foster
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

pdf
Tuning general translation knowledge to a sublanguage
Michael Carl | Philippe Langlais
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT

pdf
Statistical Translation Alignment with Compositionality Constraints
Michel Simard | Philippe Langlais
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf abs
De la traduction probabiliste aux mémoires de traduction (ou l’inverse)
Philippe Langlais | Michel Simard
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Despite the work carried out over the past decade in the general framework of probabilistic translation, we are still far from the day when a machine translation engine (probabilistic or not) will be able to fully meet the needs of a professional translator. In a recent study (Langlais, 2002), we showed how a probabilistic translation engine could benefit from external terminological resources. In this study, we show that probabilistic translation techniques can be used to extract sub-sentential information from a translation memory. This information can in turn prove useful to a probabilistic translation engine. We report results on a large test corpus using the translation memory of a commercial bilingual concordancer.

We describe an experiment in rapid development of a statistical machine translation (SMT) system from scratch, using limited resources: under this heading we include not only training data, but also computing power, linguistic knowledge, programming effort, and absolute time.

2002

pdf abs
Text prediction with fuzzy alignment
George Foster | Philippe Langlais | Guy Lapalme
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

Text prediction is a form of interactive machine translation that is well suited to skilled translators. In recent work it has been shown that simple statistical translation models can be applied within a user-modeling framework to improve translator productivity by over 10% in simulated results. For the sake of efficiency in making real-time predictions, these models ignore the alignment relation between source and target texts. In this paper we introduce a new model that captures fuzzy alignments in a very simple way, and show that it gives modest improvements in predictive performance without significantly increasing the time required to generate predictions.

pdf abs
Merging example-based and statistical machine translation: an experiment
Philippe Langlais | Michel Simard
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

Despite the exciting work accomplished over the past decade in the field of Statistical Machine Translation (SMT), we are still far from the point of being able to say that machine translation fully meets the needs of real-life users. In a previous study [6], we have shown how an SMT engine could benefit from terminological resources, especially when translating texts very different from those used to train the system. In the present paper, we discuss the opening of SMT to examples automatically extracted from a Translation Memory (TM). We report results on a fair-sized translation task using the database of a commercial bilingual concordancer.

pdf
Translators at work with TRANSTYPE: Resource and Evaluation.
Philippe Langlais | Marie Loranger | Guy Lapalme
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf
User-Friendly Text Prediction For Translators
George Foster | Philippe Langlais | Guy Lapalme
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

pdf bib
An Intelligent Terminology Database as a Pre-processor for Statistical Machine Translation
Michael Carl | Philippe Langlais
COLING-02: COMPUTERM 2002: Second International Workshop on Computational Terminology

pdf
Improving a general-purpose Statistical Translation Engine by Terminological lexicons
Philippe Langlais
COLING-02: COMPUTERM 2002: Second International Workshop on Computational Terminology

pdf bib abs
Ressources terminologiques et traduction probabiliste: premiers pas positifs vers un système adaptatif
Philippe Langlais
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The past decade has witnessed significant advances in the field of statistical translation (ST). However, no fine-grained evaluation has been proposed to measure the adequacy of the statistical approach in a real application context. In this study, we examine the behavior of a probabilistic translation engine when it translates a text very different in nature from the corpus used for training. In particular, we quantify the drop in system performance and develop the idea that integrating terminological resources into the process is a natural and beneficial solution for translation. We describe this integration and evaluate its potential.

2001

pdf abs
Récupération de segments sous-phrastiques dans une mémoire de traduction
Philippe Langlais | Michel Simard
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The usefulness of translation support tools based on translation memories is often limited by the nature of the segments they put in correspondence, most often whole sentences. This article examines the potential of a type of system that would be able to retrieve the translation of word sequences of arbitrary length.

pdf abs
Integrating bilingual lexicons in a probabilistic translation assistant
Philippe Langlais | George Foster | Guy Lapalme
Proceedings of Machine Translation Summit VIII

In this paper, we present a way to integrate bilingual lexicons into an operational probabilistic translation assistant (TransType). These lexicons could be any resource available to the translator (e.g. terminological lexicons) or any resource statistically derived from training material. We describe a bilingual lexicon acquisition process that we developed, and we evaluate, from a theoretical point of view, its benefits to a translation completion task.

pdf abs
Sub-sentential exploitation of translation memories
Michel Simard | Philippe Langlais
Proceedings of Machine Translation Summit VIII

Translation memory systems (TMS) are a family of computer tools whose purpose is to facilitate and encourage the re-use of existing translations. By searching a database of past translations, these systems can retrieve the translation of whole segments of text and propose them to the translator for re-use. However, the usefulness of existing TMS’s is limited by the nature of the text segments that they are able to put in correspondence, generally whole sentences. This article examines the potential of a type of system that is able to recuperate the translation of sub-sentential sequences of words.