Alexandre Allauzen

2021

pdf abs
Measure and Evaluation of Semantic Divergence across Two Languages
Syrielle Montariol | Alexandre Allauzen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Languages are dynamic systems: word usage may change over time, reflecting various societal factors. However, all languages do not evolve identically: the impact of an event, the influence of a trend or thinking, can differ between communities. In this paper, we propose to track these divergences by comparing the evolution of a word and its translation across two languages. We investigate several methods of building time-varying and bilingual word embeddings, using contextualised and non-contextualised embeddings. We propose a set of scenarios to characterize semantic divergence across two languages, along with a setup to differentiate them in a bilingual corpus. We evaluate the different methods by generating a corpus of synthetic semantic change across two languages, English and French, before applying them to newspaper corpora to detect bilingual semantic divergence and provide qualitative insight for the task. We conclude that BERT embeddings coupled with a clustering step lead to the best performance on synthetic corpora; however, the performance of CBOW embeddings is very competitive and more adapted to an exploratory analysis on a large corpus.

pdf abs
Transport Optimal pour le Changement Sémantique à partir de Plongements Contextualisés (Optimal Transport for Semantic Change Detection using Contextualised Embeddings )
Syrielle Montariol | Alexandre Allauzen
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Plusieurs méthodes de détection des changements sémantiques utilisant des plongements lexicaux contextualisés sont apparues récemment. Elles permettent une analyse fine du changement d’usage des mots, en agrégeant les plongements contextualisés en clusters qui reflètent les différents usages d’un mot. Nous proposons une nouvelle méthode basée sur le transport optimal. Nous l’évaluons sur plusieurs corpus annotés, montrant un gain de précision par rapport aux autres méthodes utilisant des plongements contextualisés, et l’illustrons sur un corpus d’articles de journaux.

2020

pdf abs
FlauBERT : des modèles de langue contextualisés pré-entraînés pour le français (FlauBERT : Unsupervised Language Model Pre-training for French)
Hang Le | Loïc Vial | Jibril Frej | Vincent Segonne | Maximin Coavoux | Benjamin Lecouteux | Alexandre Allauzen | Benoît Crabbé | Laurent Besacier | Didier Schwab
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Les modèles de langue pré-entraînés sont désormais indispensables pour obtenir des résultats à l’état-de-l’art dans de nombreuses tâches du TALN. Tirant avantage de l’énorme quantité de textes bruts disponibles, ils permettent d’extraire des représentations continues des mots, contextualisées au niveau de la phrase. L’efficacité de ces représentations pour résoudre plusieurs tâches de TALN a été démontrée récemment pour l’anglais. Dans cet article, nous présentons et partageons FlauBERT, un ensemble de modèles appris sur un corpus français hétérogène et de taille importante. Des modèles de complexité différente sont entraînés à l’aide du nouveau supercalculateur Jean Zay du CNRS. Nous évaluons nos modèles de langue sur diverses tâches en français (classification de textes, paraphrase, inférence en langage naturel, analyse syntaxique, désambiguïsation automatique) et montrons qu’ils surpassent souvent les autres approches sur le référentiel d’évaluation FLUE également présenté ici.

pdf abs
Étude des variations sémantiques à travers plusieurs dimensions (Studying semantic variations through several dimensions )
Syrielle Montariol | Alexandre Allauzen
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Au sein d’une langue, l’usage des mots varie selon deux axes : diachronique (dimension temporelle) et synchronique (variation selon l’auteur, la communauté, la zone géographique... ). Dans ces travaux, nous proposons une méthode de détection et d’interprétation des variations d’usages des mots à travers ces différentes dimensions. Pour cela, nous exploitons les capacités d’une nouvelle ligne de plongements lexicaux contextualisés, en particulier le modèle BERT. Nous expérimentons sur un corpus de rapports financiers d’entreprises françaises, pour appréhender les enjeux et préoccupations propres à certaines périodes, acteurs et secteurs d’activités.

Language models have become a key step to achieve state-of-the art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

pdf bib
Variations in Word Usage for the Financial Domain
Syrielle Montariol | Alexandre Allauzen | Asanobu Kitamoto
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

2019

pdf abs
Empirical Study of Diachronic Word Embeddings for Scarce Data
Syrielle Montariol | Alexandre Allauzen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word meaning change can be inferred from drifts of time-varying word embeddings. However, temporal data may be too sparse to build robust word embeddings and to discriminate significant drifts from noise. In this paper, we compare three models to learn diachronic word embeddings on scarce data: incremental updating of a Skip-Gram from Kim et al. (2014), dynamic filtering from Bamler & Mandt (2017), and dynamic Bernoulli embeddings from Rudolph & Blei (2018). In particular, we study the performance of different initialisation schemes and emphasise what characteristics of each model are more suitable to data scarcity, relying on the distribution of detected drifts. Finally, we regularise the loss of these models to better adapt to scarce data.

pdf bib abs
Apprentissage de plongements de mots dynamiques avec régularisation de la dérive (Learning dynamic word embeddings with drift regularisation)
Syrielle Montariol | Alexandre Allauzen
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

L’usage, le sens et la connotation des mots peuvent changer au cours du temps. Les plongements lexicaux diachroniques permettent de modéliser ces changements de manière non supervisée. Dans cet article nous étudions l’impact de plusieurs fonctions de coût sur l’apprentissage de plongements dynamiques, en comparant les comportements de variantes du modèle Dynamic Bernoulli Embeddings. Les plongements dynamiques sont estimés sur deux corpus couvrant les mêmes deux décennies, le New York Times Annotated Corpus en anglais et une sélection d’articles du journal Le Monde en français, ce qui nous permet de mettre en place un processus d’analyse bilingue de l’évolution de l’usage des mots.

pdf abs
Exploring sentence informativeness
Syrielle Montariol | Aina Garí Soler | Alexandre Allauzen
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

This study is a preliminary exploration of the concept of informativeness –how much information a sentence gives about a word it contains– and its potential benefits to building quality word representations from scarce data. We propose several sentence-level classifiers to predict informativeness, and we perform a manual annotation on a set of sentences. We conclude that these two measures correspond to different notions of informativeness. However, our experiments show that using the classifiers’ predictions to train word embeddings has an impact on embedding quality.

pdf bib
LIMSI-MULTISEM at the IJCAI SemDeep-5 WiC Challenge: Context Representations for Word Usage Similarity Estimation
Aina Garí Soler | Marianna Apidianaki | Alexandre Allauzen
Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)

pdf bib abs
Word Usage Similarity Estimation with Sentence Representations and Automatic Substitutes
Aina Garí Soler | Marianna Apidianaki | Alexandre Allauzen
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Usage similarity estimation addresses the semantic proximity of word instances in different contexts. We apply contextualized (ELMo and BERT) word and sentence embeddings to this task, and propose supervised models that leverage these representations for prediction. Our models are further assisted by lexical substitute annotations automatically assigned to word instances by context2vec, a neural model that relies on a bidirectional LSTM. We perform an extensive comparison of existing word and sentence representations on benchmark datasets addressing both graded and binary similarity.The best performing models outperform previous methods in both settings.

2018

pdf bib
Apprentissage profond pour le traitement automatique des langues [Deep Learning for Natural Language Processing]
Alexandre Allauzen | Hinrich Schütze
Traitement Automatique des Langues 2018 Volume 59 Numéro 2

pdf abs
Learning with Noise-Contrastive Estimation: Easing training by learning to scale
Matthieu Labeau | Alexandre Allauzen
Proceedings of the 27th International Conference on Computational Linguistics

Noise-Contrastive Estimation (NCE) is a learning criterion that is regularly used to train neural language models in place of Maximum Likelihood Estimation, since it avoids the computational bottleneck caused by the output softmax. In this paper, we analyse and explain some of the weaknesses of this objective function, linked to the mechanism of self-normalization, by closely monitoring comparative experiments. We then explore several remedies and modifications to propose tractable and efficient NCE training strategies. In particular, we propose to make the scaling factor a trainable parameter of the model, and to use the noise distribution to initialize the output bias. These solutions, yet simple, yield stable and competitive performances in either small and large scale language modelling tasks.

pdf abs
Algorithmes à base d’échantillonage pour l’entraînement de modèles de langue neuronaux (Here the title in English)
Matthieu Labeau | Alexandre Allauzen
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

L’estimation contrastive bruitée (NCE) et l’échantillonage par importance (IS) sont des procédures d’entraînement basées sur l’échantillonage, que l’on utilise habituellement à la place de l’estimation du maximum de vraisemblance (MLE) pour éviter le calcul du softmax lorsque l’on entraîne des modèles de langue neuronaux. Dans cet article, nous cherchons à résumer le fonctionnement de ces algorithmes, et leur utilisation dans la littérature du TAL. Nous les comparons expérimentalement, et présentons des manières de faciliter l’entraînement du NCE.

pdf abs
A comparative study of word embeddings and other features for lexical complexity detection in French
Aina Garí Soler | Marianna Apidianaki | Alexandre Allauzen
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Lexical complexity detection is an important step for automatic text simplification which serves to make informed lexical substitutions. In this study, we experiment with word embeddings for measuring the complexity of French words and combine them with other features that have been shown to be well-suited for complexity prediction. Our results on a synonym ranking task show that embeddings perform better than other features in isolation, but do not outperform frequency-based systems in this language.

2017

pdf bib abs
Character and Subword-Based Word Representation for Neural Language Modeling Prediction
Matthieu Labeau | Alexandre Allauzen
Proceedings of the First Workshop on Subword and Character Level Models in NLP

Most of neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated to each word in the vocabulary or derived from characters as well as factored morphological decomposition, these word representations are mainly used to parametrize the input, i.e. the context of prediction. This work investigates the effect of using subword units (character and factored morphological decomposition) to build output representations for neural language modeling. We present a case study on Czech, a morphologically-rich language, experimenting with different input and output representations. When working with the full training vocabulary, despite unstable training, our experiments show that augmenting the output word representations with character-based embeddings can significantly improve the performance of the model. Moreover, reducing the size of the output look-up table, to let the character-based embeddings represent rare words, brings further improvement.

pdf
LIMSI@WMT’17
Franck Burlot | Pooyan Safari | Matthieu Labeau | Alexandre Allauzen | François Yvon
Proceedings of the Second Conference on Machine Translation

pdf abs
Représentations continues dérivées des caractères pour un modèle de langue neuronal à vocabulaire ouvert (Opening the vocabulary of neural language models with character-level word representations)
Matthieu Labeau | Alexandre Allauzen
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Cet article propose une architecture neuronale pour un modèle de langue à vocabulaire ouvert. Les représentations continues des mots sont calculées à la volée à partir des caractères les composant, gràce à une couche convolutionnelle suivie d’une couche de regroupement (pooling). Cela permet au modèle de représenter n’importe quel mot, qu’il fasse partie du contexte ou soit évalué pour la prédiction. La fonction objectif est dérivée de l’estimation contrastive bruitée (Noise Contrastive Estimation, ou NCE), calculable dans notre cas sans vocabulaire. Nous évaluons la capacité de notre modèle à construire des représentations continues de mots inconnus sur la tâche de traduction automatique IWSLT-2016, de l’Anglais vers le Tchèque, en ré-évaluant les N meilleures hypothèses (N-best reranking). Les résultats expérimentaux permettent des gains jusqu’à 0,7 point BLEU. Ils montrent aussi la difficulté d’utiliser des représentations dérivées des caractères pour la prédiction.

pdf abs
Adaptation au domaine pour l’analyse morpho-syntaxique (Domain Adaptation for PoS tagging)
Éléonor Bartenlian | Margot Lacour | Matthieu Labeau | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Ce travail cherche à comprendre pourquoi les performances d’un analyseur morpho-syntaxiques chutent fortement lorsque celui-ci est utilisé sur des données hors domaine. Nous montrons à l’aide d’une expérience jouet que ce comportement peut être dû à un phénomène de masquage des caractéristiques lexicalisées par les caractéristiques non lexicalisées. Nous proposons plusieurs modèles essayant de réduire cet effet.

pdf abs
An experimental analysis of Noise-Contrastive Estimation: the noise distribution matters
Matthieu Labeau | Alexandre Allauzen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Noise Contrastive Estimation (NCE) is a learning procedure that is regularly used to train neural language models, since it avoids the computational bottleneck caused by the output softmax. In this paper, we attempt to explain some of the weaknesses of this objective function, and to draw directions for further developments. Experiments on a small task show the issues raised by an unigram noise distribution, and that a context dependent noise distribution, such as the bigram distribution, can solve these issues and provide stable and data-efficient learning.

2016

pdf abs
Une méthode non-supervisée pour la segmentation morphologique et l’apprentissage de morphotactique à l’aide de processus de Pitman-Yor (An unsupervised method for joint morphological segmentation and morphotactics learning using Pitman-Yor processes)
Kevin Löser | Alexandre Allauzen
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Cet article présente un modèle bayésien non-paramétrique pour la segmentation morphologique non supervisée. Ce modèle semi-markovien s’appuie sur des classes latentes de morphèmes afin de modéliser les caractéristiques morphotactiques du lexique, et son caractère non-paramétrique lui permet de s’adapter aux données sans avoir à spécifier à l’avance l’inventaire des morphèmes ainsi que leurs classes. Un processus de Pitman-Yor est utilisé comme a priori sur les paramètres afin d’éviter une convergence vers des solutions dégénérées et inadaptées au traitemement automatique des langues. Les résultats expérimentaux montrent la pertinence des segmentations obtenues pour le turc et l’anglais. Une étude qualitative montre également que le modèle infère une morphotactique linguistiquement pertinente, sans le recours à des connaissances expertes quant à la structure morphologique des formes de mots.

This paper describes LIMSI’s submission to the MT track of IWSLT 2016. We report results for translation from English into Czech. Our submission is an attempt to address the difficulties of translating into a morphologically rich language by paying special attention to the morphology generation on target side. To this end, we propose two ways of improving the morphological fluency of the output: 1. by performing translation and inflection of the target language in two separate steps, and 2. by using a neural language model with characted-based word representation. We finally present the combination of both methods used for our primary system submission.

2015

pdf abs
Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Quand on dispose de connaissances a priori sur les sorties possibles d’un problème d’étiquetage, il semble souhaitable d’inclure cette information lors de l’apprentissage pour simplifier la tâche de modélisation et accélérer les traitements. Pourtant, même lorsque ces contraintes sont correctes et utiles au décodage, leur utilisation lors de l’apprentissage peut dégrader sévèrement les performances. Dans cet article, nous étudions ce paradoxe et montrons que le manque de contraste induit par les connaissances entraîne une forme de sous-apprentissage qu’il est cependant possible de limiter.

pdf abs
Apprentissage discriminant des modèles continus de traduction
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Alors que les réseaux neuronaux occupent une place de plus en plus importante dans le traitement automatique des langues, les méthodes d’apprentissage actuelles utilisent pour la plupart des critères qui sont décorrélés de l’application. Cet article propose un nouveau cadre d’apprentissage discriminant pour l’estimation des modèles continus de traduction. Ce cadre s’appuie sur la définition d’un critère d’optimisation permettant de prendre en compte d’une part la métrique utilisée pour l’évaluation de la traduction et d’autre part l’intégration de ces modèles au sein des systèmes de traduction automatique. De plus, cette méthode d’apprentissage est comparée aux critères existants d’estimation que sont le maximum de vraisemblance et l’estimation contrastive bruitée. Les expériences menées sur la tâches de traduction des séminaires TED Talks de l’anglais vers le français montrent la pertinence d’un cadre discriminant d’apprentissage, dont les performances restent toutefois très dépendantes du choix d’une stratégie d’initialisation idoine. Nous montrons qu’avec une initialisation judicieuse des gains significatifs en termes de scores BLEU peuvent être obtenus.

pdf
Non-lexical neural architecture for fine-grained POS Tagging
Matthieu Labeau | Kevin Löser | Alexandre Allauzen
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
A Discriminative Training Procedure for Continuous Translation Models
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
ListNet-based MT Rescoring
Jan Niehues | Quoc Khanh Do | Alexandre Allauzen | Alex Waibel
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality
Alexandre Allauzen | Edward Grefenstette | Karl Moritz Hermann | Hugo Larochelle | Scott Wen-tau Yih
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

2014

pdf
Cross-Lingual POS Tagging through Ambiguous Learning: First Experiments (Apprentissage partiellement supervisé d’un étiqueteur morpho-syntaxique par transfert cross-lingue) [in French]
Guillaume Wisniewski | Nicolas Pécheux | Elena Knyazeva | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf
Comparison of scheduling methods for the learning rate of neural network language models (Modèles de langue neuronaux: une comparaison de plusieurs stratégies d’apprentissage) [in French]
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)
Alexandre Allauzen | Raffaella Bernardi | Edward Grefenstette | Hugo Larochelle | Christopher Manning | Scott Wen-tau Yih
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)

This paper documents the systems developed by LIMSI for the IWSLT 2014 speech translation task (English→French). The main objective of this participation was twofold: adapting different components of the ASR baseline system to the peculiarities of TED talks and improving the machine translation quality on the automatic speech recognition output data. For the latter task, various techniques have been considered: punctuation and number normalization, adaptation to ASR errors, as well as the use of structured output layer neural network models for speech data.

pdf abs
Discriminative adaptation of continuous space translation models
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

In this paper we explore various adaptation techniques for continuous space translation models (CSTMs). We consider the following practical situation: given a large scale, state-of-the-art SMT system containing a CSTM, the task is to adapt the CSTM to a new domain using a (relatively) small in-domain parallel corpus. Our method relies on the definition of a new discriminative loss function for the CSTM that borrows from both the max-margin and pair-wise ranking approaches. In our experiments, the baseline out-of-domain SMT system is initially trained for the WMT News translation task, and the CSTM is to be adapted to the lecture translation task as defined by IWSLT evaluation campaign. Experimental results show that an improvement of 1.5 BLEU points can be achieved with the proposed adaptation method.

LIMSI took part in the IWSLT 2011 TED task in the MT track for English to French using the in-house n-code system, which implements the n-gram based approach to Machine Translation. This framework not only allows to achieve state-of-the-art results for this language pair, but is also appealing due to its conceptual simplicity and its use of well understood statistical language models. Using this approach, we compare several ways to adapt our existing systems and resources to the TED task with mixture of language models and try to provide an analysis of the modest gains obtained by training a log linear combination of inand out-of-domain models.

pdf abs
How good are your phrases? Assessing phrase quality with single class classification
Nadi Tomeh | Marco Turchi | Guillaume Wisinewski | Alexandre Allauzen | François Yvon
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers

We present a novel translation quality informed procedure for both extraction and scoring of phrase pairs in PBSMT systems. We reformulate the extraction problem in the supervised learning framework. Our goal is twofold. First, We attempt to take the translation quality into account; and second we incorporating arbitrary features in order to circumvent alignment errors. One-Class SVMs and the Mapping Convergence algorithm permit training a single-class classifier to discriminate between useful and useless phrase pairs. Such classifier can be learned from a training corpus that comprises only useful instances. The confidence score, produced by the classifier for each phrase pairs, is employed as a selection criteria. The smoothness of these scores allow a fine control over the size of the resulting translation model. Finally, confidence scores provide a new accuracy-based feature to score phrase pairs. Experimental evaluation of the method shows accurate assessments of phrase pairs quality even for regions in the space of possible phrase pairs that are ignored by other approaches. This enhanced evaluation of phrase pairs leads to improvements in the translation performance as measured by BLEU.

pdf abs
Estimation d’un modèle de traduction à partir d’alignements mot-à-mot non-déterministes (Estimating a translation model from non-deterministic word-to-word alignments)
Nadi Tomeh | Alexandre Allauzen | François Yvon
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Dans les systèmes de traduction statistique à base de segments, le modèle de traduction est estimé à partir d’alignements mot-à-mot grâce à des heuristiques d’extraction et de valuation. Bien que ces alignements mot-à-mot soient construits par des modèles probabilistes, les processus d’extraction et de valuation utilisent ces modèles en faisant l’hypothèse que ces alignements sont déterministes. Dans cet article, nous proposons de lever cette hypothèse en considérant l’ensemble de la matrice d’alignement, d’une paire de phrases, chaque association étant valuée par sa probabilité. En comparaison avec les travaux antérieurs, nous montrons qu’en utilisant un modèle exponentiel pour estimer de manière discriminante ces probabilités, il est possible d’obtenir des améliorations significatives des performances de traduction. Ces améliorations sont mesurées à l’aide de la métrique BLEU sur la tâche de traduction de l’arabe vers l’anglais de l’évaluation NIST MT’09, en considérant deux types de conditions selon la taille du corpus de données parallèles utilisées.

pdf
From n-gram-based to CRF-based Translation Models
Thomas Lavergne | Alexandre Allauzen | Josep Maria Crego | François Yvon
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf
LIMSI’s Statistical Translation Systems for WMT’10
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | François Yvon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
Training Continuous Space Language Models: Some Practical Issues
Hai Son Le | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf
Assessing Phrase-Based Translation Models with Oracle Decoding
Guillaume Wisniewski | Alexandre Allauzen | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf abs
Refining Word Alignment with Discriminative Training
Nadi Tomeh | Alexandre Allauzen | François Yvon | Guillaume Wisniewski
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

The quality of statistical machine translation systems depends on the quality of the word alignments that are computed during the translation model training phase. IBM alignment models, as implemented in the GIZA++ toolkit, constitute the de facto standard for performing these computations. The resulting alignments and translation models are however very noisy, and several authors have tried to improve them. In this work, we propose a simple and effective approach, which considers alignment as a series of independent binary classification problems in the alignment matrix. Through extensive feature engineering and the use of stacking techniques, we were able to obtain alignments much closer to manually defined references than those obtained by the IBM models. These alignments also yield better translation models, delivering improved performance in a large scale Arabic to English translation task.

This paper describes LIMSI’s Statistical Machine Translation systems (SMT) for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translation units), as well as the use of training data extracted from Wikipedia in order to adapt the target language model. For the BTEC task, we concentrated on pre-processing schemes on the Turkish side in order to reduce the morphological discrepancies with the English side. We also evaluated the use of two different continuous space language models for such a small size of training data.

2009

pdf
LIMSI‘s Statistical Translation Systems for WMT‘09
Alexandre Allauzen | Josep Crego | Aurélien Max | François Yvon
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf abs
Training and Evaluation of POS Taggers on the French MULTITAG Corpus
Alexandre Allauzen | Hélène Bonneau-Maynard
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving an important focus of attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluation for these kinds of linguistic resources. Therefore in this paper, three standard POS taggers (Treetagger, Brills tagger and the standard HMM POS tagger) are trained and evaluated in the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a tagset richer than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences of performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, taggers may be ranked as follows: Treetagger (95.7%), Brills tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.

2007

pdf
A state-of-the-art statistical machine translation system based on Moses
Daniel Déchelotte | Holger Schwenk | Hélène Bonneau-Maynard | Alexandre Allauzen | Gilles Adda
Proceedings of Machine Translation Summit XI: Papers

pdf abs
Modèles statistiques enrichis par la syntaxe pour la traduction automatique
Holger Schwenk | Daniel Déchelotte | Hélène Bonneau-Maynard | Alexandre Allauzen
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

La traduction automatique statistique par séquences de mots est une voie prometteuse. Nous présentons dans cet article deux évolutions complémentaires. La première permet une modélisation de la langue cible dans un espace continu. La seconde intègre des catégories morpho-syntaxiques aux unités manipulées par le modèle de traduction. Ces deux approches sont évaluées sur la tâche Tc-Star. Les résultats les plus intéressants sont obtenus par la combinaison de ces deux méthodes.

pdf
Combining Morphosyntactic Enriched Representation with n-best Reranking in Statistical Translation
Hélène Bonneau-Maynard | Alexandre Allauzen | Daniel Déchelotte | Holger Schwenk
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation