Marc Dymetman

2023

pdf bib abs
Should you marginalize over possible tokenizations?
Nadezhda Chirkova | Germán Kruszewski | Jos Rozen | Marc Dymetman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.

pdf abs
disco: a toolkit for Distributional Control of Generative Models
Germán Kruszewski | Jos Rozen | Marc Dymetman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, allow controlling the prevalence (i.e. expectations) of any features of interest in the model’s outputs. Despite their potential, the widespread adoption of these techniques has been hindered by the difficulty in adapting the complex, disconnected code. Here, we present disco, an open-source Python library that brings these techniques to the broader public

2019

pdf abs
Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness
Alexandre Berard | Ioan Calapodescu | Marc Dymetman | Claude Roux | Jean-Luc Meunier | Vassilina Nikoulina
Proceedings of the 3rd Workshop on Neural Generation and Translation

We share a French-English parallel corpus of Foursquare restaurant reviews, and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in a real-world scenario where better-quality MT would be greatly beneficial. We discuss the challenges of such user-generated content, and train good baseline models that build upon the latest techniques for MT robustness. We also perform an extensive evaluation (automatic and human) that shows significant improvements over existing online systems. Finally, we propose task-specific metrics based on sentiment analysis or translation accuracy of domain-specific polysemous words.

pdf abs
Global Autoregressive Models for Data-Efficient Sequence Learning
Tetiana Parshakova | Jean-Marc Andreoli | Marc Dymetman
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global a priori features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an unnormalized GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show a strong perplexity reduction of using the second autoregressive model over the standard one.

2018

pdf abs
Char2char Generation with Reranking for the E2E NLG Challenge
Shubham Agarwal | Marc Dymetman | Éric Gaussier
Proceedings of the 11th International Conference on Natural Language Generation

This paper describes our submission to the E2E NLG Challenge. Recently, neural seq2seq approaches have become mainstream in NLG, often resorting to pre- (respectively post-) processing delexicalization (relexicalization) steps at the word-level to handle rare words. By contrast, we train a simple character level seq2seq model, which requires no pre/post-processing (delexicalization, tokenization or even lowercasing), with surprisingly good results. For further improvement, we explore two re-ranking approaches for scoring candidates. We also introduce a synthetic dataset creation procedure, which opens up a new way of creating artificial datasets for Natural Language Generation.

2017

pdf abs
A surprisingly effective out-of-the-box char2char model on the E2E NLG Challenge dataset
Shubham Agarwal | Marc Dymetman
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

We train a char2char model on the E2E NLG Challenge data, by exploiting “out-of-the-box” the recently released tfseq2seq framework, using some of the standard options offered by this tool. With minimal effort, and in particular without delexicalization, tokenization or lowercasing, the obtained raw predictions, according to a small scale human evaluation, are excellent on the linguistic side and quite reasonable on the adequacy side, the primary downside being the possible omissions of semantic material. However, in a significant number of cases (more than 70%), a perfect solution can be found in the top-20 predictions, indicating promising directions for solving the remaining issues.

2016

pdf
Orthogonality regularizer for question answering
Chunyang Xiao | Guillaume Bouchard | Marc Dymetman | Claire Gardent
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

pdf
LSTM-Based Mixture-of-Experts for Knowledge-Aware Dialogues
Phong Le | Marc Dymetman | Jean-Michel Renders
Proceedings of the 1st Workshop on Representation Learning for NLP

pdf abs
Natural Language Generation through Character-based RNNs with Finite-state Prior Knowledge
Raghav Goyal | Marc Dymetman | Eric Gaussier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Recently Wen et al. (2015) have proposed a Recurrent Neural Network (RNN) approach to the generation of utterances from dialog acts, and shown that although their model requires less effort to develop than a rule-based system, it is able to improve certain aspects of the utterances, in particular their naturalness. However their system employs generation at the word-level, which requires one to pre-process the data by substituting named entities with placeholders. This pre-processing prevents the model from handling some contextual effects and from managing multiple occurrences of the same attribute. Our approach uses a character-level model, which unlike the word-level model makes it possible to learn to “copy” information from the dialog act to the target without having to pre-process the input. In order to avoid generating non-words and inventing information not present in the input, we propose a method for incorporating prior knowledge into the RNN in the form of a weighted finite-state automaton over character sequences. Automatic and human evaluations show improved performance over baselines on several evaluation criteria.

pdf
Sequence-based Structured Prediction for Semantic Parsing
Chunyang Xiao | Marc Dymetman | Claire Gardent
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf
Reversibility reconsidered: finite-state factors for efficient probabilistic sampling in parsing and generation
Marc Dymetman | Sriram Venkatapathy | Chunyang Xiao
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf abs
Adaptation par enrichissement terminologique en traduction automatique statistique fondée sur la génération et le filtrage de bi-segments virtuels
Christophe Servan | Marc Dymetman
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Nous présentons des travaux préliminaires sur une approche permettant d’ajouter des termes bilingues à un système de Traduction Automatique Statistique (TAS) à base de segments. Les termes sont non seulement inclus individuellement, mais aussi avec des contextes les englobant. Tout d’abord nous générons ces contextes en généralisant des motifs (ou patrons) observés pour des mots de même nature syntaxique dans un corpus bilingue. Enfin, nous filtrons les contextes qui n’atteignent pas un certain seuil de confiance, à l’aide d’une méthode de sélection de bi-segments inspirée d’une approche de sélection de données, précédemment appliquée à des textes bilingues alignés.

2014

For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.

pdf
A Lightweight Terminology Verification Service for External Machine Translation Engines
Alessio Bosca | Vassilina Nikoulina | Marc Dymetman
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Exact Decoding for Phrase-Based Statistical Machine Translation
Wilker Aziz | Marc Dymetman | Lucia Specia
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present a conditional-random-field approach to discriminatively-trained phrase-based machine translation in which training and decoding are both cast in a sampling framework and are implemented uniformly in a new probabilistic programming language for factor graphs. In traditional phrase-based translation, decoding infers both a "Viterbi" alignment and the target sentence. In contrast, in our approach, a rich overlapping-phrase alignment is produced by a fast deterministic method, while probabilistic decoding infers only the target sentence, which is then able to leverage arbitrary features of the entire source sentence, target sentence and alignment. By using SampleRank for learning we could in principle efficiently estimate hundreds of thousands of parameters. Test-time decoding is done by MCMC sampling with annealing. To demonstrate the potential of our approach we show preliminary experiments leveraging alignments that may contain overlapping bi-phrases.

pdf bib
Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms
Marc Dymetman | Nicola Cancedda
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words
Wilker Aziz | Marc Dymetman | Lucia Specia | Shachar Mirkin
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf abs
A Dataset for Assessing Machine Translation Evaluation Metrics
Lucia Specia | Nicola Cancedda | Marc Dymetman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a dataset containing 16,000 translations produced by four machine translation systems and manually annotated for quality by professional translators. This dataset can be used in a range of tasks assessing machine translation evaluation metrics, from basic correlation analysis to training and test of machine learning-based metrics. By providing a standard dataset for such tasks, we hope to encourage the development of better MT evaluation metrics.

2009

pdf
Complexity-Based Phrase-Table Filtering for Statistical Machine Translation
Nadi Tomeh | Nicola Cancedda | Marc Dymetman
Proceedings of Machine Translation Summit XII: Papers

pdf
Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem
Mikhail Zaslavskiy | Marc Dymetman | Nicola Cancedda
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
Estimating the Sentence-Level Quality of Machine Translation Systems
Lucia Specia | Marco Turchi | Nicola Cancedda | Nello Cristianini | Marc Dymetman
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2008

pdf
Using Syntactic Coupling Features for Discriminating Phrase-Based Translations (WMT-08 Shared Translation Task)
Vassilina Nikoulina | Marc Dymetman
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Experiments in Discriminating Phrase-Based Translations on the Basis of Syntactic Coupling Features
Vassilina Nikoulina | Marc Dymetman
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)

2005

Cet article présente une méthode de traduction automatique statistique basée sur des segments non-continus, c’est-à-dire des segments formés de mots qui ne se présentent pas nécéssairement de façon contiguë dans le texte. On propose une méthode pour produire de tels segments à partir de corpus alignés au niveau des mots. On présente également un modèle de traduction statistique capable de tenir compte de tels segments, de même qu’une méthode d’apprentissage des paramètres du modèle visant à maximiser l’exactitude des traductions produites, telle que mesurée avec la métrique NIST. Les traductions optimales sont produites par le biais d’une recherche en faisceau. On présente finalement des résultats expérimentaux, qui démontrent comment la méthode proposée permet une meilleure généralisation à partir des données d’entraînement.

2003

pdf abs
MDA-XML : une expérience de rédaction contrôlée multilingue basée sur XML
Guy Lapalme | Caroline Brun | Marc Dymetman
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Nous décrivons dans cet article l’implantation d’un système de rédaction contrôlée multilingue dans un environnement XML. Avec ce système, un auteur rédige interactivement un texte se conformant à des règles de bonne formation aux niveaux du contenu sémantique et de la réalisation linguistique décrites par un schéma XML. Nous discutons les avantages de cette approche ainsi que les difficultés rencontrées lors du développement de ce système. Nous concluons avec un exemple d’application à une classe de documents pharmaceutiques.

pdf
Controlled Authoring of Biological Experiment Reports
Caroline Brun | Marc Dymetman | Eric Fanchon | Stanislas Lhomme
Demonstrations

pdf
Towards Interactive Text Understanding
Marc Dymetman | Aurélien Max | Kenji Yamada
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics