Neural Machine Translation (NMT) has attained state-of-the-art performance on large-scale data. However, it does not achieve the best translation results on small datasets. Example-Based Machine Translation (EBMT) is an approach to machine translation in which existing examples in a database are retrieved and modified to generate new translations. To combine EBMT with NMT, we propose an architecture based on the Transformer model. We conduct two experiments, each using a limited amount of data: one on an English-French bilingual dataset, and the other on a multilingual dataset with six languages (English, French, German, Chinese, Japanese and Russian). On the bilingual task, our method achieves an accuracy of 96.5 and a BLEU score of 98.8. On the multilingual task, it also outperforms OpenNMT in terms of BLEU scores.
This paper presents an approach to enhance the quality of machine translation by leveraging middle sentences as pivot points and employing dual reinforcement learning. Conventional methods for generating parallel sentence pairs for machine translation rely on parallel corpora, which may be scarce, resulting in limitations in translation quality. In contrast, our proposed method entails training two machine translation models in opposite directions, utilizing the middle sentence as a bridge for a virtuous feedback loop between the two models. This feedback loop resembles reinforcement learning, facilitating the models to make informed decisions based on mutual feedback. Experimental results substantiate that our proposed method significantly improves machine translation quality.
This paper introduces a pretrained word embedding for Manipuri, a low-resourced Indian language. The pretrained word embedding, based on FastText, is capable of handling the highly agglutinating language Manipuri (mni). We then perform machine translation (MT) experiments using neural network (NN) models. In this paper, we confirm the following observations. First, the Transformer architecture with the FastText word embedding model EM-FT achieves a better BLEU score than the same architecture without it in all the NMT experiments. Second, we observe that adding training data from a domain different from that of the test data negatively impacts translation accuracy. The resources reported in this paper are made available in the ELRA catalogue to help the low-resourced languages community with MT/NLP tasks.
This article studies the application of the #BenderRule in Natural Language Processing (NLP) papers along a contrastive dimension, by examining the proceedings of two conferences in the field, TALN and ACL, and along a diachronic dimension, by examining these conferences over time. A sample of articles was annotated manually, and two classifiers were developed in order to annotate the remaining articles automatically. We thus quantify the application of the #BenderRule and show a slight advantage in favour of TALN in this respect.
This article studies the application of the #BenderRule in Natural Language Processing (NLP) articles according to two dimensions. Firstly, in a contrastive manner, by considering two major international conferences, LREC and ACL, and secondly, in a diachronic manner, by inspecting nearly 14,000 articles over a period of time ranging from 2000 to 2020 for LREC and from 1979 to 2020 for ACL. For this purpose, we created a corpus from LREC and ACL articles from the above-mentioned periods, from which we manually annotated nearly 1,000. We then developed two classifiers to automatically annotate the rest of the corpus. Our results show that LREC articles tend to respect the #BenderRule (80 to 90% of them respect it), whereas 30 to 40% of ACL articles do not. Interestingly, over the considered periods, the results appear to be stable for the two conferences, even though a rebound in ACL 2020 could be a sign of the influence of the blog post about the #BenderRule.
In this paper, we introduce a sentence-level comparable text corpus crawled and created for the less-resourced language pair Manipuri (mni) and English (eng). Our monolingual corpora comprise 1.88 million Manipuri sentences and 1.45 million English sentences, and our parallel corpus comprises 124,975 Manipuri-English sentence pairs. These data were crawled and collected over a year, from August 2020 to March 2021, from a local newspaper website called ‘The Sangai Express.’ The resources reported in this paper are made available to help the low-resourced languages community with MT/NLP tasks.
The Hamburg Notation System (HamNoSys) was developed for movement annotation of any sign language (SL) and can be used to produce signing animations for a virtual avatar with the JASigning platform. This provides the potential to use HamNoSys, i.e., strings of characters, as a representation of an SL corpus instead of video material. Processing strings of characters instead of images can significantly contribute to sign language research. However, the complexity of HamNoSys makes manual annotation very time-consuming and effortful, so annotation has to be automated. This work proposes a conceptually new approach to this problem. It includes a new tree representation of the HamNoSys grammar that serves as a basis for the generation of grammatical training data and the classification of complex movements using machine learning. Our automatic annotation system relies on the HamNoSys grammar structure and can potentially be used on already existing SL corpora. It is retrainable for specific settings such as camera angles, speed, and gestures. Our approach is conceptually different from other SL recognition solutions and offers a developed methodology for future research.
Standard neural machine translation (NMT) allows a model to perform translation between a pair of languages. Multilingual NMT, on the other hand, allows a model to perform translation between several language pairs, even between language pairs for which no sentence pair has been seen during training (zero-shot translation). This paper presents experiments with zero-shot translation on low-resource Indian languages with a very small amount of data for each language pair. We first report results on data balanced over all considered language pairs. We then expand our experiments for three additional rounds by increasing the training data with 2,000 sentence pairs in each round for some of the language pairs. For Manipuri to Hindi, translation accuracy in Round III of zero-shot translation reaches seven times its score under the balanced data settings.
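The abstract does not detail how a single model covers many translation directions. A common scheme, assumed here for illustration (the function name and `<2xx>` tag convention are ours, not taken from the paper), prepends an artificial target-language token to each source sentence, so that pairing a source language with a target tag never seen together during training requests a zero-shot direction:

```python
def tag_for_target(examples):
    """Prepend an artificial target-language token to each source
    sentence. `examples` is a list of (src, tgt, tgt_lang) triples;
    the <2xx> token names the desired output language, so that one
    model can be trained on many directions at once."""
    return [(f"<2{lang}> {src}", tgt) for src, tgt, lang in examples]
```

At test time, the same tagging applied to an unseen source/target combination yields a zero-shot translation request.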
Cet article propose un modèle de réseau de neurones pour la résolution d’équations analogiques au niveau sémantique et entre phrases dans le cadre de la traduction automatique par l’exemple. Son originalité réside dans le fait qu’il fusionne les deux approches, directe et indirecte, de la traduction par l’exemple.
We present CHARCUT, a character-based machine translation evaluation metric derived from a human-targeted segment difference visualisation algorithm. It combines an iterative search for longest common substrings between the candidate and the reference translation with a simple length-based threshold, enabling loose differences that limit noisy character matches. Its main advantage is to produce scores that directly reflect human-readable string differences, making it a useful support tool for the manual analysis of MT output and its display to end users. Experiments on WMT16 metrics task data show that it is on par with the best “un-trained” metrics in terms of correlation with human judgement, well above BLEU and TER baselines, on both system and segment tasks.
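The core idea of iterative longest-common-substring matching with a length threshold can be sketched as follows. This is a loose stand-in, not the metric's actual definition: the threshold value, the removal strategy, and the final score formula are all assumptions for illustration.

```python
import difflib

def charcut_like(cand, ref, min_match=3):
    """Toy CHARCUT-style similarity: repeatedly remove the longest
    common substring of candidate and reference, counting matched
    characters, until no match of at least `min_match` characters
    remains; return the proportion of matched characters."""
    n = len(cand) + len(ref)
    if n == 0:
        return 1.0
    matched = 0
    while True:
        m = difflib.SequenceMatcher(None, cand, ref, autojunk=False) \
                   .find_longest_match(0, len(cand), 0, len(ref))
        if m.size < min_match:
            break
        matched += m.size
        # Removing the match may create spurious adjacencies; the real
        # metric is more careful, this sketch accepts the approximation.
        cand = cand[:m.a] + cand[m.a + m.size:]
        ref = ref[:m.b] + ref[m.b + m.size:]
    return 2 * matched / n
```

The length threshold is what the abstract calls limiting "noisy character matches": isolated short coincidences between unrelated words never count as matches.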
fast align is a simple and fast word alignment tool which is widely used in state-of-the-art machine translation systems. It yields comparable results in end-to-end translation experiments on various language pairs. However, fast align does not perform as well as GIZA++ when applied to language pairs with distinct word orders, such as English and Japanese. In this paper, given the lexical translation table output by fast align, we propose to realign words using the hierarchical sub-sentential alignment approach. Experimental results show that this simple additional processing improves the performance of word alignment, measured by counting alignment matches against fast align. We also report final machine translation results for both English-Japanese and Japanese-English, and show that our best system provides significant improvements over the baseline as measured by BLEU and RIBES.
Unlike European languages, many Asian languages, such as Chinese and Japanese, do not mark word boundaries typographically in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is therefore normally treated as the first step for machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to segmentation results at different levels of granularity between the two languages. To improve translation accuracy, we adjust and balance the granularity of segmentation results around terms in a Chinese-Japanese patent corpus used for training the translation model. In this paper, we describe a statistical machine translation (SMT) system built on a re-tokenized Chinese-Japanese patent training corpus using extracted bilingual multi-word terms.
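The granularity adjustment around terms is not spelled out in the abstract; one plausible sketch (the greedy longest-first strategy and all names are our assumptions) merges extracted multi-word terms back into single tokens on both sides of the corpus before training, so that terminology is segmented at a comparable granularity:

```python
def merge_terms(tokens, terms):
    """Greedily merge any token sequence that matches an extracted
    multi-word term into one underscore-joined token, trying longer
    terms first, so that both sides of a parallel corpus end up with
    comparable granularity around terminology."""
    by_len = sorted(terms, key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for term in by_len:
            if tuple(tokens[i:i + len(term)]) == term:
                out.append("_".join(term))
                i += len(term)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```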
This paper is a partial report of an ongoing Kakenhi project which aims at improving sub-sentential alignment and releasing multilingual syntactic patterns for statistical and example-based machine translation. Here we focus on improving a sub-sentential aligner which is an instance of the association approach. Phrase tables are not only an essential component of machine translation systems but also an important resource for research and use in other domains. As part of this project, all phrase tables produced in the experiments will also be made freely available.
In a series of experiments on four languages, on samples of the Europarl corpus, we show that the vast majority of the trigrams unseen in a test set can be reconstructed by analogy with hapax trigrams from the training corpus. From this result, we derive a simple smoothing method for trigram language models, and we obtain better results than Witten-Bell, Good-Turing and Kneser-Ney smoothing in experiments carried out in eleven languages on the common section of Europarl, except for Finnish and, to a lesser extent, French.
Sub-sentential alignment consists in extracting translations of textual units below the sentence level from multilingual parallel texts aligned at the sentence level. Such alignments are needed, for instance, to train statistical machine translation systems. The standard approach to this task involves the successive estimation of several probabilistic models of increasing complexity and the use of heuristics that align single words and then, by extension, groups of words. In this article, we consider an alternative approach, initially proposed in (Lardilleux & Lepage, 2008), which relies on a much simpler principle: the comparison of occurrence profiles in subcorpora obtained by sampling. After analysing the strengths and weaknesses of this approach, we show how to improve the detection of long translation units, and we evaluate these improvements on machine translation tasks.
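The sampling principle can be illustrated with a toy sketch (the parameter values, the exact-profile criterion, and the voting scheme are our assumptions, not the cited method): repeatedly draw small random subcorpora and pair any source and target words whose occurrence profiles coincide exactly within a sample.

```python
import random
from collections import Counter

def sample_align(corpus, n_samples=100, sample_size=2, seed=0):
    """Toy sampling-based aligner: in each random subcorpus, a source
    word and a target word occurring in exactly the same sentences
    receive one alignment vote; frequent agreement across samples
    suggests a translation pair. `corpus` is a list of (src, tgt)
    sentence pairs."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        sample = rng.sample(corpus, min(sample_size, len(corpus)))
        src_prof, tgt_prof = {}, {}
        for i, (src, tgt) in enumerate(sample):
            for w in set(src.split()):
                src_prof.setdefault(w, set()).add(i)
            for w in set(tgt.split()):
                tgt_prof.setdefault(w, set()).add(i)
        for ws, ps in src_prof.items():
            for wt, pt in tgt_prof.items():
                if ps == pt:
                    votes[ws, wt] += 1
    return votes
```

On a toy corpus where "cat" and "chat" always co-occur, the pair is voted for in every sample, while words that never co-occur can never share a profile.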
G-LexAr is a morphological analyzer for Arabic which has recently received substantial improvements. This article proposes an evaluation of this analyzer as a preprocessing tool for statistical machine translation, which had never been carried out before. We study the impact of the different forms produced by its analysis (vowelization, lemmatization and segmentation) on an Arabic-English translation system, as well as the impact of combining these forms. Our experiments show that using each of these forms separately has little influence on the quality of the resulting translations, whereas their combination contributes to it very beneficially.
In this paper we explore the contribution of two Arabic morphological analyzers used as preprocessing tools for statistical machine translation. Similar investigations have already been reported for morphologically rich languages like German, Turkish and Arabic. Here, we focus on the case of the Arabic language and mainly discuss the use of the G-LexAr analyzer. A preliminary experiment was designed to choose the most promising translation system among the three G-LexAr-based systems; we concluded that the systems are equivalent. Nevertheless, we decided to use the lemmatized output of G-LexAr and to use its translations as the primary run for the BTEC AE track. The results showed that the G-LexAr output degrades translation quality compared to the basic SMT system trained on the unanalyzed corpus.
Definitions of paraphrase generally favour the preservation of meaning. This article shows by reductio ad absurdum that an evaluation based solely on meaning preservation allows a useless paraphrase generation system to be judged better than a state-of-the-art system. Meaning preservation is therefore not the only criterion for paraphrases. We identify the three objectives of paraphrasing: meaning preservation, naturalness, and adaptation to the task. Paraphrase generation is then a task-dependent trade-off between these three criteria, and all of them must be taken into account in evaluations.
In this paper, we present a simple protocol to evaluate word aligners on bilingual lexicon induction tasks from parallel corpora. Rather than resorting to gold standards, it relies on a comparison of the outputs of word aligners against a reference bilingual lexicon. The quality of this reference bilingual lexicon does not need to be particularly high, because evaluation quality is ensured by systematically filtering this reference lexicon with the parallel corpus the word aligners are trained on. We perform a comparison of three freely available word aligners on numerous language pairs from the Bible parallel corpus (Resnik et al., 1999): MGIZA++ (Gao and Vogel, 2008), BerkeleyAligner (Liang et al., 2006), and Anymalign (Lardilleux and Lepage, 2009). We then select the most appropriate one to produce bilingual lexicons for all language pairs of this corpus. These involve Cebuano, Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swedish, and Vietnamese. The 66 resulting lexicons are made freely available.
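One plausible reading of the filtering step (the function and data shapes below are our assumptions for illustration): keep only those reference entries whose source and target words both appear in an aligned sentence pair of the training corpus, so that the word aligners are never penalised for pairs they could not possibly have learnt.

```python
def filter_reference(lexicon, corpus):
    """Restrict a reference bilingual lexicon to entries observable in
    the parallel corpus: keep (src_word, tgt_word) only if some
    aligned sentence pair contains src_word on the source side and
    tgt_word on the target side."""
    kept = set()
    for s, t in lexicon:
        for src_sent, tgt_sent in corpus:
            if s in src_sent.split() and t in tgt_sent.split():
                kept.add((s, t))
                break  # one witness pair suffices
    return kept
```

This is why, as the abstract notes, the reference lexicon need not be of particularly high quality: entries irrelevant to the training corpus are systematically discarded before evaluation.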
We present anymalign, a sub-sentential aligner for the general public. The quality of its results rivals that of the best tool in the field, GIZA++. It is fast and easy to use, and produces dictionaries and other translation tables in a single command. To our knowledge, it is the only tool in the world capable of aligning any number of languages simultaneously. It is thus the first truly multilingual sub-sentential aligner.
This year’s GREYC translation system is an improved translation memory that was designed from scratch to experiment with an approach whose goal is just to improve over the output of a standard translation memory by making heavy use of sub-sentential alignments in a restricted case of translation by analogy. The tracks the system participated in are all BTEC tracks: Arabic to English, Chinese to English, and Turkish to English.
This paper describes a new alignment method that extracts high quality multi-word alignments from sentence-aligned multilingual parallel corpora. The method can handle several languages at once. The phrase tables obtained by the method have a comparable accuracy and a higher coverage than those obtained by current methods. They are also obtained much faster.
This year's GREYC machine translation (MT) system presents three major changes relative to the system presented during the previous campaign, while, of course, remaining a pure example-based MT system that exploits proportional analogies. Firstly, the analogy solver has been replaced with a truly non-deterministic one. Secondly, the engine has been re-engineered and better control has been introduced. Thirdly, the data used for translation were the data provided by the organizers plus alignments obtained using a new alignment method. This year we chose to have the engine run with the word as the processing unit, contrary to previous years, when the processing unit was the character. The tracks the system participated in are all classic BTEC tracks (Arabic-English, Chinese-English and Chinese-Spanish) plus the so-called PIVOT task, where the test set had to be translated from Chinese into Spanish by way of English.
The GREYC machine translation (MT) system is a slight evolution of the ALEPH machine translation system that participated in the IWSLT 2005 campaign. It is a pure example-based MT system that exploits proportional analogies. The training data used for this campaign were limited on purpose to the sole data provided by the organizers. However, the training data were expanded with the results of sub-sentential alignments. The system participated in the two classical tasks of translating manually transcribed texts, from Japanese to English and from Arabic to English.
We place ourselves here within the current corpus-based trend in natural language processing, and also in a perspective that can be described as a least-effort approach: the aim is to examine the limits of what can be processed starting from raw, i.e. unpreprocessed, textual data. The theoretical question in the background is the following: what are the fundamental operations in language? Proportional analogy has been mentioned by many grammarians and linguists. We propose to show the effectiveness of such an operation by testing it on a hard task in natural language processing: machine translation. We also show the positive consequences of formalising such an operation, with theoretical results in formal language theory relating to its adequacy for the description of languages. In this way, a fundamental operation in language, proportional analogy, is illustrated both by its theoretical aspects and by its performance in practice.
We designed, implemented and assessed an EBMT system that can be dubbed the “purest ever built”: it strictly does not make any use of variables, templates or training, does not have any explicit transfer component, and does not require any preprocessing of the aligned examples. It uses a specific operation, namely proportional analogy, that implicitly neutralises divergences between languages and captures lexical and syntactical variations along the paradigmatic and syntagmatic axes without explicitly decomposing sentences into fragments. In an experiment with a test set of 510 input sentences and an unprocessed corpus of almost 160,000 aligned sentences in Japanese and English, we obtained BLEU, NIST and mWER scores of 0.53, 8.53 and 0.39 respectively, well above a baseline simulating a translation memory.
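To illustrate the kind of equation such a system solves, here is a toy solver for the analogical equation a : b :: c : x, restricted to pure prefix or suffix alternations. The actual system handles far richer commutations along both axes; the restriction and all names here are ours.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def solve_analogy(a, b, c):
    """Solve a : b :: c : x for pure suffix or prefix alternations,
    e.g. walk : walked :: talk : x  ->  talked."""
    # Suffix alternation: a = stem + s1, b = stem + s2.
    k = lcp_len(a, b)
    s1, s2 = a[k:], b[k:]
    if c.endswith(s1):
        return c[:len(c) - len(s1)] + s2
    # Prefix alternation: a = p1 + stem, b = p2 + stem.
    k = lcp_len(a[::-1], b[::-1])
    p1, p2 = a[:len(a) - k], b[:len(b) - k]
    if c.startswith(p1):
        return p2 + c[len(p1):]
    return None  # outside the fragment this toy solver covers
```

Applied between aligned sentences rather than words, the same schema lets new translations be assembled from existing examples without any explicit decomposition into fragments.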
We inspect the possibility of creating new linguistic utterances (small sentences) similar to those already present in an existing linguistic resource. Using paradigm tables ensures that the new generated sentences resemble previous data, while being of course different. We report an experiment in which 1,201 new correct sentences were generated starting from only 22 seed sentences.
Treebank construction is a heavy, time-consuming undertaking. To facilitate it, we view the construction of a treebank as a series of editing and search operations. The goal of this article is to estimate the effort, in number of editing operations, needed to add a new sentence to the treebank. We propose a tool, Boardedit, which includes a tree editor and parsing aids. Since the effort required naturally depends on the quality of the answers provided by the parsing aids, this effort can be seen as a measure of the quality of these aids. As the tree editor remains indispensable to our tool throughout the experiment, the parsing aids are always used in combination with the tree editor. In the experiment reported here, we extend a treebank of 5,000 sentences with 1,553 new sentences. The reduction obtained exceeds 4/5 of the effort.
The generativist argumentation against analogy rested on three points: the innateness hypothesis, the context-free hypothesis, and overgeneration. Theoretical and experimental results based on a new computational formulation of analogy contribute constructively to the refutation of these points.