Julien Gosme


2011

pdf
Structure des trigrammes inconnus et lissage par analogie (Structure of unknown trigrams and smoothing by analogy)
Julien Gosme | Yves Lepage
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous montrons dans une série d’expériences sur quatre langues, sur des échantillons du corpus Europarl, que, dans leur grande majorité, les trigrammes inconnus d’un jeu de test peuvent être reconstruits par analogie avec des trigrammes hapax du corpus d’entraînement. De ce résultat, nous dérivons une méthode de lissage simple pour les modèles de langue par trigrammes et obtenons de meilleurs résultats que les lissages de Witten-Bell, Good-Turing et Kneser-Ney dans des expériences menées en onze langues sur la partie commune d’Europarl, sauf pour le finnois et, dans une moindre mesure, le français.

pdf
Évaluation de G-LexAr pour la traduction automatique statistique (Evaluation of G-Lexar for statistical machine translation)
Wigdan Mekki | Julien Gosme | Fathi Debili | Yves Lepage | Nadine Lucas
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

G-LexAr est un analyseur morphologique de l’arabe qui a récemment reçu des améliorations substantielles. Cet article propose une évaluation de cet analyseur en tant qu’outil de pré-traitement pour la traduction automatique statistique, ce dont il n’a encore jamais fait l’objet. Nous étudions l’impact des différentes formes proposées par son analyse (voyellation, lemmatisation et segmentation) sur un système de traduction arabe-anglais, ainsi que l’impact de la combinaison de ces formes. Nos expériences montrent que l’utilisation séparée de chacune de ces formes n’a que peu d’influence sur la qualité des traductions obtenues, tandis que leur combinaison y contribue de façon très bénéfique.

2010

pdf
The GREYC/LLACAN machine translation systems for the IWSLT 2010 campaign
Julien Gosme | Wigdan Mekki | Fathi Debili | Yves Lepage | Nadine Lucas
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper we explore the contribution of the use of two Arabic morphological analyzers as preprocessing tools for statistical machine translation. Similar investigations have already been reported for morphologically rich languages like German, Turkish and Arabic. Here, we focus on the case of the Arabic language and mainly discuss the use of the G-LexAr analyzer. A preliminary experiment has been designed to choose the most promising translation system among the 3 G-LexAr-based systems, we concluded that the systems are equivalent. Nevertheless, we decided to use the lemmatized output of G-LexAr and use its translations as primary run for the BTEC AE track. The results showed that G-LexAr outputs degrades translation compared to the basic SMT system trained on the un-analyzed corpus.

pdf
Bilingual Lexicon Induction: Effortless Evaluation of Word Alignment Tools and Production of Resources for Improbable Language Pairs
Adrien Lardilleux | Julien Gosme | Yves Lepage
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a simple protocol to evaluate word aligners on bilingual lexicon induction tasks from parallel corpora. Rather than resorting to gold standards, it relies on a comparison of the outputs of word aligners against a reference bilingual lexicon. The quality of this reference bilingual lexicon does not need to be particularly high, because evaluation quality is ensured by systematically filtering this reference lexicon with the parallel corpus the word aligners are trained on. We perform a comparison of three freely available word aligners on numerous language pairs from the Bible parallel corpus (Resnik et al., 1999): MGIZA++ (Gao and Vogel, 2008), BerkeleyAligner (Liang et al., 2006), and Anymalign (Lardilleux and Lepage, 2009). We then select the most appropriate one to produce bilingual lexicons for all language pairs of this corpus. These involve Cebuano, Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swedish, and Vietnamese. The 66 resulting lexicons are made freely available.

2009

pdf
The GREYC translation memory for the IWSLT 2009 evaluation campaign
Yves Lepage | Adrien Lardilleux | Julien Gosme
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This year’s GREYC translation system is an improved translation memory that was designed from scratch to experiment with an approach whose goal is just to improve over the output of a standard translation memory by making heavy use of sub-sentential alignments in a restricted case of translation by analogy. The tracks the system participated in are all BTEC tracks: Arabic to English, Chinese to English, and Turkish to English.

2008

pdf
The GREYC machine translation system for the IWSLT 2008 evaluation campaign.
Yves Lepage | Adrien Lardilleux | Julien Gosme | Jean-Luc Manguin
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This year's GREYC machine translation (MT) system presents three major changes relative to the system presented during the previous campaign, while, of course, remaining a pure example-based MT system that exploits proportional analogies. Firstly, the analogy solver has been replaced with a truly non-deterministic one. Secondly, the engine has been re-engineered and a better control has been introduced. Thirdly, the data used for translation were the data provided by the organizers plus alignments obtained using a new alignment method. This year we chose to have the engine run with the word as the processing unit on the contrary to previous years where the processing unit used to be the character. The tracks the system participated in are all classic BTEC tracks (Arabic-English, Chinese-English and Chinese-Spanish) plus the so-called PIVOT task, where the test set had to be translated from Chinese into Spanish by way of English.