Amir Hazem


2022

Cross-lingual and Cross-domain Transfer Learning for Automatic Term Extraction from Low Resource Data
Amir Hazem | Merieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Automatic Term Extraction (ATE) is a key component of domain knowledge understanding and an important basis for further natural language processing applications. Despite persistent improvements, ATE still exhibits weak results, exacerbated by the small training data inherent to specialized-domain corpora. Recently, transformer-based deep neural models, such as BERT, have proven to be efficient in many downstream NLP tasks. However, no systematic evaluation of such models for ATE has been conducted so far. In this paper, we run an extensive study on fine-tuning pre-trained BERT models for ATE. We propose strategies that empirically show BERT’s effectiveness using cross-lingual and cross-domain transfer learning to extract single- and multi-word terms. Experiments have been conducted on four specialized domains in three languages. The obtained results suggest that BERT can capture cross-domain and cross-lingual terminologically-marked contexts shared by terms, opening a new design pattern for ATE.
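As a concrete illustration of the fine-tuning setup described above, the following minimal sketch casts term extraction as BIO token classification with a pre-trained BERT model; the checkpoint name, label set and example sentence are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: term extraction as token classification with BERT.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-TERM", "I-TERM"]                # assumed BIO tag set for term spans
model_name = "bert-base-multilingual-cased"       # assumed multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Tag a pre-tokenised sentence; before fine-tuning the predictions are essentially random.
words = ["Laparoscopic", "cholecystectomy", "reduces", "hospital", "stay", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
pred_ids = model(**enc).logits.argmax(-1).squeeze(0).tolist()
print([labels[i] for i in pred_ids])              # one tag per sub-word token (incl. [CLS]/[SEP])
```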

2020

Towards Automatic Thesaurus Construction and Enrichment.
Amir Hazem | Beatrice Daille | Claudia Lanza
Proceedings of the 6th International Workshop on Computational Terminology

Thesaurus construction with minimum human effort often relies on automatic methods to discover terms and their relations. Hence, the quality of a thesaurus heavily depends on the chosen methodologies for: (i) building its content (terminology extraction task) and (ii) designing its structure (semantic similarity task). Existing methods for automatic thesaurus construction are still less accurate than handcrafted thesauri, so it is important to highlight their drawbacks in order to let new strategies build more accurate thesaurus models. In this paper, we provide a systematic analysis of existing methods for both tasks and discuss their feasibility based on an Italian Cybersecurity corpus. In particular, we provide a detailed analysis of how the semantic relationship network of a thesaurus can be automatically built, and investigate ways to enrich the terminological scope of a thesaurus by taking into account the information contained in external domain-oriented semantic sets.

TermEval 2020: TALN-LS2N System for Automatic Term Extraction
Amir Hazem | Mérieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the 6th International Workshop on Computational Terminology

Automatic terminology extraction is a notoriously difficult task that aims to ease the effort of manually identifying terms in domain-specific corpora by automatically providing a ranked list of candidate terms. Existing approaches to this task fall into four main categories: (i) rule-based approaches, (ii) feature-based approaches, (iii) context-based approaches, and (iv) hybrid approaches. For this first TermEval shared task, we explore a feature-based approach and a deep neural network approach based on BERT, which we fine-tune for term extraction. We show that BERT models (RoBERTa for English and CamemBERT for French) outperform the other systems for both English and French.

TALN/LS2N Participation at the BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Martin Laville | Amir Hazem | Emmanuel Morin
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

This paper describes the TALN/LS2N system participation in the Building and Using Comparable Corpora (BUCC) shared task. We first introduce three strategies: (i) a word embedding approach based on fastText embeddings; (ii) a concatenation approach using both character Skip-Gram and character CBOW models; and (iii) a cognate-matching approach based on exact string matching. We then present the strategy applied for the shared task, which combines the embedding-concatenation and cognate-matching approaches. The covered languages are French, English, German, Russian and Spanish. Overall, our system mixing embedding concatenation and perfect cognate matching obtained the best results compared to the individual strategies, except for the English-Russian and Russian-English language pairs, for which the concatenation approach was preferred.
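To make the two combined steps concrete, here is a hedged sketch of embedding concatenation followed by exact cognate matching; it assumes the character Skip-Gram and CBOW vectors of both languages have already been projected into a shared cross-lingual space, and all names are illustrative rather than the TALN/LS2N release.

```python
# Illustrative sketch only: combine two embedding spaces and fall back on
# exact cognate matching before nearest-neighbour translation retrieval.
import numpy as np

def concat_vec(word, skipgram, cbow):
    # skipgram / cbow: dicts mapping word -> 1-D numpy vector (assumed aligned spaces)
    return np.concatenate([skipgram[word], cbow[word]])

def translate(src_word, src_sg, src_cbow, tgt_sg, tgt_cbow):
    tgt_vocab = set(tgt_sg) & set(tgt_cbow)
    if src_word in tgt_vocab:              # perfect cognate: identical spelling
        return src_word
    query = concat_vec(src_word, src_sg, src_cbow)
    best, best_sim = None, -1.0
    for cand in tgt_vocab:                 # cosine similarity in the concatenated space
        vec = concat_vec(cand, tgt_sg, tgt_cbow)
        sim = float(query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec) + 1e-9)
        if sim > best_sim:
            best, best_sim = cand, sim
    return best
```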

Books of Hours. the First Liturgical Data Set for Text Segmentation.
Amir Hazem | Beatrice Daille | Christopher Kermorvant | Dominique Stutzmann | Marie-Laurence Bonhomme | Martin Maarand | Mélodie Boillet
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, their often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) their hierarchical entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, their textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by Handwritten Text Recognition (HTR), that is, the equivalent of Optical Character Recognition (OCR) for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.

Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora
Martin Laville | Amir Hazem | Emmanuel Morin | Philippe Langlais
Proceedings of the 28th International Conference on Computational Linguistics

Narrow specialized comparable corpora are often small in size. This particularity makes it difficult to build efficient models to acquire translation equivalents, especially for less frequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment was roughly conducted by adding out-of-domain data, with no particular attention given to which words to enrich and how to do it optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely TF-IDF and cross-entropy. We then propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best-performing technique is cross-entropy selection, which obtains a gain of about 4 MAP points while decreasing computation time by a factor of 10.
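The cross-entropy criterion mentioned above is typically implemented in machine translation as a cross-entropy difference in the style of Moore and Lewis (2010). The sketch below uses add-one-smoothed unigram language models purely for illustration; the exact models and thresholds used in the paper may differ.

```python
# Hedged sketch of cross-entropy-difference data selection with unigram LMs.
import math
from collections import Counter

def unigram_lm(sentences):
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())

def cross_entropy(sentence, lm, vocab_size):
    counts, total = lm
    words = sentence.split() or ["<empty>"]
    return -sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words) / len(words)

def select_in_domain_like(out_domain, in_domain, keep_ratio=0.1):
    lm_in, lm_out = unigram_lm(in_domain), unigram_lm(out_domain)
    vocab = len(set(lm_in[0]) | set(lm_out[0]))
    # Lower (H_in - H_out) means the sentence looks more in-domain than generic.
    scored = sorted(out_domain,
                    key=lambda s: cross_entropy(s, lm_in, vocab) - cross_entropy(s, lm_out, vocab))
    return scored[: max(1, int(len(scored) * keep_ratio))]
```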

Hierarchical Text Segmentation for Medieval Manuscripts
Amir Hazem | Beatrice Daille | Dominique Stutzmann | Christopher Kermorvant | Louis Chevalier
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we address the segmentation of books of hours, Latin devotional manuscripts of the late Middle Ages, which exhibit challenging issues: a complex hierarchical entangled structure, variable content, noisy transcriptions with no sentence markers, and strong correlations between sections for which topical information is no longer sufficient to draw segmentation boundaries. We show that the main state-of-the-art segmentation methods are either inefficient or inapplicable to books of hours, and propose a bottom-up greedy approach that considerably enhances the segmentation results. We stress the importance of such a hierarchical segmentation of books of hours for historians to explore the overarching differences in their underlying conception of the Church.

2019

Tweaks and Tricks for Word Embedding Disruptions
Amir Hazem | Nicolas Hernandez
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word embeddings are established as very effective models used in several NLP applications. Although they differ in their architecture and training process, they often exhibit similar properties and remain vector space models with continuously-valued dimensions describing the observed data. The complexity resides in the strategies developed for learning the values within each dimension. In this paper, we introduce the concept of disruption, which we define as a side effect of the training process of embedding models. Disruptions are viewed as a set of embedding values that are more likely to be noise than effective descriptive features. We show that dealing with the disruption phenomenon greatly benefits bottom-up sentence embedding representations. By contrasting several in-domain and pre-trained embedding models, we propose two simple but very effective tweaking techniques that yield strong empirical improvements on textual similarity tasks.

Meta-Embedding Sentence Representation for Textual Similarity
Amir Hazem | Nicolas Hernandez
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word embedding models are now widely used in most NLP applications. Despite their effectiveness, there is no clear evidence about which model is the most appropriate; the choice often depends on the nature of the task and on the quality and size of the data sets used. This remains true for bottom-up sentence embedding models, yet no straightforward investigation has been conducted so far. In this paper, we propose a systematic study of the impact of the main word embedding models on sentence representation. By contrasting in-domain and pre-trained embedding models, we show under which conditions they can be jointly used for bottom-up sentence embeddings. Finally, we propose the first bottom-up meta-embedding representation at the sentence level for textual similarity. Significant improvements are observed in several tasks, including question-to-question similarity, paraphrasing and next-utterance ranking.
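A bottom-up sentence meta-embedding of the kind discussed above can be sketched as follows: average each model's word vectors over the sentence, then concatenate the per-model averages. The recipe below is an assumed simplification (no weighting or dimensionality reduction), not the paper's exact method.

```python
# Assumed simplification of a bottom-up sentence meta-embedding.
import numpy as np

def sentence_vector(sentence, word_vectors, dim):
    # word_vectors: dict word -> 1-D numpy vector; out-of-vocabulary words are skipped
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def meta_sentence_embedding(sentence, models):
    # models: list of (word_vectors, dim) pairs, e.g. in-domain and pre-trained embeddings
    return np.concatenate([sentence_vector(sentence, wv, d) for wv, d in models])

def cosine(a, b):
    # similarity between two meta-embedded sentences
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
```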

Réutilisation de Textes dans les Manuscrits Anciens (Text Reuse in Ancient Manuscripts)
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

In this paper, we address the problem of text reuse in liturgical books of the Middle Ages. More specifically, we study the textual variants of the Obsecro Te prayer, which is often present in books of hours. Manual inspection of 772 copies of the Obsecro Te revealed the existence of more than 21,000 textual variants. In order to extract and categorize them automatically, we first propose a lexico-semantic classification at the word n-gram level, and then report the performance of several state-of-the-art approaches for the automatic matching of textual variants of the Obsecro Te.

Towards Automatic Variant Analysis of Ancient Devotional Texts
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

In this paper, we address the issue of text reuse in liturgical manuscripts of the Middle Ages. More specifically, we study variant readings of the Obsecro Te prayer, part of the devotional Books of Hours often used by Christians as guidance for their daily prayers. We aim at automatically extracting and categorising pairs of words and expressions that exhibit variant relations. For this purpose, we adopt a linguistic classification that allows us to characterize the variants better than edit operations do. Then, we study the evolution of Obsecro Te texts along temporal and geographical axes. Finally, we contrast several unsupervised state-of-the-art approaches for the automatic extraction of Obsecro Te variants. Based on the manual observation of 772 Obsecro Te copies, which show more than 21,000 variants, we show that the proposed methodology is helpful for an automatic study of variants and may serve as a basis for analyzing and extracting useful information from devotional texts.

2018

Word Embedding Approach for Synonym Extraction of Multi-Word Terms
Amir Hazem | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

PyRATA, Python Rule-based feAture sTructure Analysis
Nicolas Hernandez | Amir Hazem
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Multi-Domain Framework for Textual Similarity. A Case Study on Question-to-Question and Question-Answering Similarity Tasks
Amir Hazem | Basma El Amal Boussaha | Nicolas Hernandez
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Leveraging Meta-Embeddings for Bilingual Lexicon Extraction from Specialized Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of the 27th International Conference on Computational Linguistics

Recent evaluations of bilingual lexicon extraction from specialized comparable corpora have shown contrasting performance when using word embedding models. This can be partially explained by the lack of large specialized comparable corpora to build efficient representations. Within this context, we try to answer the following questions. First, (i) among the state-of-the-art embedding models, whether trained on specialized corpora or pre-trained on large general data sets, which one is the most appropriate for bilingual terminology extraction? Second, (ii) is it worthwhile to combine multiple embeddings trained on different data sets? For that purpose, we propose the first systematic evaluation of different word embedding models for bilingual terminology extraction from specialized comparable corpora. We emphasize how the character-based embedding model outperforms the other models on the quality of the extracted bilingual lexicons. Furthermore, we propose a new, efficient way to combine different embedding models learned from specialized and general-domain data sets. Our approach leads to higher performance than the best individual embedding model.

2017

Bilingual Word Embeddings for Bilingual Terminology Extraction from Specialized Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Bilingual lexicon extraction from comparable corpora is constrained by the small amount of available data when dealing with specialized domains. This penalizes the performance of distributional approaches, which is closely related to the reliability of word co-occurrence counts extracted from comparable corpora. A solution to avoid this limitation is to associate external resources with the comparable corpus. Since bilingual word embeddings have recently proven to be efficient models for learning bilingual distributed representations of words, we explore different word embedding models and show how a general-domain comparable corpus can enrich a specialized comparable corpus via neural networks.

MappSent at IJCNLP-2017 Task 5: A Textual Similarity Approach Applied to Multi-choice Question Answering in Examinations
Amir Hazem
Proceedings of the IJCNLP 2017, Shared Tasks

In this paper, we present MappSent, a textual similarity approach that we applied to the multi-choice question answering in examinations shared task. MappSent was initially proposed for question-to-question similarity (Hazem et al., 2017). In this work, we present the results of two adaptations of MappSent for the question answering task on the English dataset.

MappSent: a Textual Mapping Approach for Question-to-Question Similarity
Amir Hazem | Basma El Amel Boussaha | Nicolas Hernandez
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Since the advent of word embedding methods, the representation of longer pieces of text such as sentences and paragraphs has been gaining more and more interest, especially for textual similarity tasks. Mikolov et al. (2013) demonstrated that words and phrases exhibit linear structures that allow words to be meaningfully combined by an element-wise addition of their vector representations. Recently, Arora et al. (2017) showed that removing the projections of the weighted average of word embedding vectors onto their first principal component outperforms sophisticated supervised methods, including RNNs and LSTMs. Inspired by the findings of Mikolov et al. (2013) and Arora et al. (2017), and by a bilingual word mapping technique presented in Artetxe et al. (2016), we introduce MappSent, a novel approach for textual similarity. Based on a linear sentence embedding representation, its principle is to build a matrix that maps sentences into a joint subspace where similar sets of sentences are pushed closer together. We evaluate our approach on the SemEval 2016/2017 question-to-question similarity task and show that MappSent achieves competitive results overall and in most cases outperforms state-of-the-art methods.
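The following sketch puts together the three ingredients named in the abstract: SIF-style weighted averaging (Arora et al., 2017), removal of the first principal component, and a learned linear mapping in the spirit of Artetxe et al. (2016). It is a rough approximation under those assumptions, not the authors' MappSent implementation.

```python
# Rough approximation of the ingredients behind MappSent (not the released code).
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """Weighted average of word vectors with weight a/(a + p(w)), then
    removal of the projection onto the first principal component."""
    dim = len(next(iter(word_vecs.values())))
    total = sum(word_freq.values())
    rows = []
    for s in sentences:
        words = [w for w in s.split() if w in word_vecs]
        if not words:
            rows.append(np.zeros(dim))
            continue
        weights = np.array([a / (a + word_freq.get(w, 1) / total) for w in words])
        rows.append((weights[:, None] * np.array([word_vecs[w] for w in words])).mean(axis=0))
    X = np.vstack(rows)
    u = np.linalg.svd(X, full_matrices=False)[2][0]   # first principal direction
    return X - X @ np.outer(u, u)

def learn_mapping(X_questions, X_paired_questions):
    # Least-squares mapping that pushes paired (similar) sentences closer,
    # loosely following the mapping idea of Artetxe et al. (2016).
    W, *_ = np.linalg.lstsq(X_questions, X_paired_questions, rcond=None)
    return W                                          # map with X @ W before cosine comparison
```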

2016

Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis
Amir Hazem | Béatrice Daille
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Bilingual lexicon extraction from comparable corpora is usually based on distributional methods when dealing with single-word terms (SWT). These methods often treat SWT as single tokens without considering their compositional property. However, many SWT are compositional (composed of roots and affixes), and this information, if taken into account, can be very useful for matching translation pairs, especially for infrequent terms where distributional methods often fail. For instance, the English compound xenograft, which is composed of the root xeno and the lexeme graft, can be translated into French compositionally by aligning each of its elements (xeno with xéno and graft with greffe), resulting in the translation xénogreffe. In this paper, we experiment with several distributional models at the morpheme level and apply them to the compositional translation of a subset of French and English compounds. We show promising results using distributional analysis at the root and affix levels. We also show that the adapted approach significantly improves bilingual lexicon extraction from comparable corpora compared to the word-level approach.
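The abstract's own example can be made concrete with a toy sketch of the recomposition step; the morpheme alignment tables below are hypothetical stand-ins for the alignments obtained by morpheme-level distributional analysis.

```python
# Toy illustration of compositional translation from aligned morphemes.
root_alignments = {"xeno": "xéno", "hydro": "hydro"}            # hypothetical root alignments
lexeme_alignments = {"graft": "greffe", "therapy": "thérapie"}  # hypothetical lexeme alignments

def translate_compound(root, lexeme):
    """Translate a root+lexeme compound by aligning each element separately."""
    if root in root_alignments and lexeme in lexeme_alignments:
        return root_alignments[root] + lexeme_alignments[lexeme]
    return None  # fall back to word-level distributional methods

print(translate_compound("xeno", "graft"))  # -> xénogreffe
```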

Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models
Amir Hazem | Emmanuel Morin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

There is a rich flora of word space models that have proven their efficiency in many different applications including information retrieval (Dumais, 1988), word sense disambiguation (Schutze, 1992), various semantic knowledge tests (Lund et al., 1995; Karlgren, 2001), and text categorization (Sahlgren, 2005). Based on the assumption that each model captures some aspects of word meanings and provides its own empirical evidence, we present in this paper a systematic exploration of the principal corpus-based word space models for bilingual terminology extraction from comparable corpora. We find that, once we have identified the best procedures, a very simple combination approach leads to significant improvements compared to individual models.

Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Comparable corpora are the main alternative to the use of parallel corpora to extract bilingual lexicons. Although it is easier to build comparable corpora, specialized comparable corpora are often of modest size in comparison with corpora from the general domain. Consequently, the observations of word co-occurrences, which are the basis of context-based methods, are unreliable. In this article, we propose to improve the word co-occurrences of specialized comparable corpora, and thus context representation, by using general-domain data. This idea, which has already been used in machine translation for more than a decade, is not straightforward for the task of bilingual lexicon extraction from specific-domain comparable corpora. We go against the mainstream view for this task, where many studies support the idea that adding out-of-domain documents decreases the quality of lexicons. Our empirical evaluation shows the advantages of this approach, which induces a significant gain in the accuracy of extracted lexicons.
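One simple way to realise the idea of enriching specialised co-occurrences with general-domain data is to interpolate the two count tables before building context vectors; the sketch below is an assumed formulation for illustration, not necessarily the exact combination used in the paper.

```python
# Assumed illustration: interpolate specialised and general-domain co-occurrence counts.
from collections import Counter

def combine_cooccurrences(spec_cooc, gen_cooc, lam=0.5):
    # spec_cooc / gen_cooc: Counter over (word, context_word) pairs
    combined = Counter()
    for pair in set(spec_cooc) | set(gen_cooc):
        combined[pair] = spec_cooc.get(pair, 0) + lam * gen_cooc.get(pair, 0)
    return combined  # used in place of spec_cooc when building context vectors
```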

2015

Continuous Adaptation to User Feedback for Statistical Machine Translation
Frédéric Blain | Fethi Bougares | Amir Hazem | Loïc Barrault | Holger Schwenk
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

LINA: Identifying Comparable Documents from Wikipedia
Emmanuel Morin | Amir Hazem | Florian Boudin | Elizaveta Loginova-Clouet
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

2014

Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
Emmanuel Morin | Amir Hazem
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Semi-compositional Method for Synonym Extraction of Multi-Word Terms
Béatrice Daille | Amir Hazem
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Automatic extraction of synonyms and semantically related words is a challenging task, useful in many NLP applications such as question answering, search query expansion, text summarization, etc. While different studies have addressed the task of word synonym extraction, only a few investigations have tackled the problem of acquiring synonyms of multi-word terms (MWT) from specialized corpora. To extract pairs of synonymous multi-word terms, we propose in this paper an unsupervised semi-compositional method that makes use of distributional semantics and exploits the compositional property shared by most MWT. We show that our method significantly outperforms the state-of-the-art.
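As a hedged sketch of what a semi-compositional strategy can look like, the function below keeps one component of a two-word term and swaps the other for its distributional neighbours, keeping only candidates attested in the term list; the data structures and example are hypothetical, not the paper's exact procedure.

```python
# Hypothetical sketch of semi-compositional synonym candidates for two-word terms.
def synonym_candidates(term, neighbours, known_terms):
    # term: (modifier, head); neighbours: word -> list of distributionally similar words
    modifier, head = term
    candidates = {(m, head) for m in neighbours.get(modifier, [])}
    candidates |= {(modifier, h) for h in neighbours.get(head, [])}
    return sorted(c for c in candidates if c in known_terms)  # keep attested terms only

neighbours = {"neoplasm": ["tumour", "growth"]}
known_terms = {("malignant", "tumour"), ("malignant", "neoplasm")}
print(synonym_candidates(("malignant", "neoplasm"), neighbours, known_terms))
# -> [('malignant', 'tumour')]
```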

2013

A Comparison of Smoothing Techniques for Bilingual Lexicon Extraction from Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

Bilingual Lexicon Extraction from Comparable Corpora by Combining Contextual Representations (Extraction de lexiques bilingues à partir de corpus comparables par combinaison de représentations contextuelles) [in French]
Amir Hazem | Emmanuel Morin
Proceedings of TALN 2013 (Volume 1: Long Papers)

Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Adaptive Dictionary for Bilingual Lexicon Extraction from Comparable Corpora
Amir Hazem | Emmanuel Morin
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

One of the main resources used for the task of bilingual lexicon extraction from comparable corpora is the bilingual dictionary, which is considered a bridge between two languages. However, no particular attention has been given to this lexicon, apart from its coverage and the fact that it can be issued from the general language, the specialised one, or a mix of both. In this paper, we want to highlight the idea that a better consideration of the bilingual dictionary, by studying its entries and filtering out the non-useful ones, leads to better lexicon extraction and thus to higher precision. The experiments are conducted on a medical-domain corpus: the French-English specialised 'breast cancer' corpus of 1 million words. We show that the empirical results obtained with our filtering process improve the standard approach traditionally dedicated to this task and are promising for future work.

Participation du LINA à DEFT2012 (LINA at DEFT2012) [in French]
Florian Boudin | Amir Hazem | Nicolas Hernandez | Prajol Shrestha
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)

2011

Degré de comparabilité, extraction lexicale bilingue et recherche d’information interlingue (Degree of comparability, bilingual lexical extraction and cross-language information retrieval)
Bo Li | Eric Gaussier | Emmanuel Morin | Amir Hazem
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we study the problem of the comparability of the documents composing a comparable corpus, in order to improve the quality of the extracted bilingual lexicons and the performance of cross-language information retrieval systems. We propose a new approach that guarantees a certain degree of comparability and homogeneity of the corpus while preserving a large part of the vocabulary of the original corpus. Our experiments show that the bilingual lexicons we obtain are of better quality than those obtained with previous approaches, and that they can be used to significantly improve cross-language information retrieval systems.

Métarecherche pour l’extraction lexicale bilingue à partir de corpus comparables (Metasearch for bilingual lexical extraction from comparable corpora)
Amir Hazem | Emmanuel Morin | Sebastián Peña Saldarriaga
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we present a new way of approaching the problem of automatically acquiring pairs of words in a translation relation from comparable corpora. We first describe the standard approach and the cross-language similarity approach traditionally dedicated to this task. We then reinterpret the cross-language similarity method and motivate a new model that reformulates this approach, inspired by information retrieval metasearch engines. The empirical results we obtain show that the performance of our model is always higher than that obtained with the cross-language similarity approach, while also being competitive with the standard approach.

Bilingual Lexicon Extraction from Comparable Corpora as Metasearch
Amir Hazem | Emmanuel Morin | Sebastian Peña Saldarriaga
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web