2023
Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers
Felix Gaschi, Patricio Cerda, Parisa Rastin, Yannick Toussaint
Findings of the Association for Computational Linguistics: ACL 2023
Without any explicit cross-lingual training data, multilingual language models can achieve cross-lingual transfer. One common way to improve this transfer is to perform realignment steps before fine-tuning, i.e., to train the model to build similar representations for pairs of words from translated sentences. But such realignment methods were found to not always improve results across languages and tasks, which raises the question of whether aligned representations are truly beneficial for cross-lingual transfer. We provide evidence that alignment is actually significantly correlated with cross-lingual transfer across languages, models and random seeds. We show that fine-tuning can have a significant impact on alignment, depending mainly on the downstream task and the model. Finally, we show that realignment can, in some instances, improve cross-lingual transfer, and we identify conditions in which realignment methods provide significant improvements. Namely, we find that realignment works better on tasks for which alignment is correlated with cross-lingual transfer when generalizing to a distant language and with smaller models, as well as when using a bilingual dictionary rather than FastAlign to extract realignment pairs. For example, for POS-tagging, between English and Arabic, realignment can bring a +15.8 accuracy improvement on distilmBERT, even outperforming XLM-R Large by 1.7. We thus advocate for further research on realignment methods for smaller multilingual models as an alternative to scaling.
How Much do Knowledge Graphs Impact Transformer Models for Extracting Biomedical Events?
Laura Zanella, Yannick Toussaint
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
Biomedical event extraction can be divided into three main subtasks: (1) biomedical event trigger detection, (2) biomedical argument identification, and (3) event construction. This work focuses on the first two subtasks. For the first subtask, we analyze a set of transformer language models that are commonly used in the biomedical domain to evaluate and compare their capacity for event trigger detection. We fine-tune the models using seven manually annotated corpora to assess their performance in different biomedical subdomains. SciBERT emerged as the highest-performing model, presenting a slight improvement compared to baseline models. Then, for the second subtask, we construct a knowledge graph (KG) from the biomedical corpora and integrate its KG embeddings into SciBERT to enrich its semantic information. We demonstrate that adding the KG embeddings to the model improves argument identification performance by around 20%, and by around 15% compared to two baseline models. Our results suggest that fine-tuning a transformer model that is pretrained from scratch with biomedical and general data makes it possible to detect event triggers and identify arguments across different biomedical subdomains, thereby improving its generalization. Furthermore, the integration of KG embeddings into the model can significantly improve the performance of biomedical event argument identification, outperforming the results of baseline models.
Multilingual Clinical NER: Translation or Cross-lingual Transfer?
Félix Gaschi, Xavier Fontaine, Parisa Rastin, Yannick Toussaint
Proceedings of the 5th Clinical Natural Language Processing Workshop
Natural language tasks like Named Entity Recognition (NER) in the clinical domain on non-English texts can be very time-consuming and expensive due to the lack of annotated data. Cross-lingual transfer (CLT) is a way to circumvent this issue thanks to the ability of multilingual large language models to be fine-tuned on a specific task in one language and to provide high accuracy for the same task in another language. However, other methods leveraging translation models can be used to perform NER without annotated data in the target language, by translating either the training set or the test set. This paper compares cross-lingual transfer with these two alternative methods for performing clinical NER in French and in German without any training data in those languages. To this end, we release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset. Through extensive experiments on this dataset and on a German medical dataset (Frei and Kramer, 2021), we show that translation-based methods can achieve similar performance to CLT but require more care in their design. And while they can take advantage of monolingual clinical language models, those do not guarantee better results than large general-purpose multilingual models, whether with cross-lingual transfer or translation.
Code-switching as a cross-lingual Training Signal: an Example with Unsupervised Bilingual Embedding
Felix Gaschi, Ilias El-Baamrani, Barbara Gendron, Parisa Rastin, Yannick Toussaint
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
2022
Organizing and Improving a Database of French Word Formation Using Formal Concept Analysis
Nyoman Juniarta, Olivier Bonami, Nabil Hathout, Fiammetta Namer, Yannick Toussaint
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We apply Formal Concept Analysis (FCA) to organize and improve the quality of Démonette2, a French derivational database, by detecting both missing and spurious derivations in the database. We represent each derivational family as a graph. Given that the subgraph relation exists among derivational families, FCA can group families and represent them in a partially ordered set (poset). This poset is also useful for improving the database. A family is regarded as a possible anomaly (meaning that it may have missing and/or spurious derivations) if its derivational graph is almost, but not completely, identical to those of a large number of other families.
2020
Do sentence embeddings capture discourse properties of sentences from Scientific Abstracts ?
Laurine Huber, Chaker Memmadi, Mathilde Dargnat, Yannick Toussaint
Proceedings of the First Workshop on Computational Approaches to Discourse
We introduce four tasks designed to determine which sentence encoders best capture discourse properties of sentences from scientific abstracts, namely coherence and cohesion between clauses of a sentence, and discourse relations within sentences. We show that even if contextual encoders such as BERT or SciBERT encode the coherence in discourse units, they do not help to predict three discourse relations commonly used in scientific abstracts. We discuss what these results underline, namely that these discourse relations are based on particular phrasing that allows non-contextual encoders to perform well.
2019
Aligning Discourse and Argumentation Structures using Subtrees and Redescription Mining
Laurine Huber, Yannick Toussaint, Charlotte Roze, Mathilde Dargnat, Chloé Braud
Proceedings of the 6th Workshop on Argument Mining
In this paper, we investigate similarities between discourse and argumentation structures by aligning subtrees in a corpus containing both annotations. Contrary to previous work, we focus on comparing sub-structures and not only relation matches. Using data mining techniques, we show that discourse and argumentation most often align well, and that the double annotation makes it possible to derive a mapping between structures. Moreover, this approach enables the study of similarities between discourse structures and differences in their expressive power.
2018
Syntax-based Transfer Learning for the Task of Biomedical Relation Extraction
Joël Legrand, Yannick Toussaint, Chedy Raïssi, Adrien Coulet
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis
Transfer learning (TL) proposes to enhance machine learning performance on a problem by reusing labeled data originally designed for a related problem. In particular, domain adaptation consists, for a specific task, in reusing training data developed for the same task but in a distinct domain. This is particularly relevant to applications of deep learning in Natural Language Processing, because those usually require large annotated corpora that may not exist for the targeted domain, but do exist for side domains. In this paper, we experiment with TL for the task of Relation Extraction (RE) from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation by obtaining better performance than the state of the art on two biomedical RE tasks and equal performance on two others, for which little annotated data is available. Furthermore, we propose an analysis of the role that syntactic features may play in TL for RE.
2016
Ambiguity Diagnosis for Terms in Digital Humanities
Béatrice Daille, Evelyne Jacquey, Gaël Lejeune, Luis Felipe Melo, Yannick Toussaint
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Among all the research dedicated to terminology and word sense disambiguation, little attention has been devoted to the ambiguity of term occurrences. Even if a lexical unit is indeed a term of the domain, it is not true, even in a specialised corpus, that all its occurrences are terminological. Some occurrences are terminological and others are not. Thus, a global decision at the corpus level about the terminological status of all occurrences of a lexical unit would be erroneous. In this paper, we propose three original methods to characterise the ambiguity of term occurrences in the domain of social sciences for French. These methods model the context of the term occurrences differently: the first relies on text mining, the second is based on textometry, and the last focuses on text genre properties. The experimental results show the potential of the proposed approaches and provide an opportunity to discuss their hybridisation.
2015
Extracting Disease-Symptom Relationships by Learning Syntactic Patterns from Dependency Graphs
Mohsen Hassan, Olfa Makkaoui, Adrien Coulet, Yannick Toussaint
Proceedings of BioNLP 15
2003
Le traitement automatique de la langue contre les erreurs judiciaires : une méthodologie d’analyse systématique des textes d’un dossier d’instruction
Yannick Toussaint
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
This article presents a method for the systematic and scientific analysis of the documents that make up a judicial investigation file. The aim of this approach is to give the investigating judge new means of assessing the coherence, inconsistencies, stability, or variations in witness statements. This should allow the judge to identify leads for further investigation. We describe the work we carried out on a real case file, then propose a method for analyzing the results.