2024
Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains
Vincent Segonne
|
Aidan Mannion
|
Laura Cristina Alonzo Canul
|
Alexandre Daniel Audibert
|
Xingyu Liu
|
Cécile Macaire
|
Adrien Pupier
|
Yongxin Zhou
|
Mathilde Aguiar
|
Felix E. Herron
|
Magali Norré
|
Massih R Amini
|
Pierrette Bouillon
|
Iris Eshkol-Taravella
|
Emmanuelle Esperança-Rodier
|
Thomas François
|
Lorraine Goeuriot
|
Jérôme Goulian
|
Mathieu Lafourcade
|
Benjamin Lecouteux
|
François Portet
|
Fabien Ringeval
|
Vincent Vandeghinste
|
Maximin Coavoux
|
Marco Dinarelli
|
Didier Schwab
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise the models' utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, and single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as the contributed datasets.
Jargon : Une suite de modèles de langues et de référentiels d’évaluation pour les domaines spécialisés du français
Vincent Segonne
|
Aidan Mannion
|
Laura Alonzo-Canul
|
Alexandre Audibert
|
Xingyu Liu
|
Cécile Macaire
|
Adrien Pupier
|
Yongxin Zhou
|
Mathilde Aguiar
|
Felix Herron
|
Magali Norré
|
Massih-Reza Amini
|
Pierrette Bouillon
|
Iris Eshkol Taravella
|
Emmanuelle Esperança-Rodier
|
Thomas François
|
Lorraine Goeuriot
|
Jérôme Goulian
|
Mathieu Lafourcade
|
Benjamin Lecouteux
|
François Portet
|
Fabien Ringeval
|
Vincent Vandeghinste
|
Maximin Coavoux
|
Marco Dinarelli
|
Didier Schwab
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiés
Pretrained language models (PLMs) are now the de facto backbone of most natural language processing systems. In this paper, we present Jargon, a family of PLMs for specialized domains of French, focusing on three domains: transcribed speech, the clinical/biomedical domain, and the legal domain. We use a transformer architecture based on computationally efficient methods (LinFormer), since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a varied set of tasks and evaluation corpora, some of which are introduced in our paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued self-supervised pretraining on the specialized data, pretraining from scratch, and single- and multi-domain pretraining. Our extensive experiments in specialized domains show that competitive downstream performance can be attained even when pretraining with LinFormer's approximate attention mechanism. For full reproducibility, we release the models and pretraining data, as well as the corpora used.
2023
PROPICTO: Developing Speech-to-Pictograph Translation Systems to Enhance Communication Accessibility
Lucía Ormaechea
|
Pierrette Bouillon
|
Maximin Coavoux
|
Emmanuelle Esperança-Rodier
|
Johanna Gerlach
|
Jérôme Goulian
|
Benjamin Lecouteux
|
Cécile Macaire
|
Jonathan Mutal
|
Magali Norré
|
Adrien Pupier
|
Didier Schwab
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
PROPICTO is a project funded by the French National Research Agency and the Swiss National Science Foundation that aims to create Speech-to-Pictograph translation systems, with a special focus on French as an input language. By developing such technologies, we intend to enhance communication access for non-French-speaking patients and people with cognitive impairments.
Annotation Linguistique pour l’Évaluation de la Simplification Automatique de Textes
Rémi Cardon
|
Adrien Bibal
|
Rodrigo Wilkens
|
David Alfter
|
Magali Norré
|
Adeline Müller
|
Patrick Watrin
|
Thomas François
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale
Evaluating automatic text simplification (ATS) systems is a difficult task, carried out using automatic metrics and human judgment. However, from a linguistic point of view, it is not clear what is concretely being evaluated. We propose to annotate one of the reference corpora for ATS, ASSET, which we use to clarify this question. In addition to the contribution that the annotated resource represents, we show how it can be used to analyze the behavior of SARI, the most popular evaluation measure in ATS. We present our conclusions as a step toward improving ATS evaluation protocols in the future.
Word Sense Disambiguation for Automatic Translation of Medical Dialogues into Pictographs
Magali Norré
|
Rémi Cardon
|
Vincent Vandeghinste
|
Thomas François
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Word sense disambiguation is an NLP task embedded in different applications. We propose to evaluate its contribution to the automatic translation of French texts into pictographs, in the context of communication between doctors and patients with an intellectual disability. Different general and/or medical language models (Word2Vec, fastText, CamemBERT, FlauBERT, DrBERT, and CamemBERT-bio) are tested in order to choose semantically correct pictographs leveraging the synsets in the French WordNets (WOLF and WoNeF). The results of our automatic evaluations show that our method based on Word2Vec and fastText significantly improves the precision of medical translations into pictographs. We also present an evaluation corpus adapted to this task.
2022
Investigating the Medical Coverage of a Translation System into Pictographs for Patients with an Intellectual Disability
Magali Norré
|
Vincent Vandeghinste
|
Thomas François
|
Pierrette Bouillon
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)
Communication between physicians and patients can lead to misunderstandings, especially for disabled people. An automatic system that translates natural language into a pictographic language is one of the solutions that could help to overcome this issue. In this preliminary study, we present the French version of a translation system using the Arasaac pictographs and we investigate the strategies used by speech therapists to translate into pictographs. We also evaluate the medical coverage of this tool for translating physician questions and patient instructions.
Linguistic Corpus Annotation for Automatic Text Simplification Evaluation
Rémi Cardon
|
Adrien Bibal
|
Rodrigo Wilkens
|
David Alfter
|
Magali Norré
|
Adeline Müller
|
Patrick Watrin
|
Thomas François
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Evaluating automatic text simplification (ATS) systems is a difficult task that is performed either by automatic metrics or by user-based evaluations. However, from a linguistic point of view, it is not always clear on what bases these evaluations operate. In this paper, we propose annotations of the ASSET corpus that can be used to shed more light on ATS evaluation. In addition to contributing this resource, we show how it can be used to analyze SARI’s behavior and to re-evaluate existing ATS systems. We present our insights as a step to improve ATS evaluation protocols in the future.
A Neural Machine Translation Approach to Translate Text to Pictographs in a Medical Speech Translation System - The BabelDr Use Case
Jonathan Mutal
|
Pierrette Bouillon
|
Magali Norré
|
Johanna Gerlach
|
Lucia Ormaechea Grijalba
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
The use of images has been shown to positively affect patient comprehension in medical settings, in particular to deliver specific medical instructions. However, tools that automatically translate sentences into pictographs are still scarce due to the lack of resources. Previous studies have focused on the translation of sentences into pictographs by using WordNet combined with rule-based approaches and deep learning methods. In this work, we show how we leveraged the BabelDr system, a speech-to-speech translator for medical triage, to build a speech-to-pictograph translator using UMLS and neural machine translation approaches. We show that the translation from French sentences to a UMLS gloss can be viewed as a machine translation task and that a Multilingual Neural Machine Translation system achieved the best results.
2021
Extending a Text-to-Pictograph System to French and to Arasaac
Magali Norré
|
Vincent Vandeghinste
|
Pierrette Bouillon
|
Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
We present an adaptation of the Text-to-Picto system, initially designed for Dutch, and extended to English and Spanish. The original system, aimed at people with an intellectual disability, automatically translates text into pictographs (Sclera and Beta). We extend it to French and add a large set of Arasaac pictographs linked to WordNet 3.1. To carry out this adaptation, we automatically link the pictographs and their metadata to synsets of two French WordNets and leverage this information to translate words into pictographs. We automatically and manually evaluate our system with different corpora corresponding to different use cases, including one for medical communication between doctors and patients. The system is also compared to similar systems in other languages.
2020
AMesure: A Web Platform to Assist the Clear Writing of Administrative Texts
Thomas François
|
Adeline Müller
|
Eva Rolin
|
Magali Norré
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations
This article presents the AMesure platform, which aims to assist writers of French administrative texts in simplifying their writing. The platform includes a readability formula specialized for administrative texts and uses various natural language processing (NLP) tools to analyze texts and highlight a number of linguistic phenomena considered difficult to read. Finally, based on the difficulties identified, it offers users advice drawn from official plain-language guides. This paper describes the different components of the system and reports an evaluation of these components.