Thomas François

Also published as: Thomas Francois


2021

Extending a Text-to-Pictograph System to French and to Arasaac
Magali Norré | Vincent Vandeghinste | Pierrette Bouillon | Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We present an adaptation of the Text-to-Picto system, initially designed for Dutch, and extended to English and Spanish. The original system, aimed at people with an intellectual disability, automatically translates text into pictographs (Sclera and Beta). We extend it to French and add a large set of Arasaac pictographs linked to WordNet 3.1. To carry out this adaptation, we automatically link the pictographs and their metadata to synsets of two French WordNets and leverage this information to translate words into pictographs. We automatically and manually evaluate our system with different corpora corresponding to different use cases, including one for medical communication between doctors and patients. The system is also compared to similar systems in other languages.
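The core lookup step (word to WordNet synset to pictograph) can be illustrated with a minimal Python sketch. The synset-to-pictograph table below is a hypothetical placeholder standing in for the Arasaac metadata linked to WordNet 3.1; the actual system also handles lemmatisation, multiword expressions and fallback strategies.

    # Minimal illustrative sketch (not the authors' code): map a French word to a
    # pictograph identifier through WordNet synsets, using NLTK's Open Multilingual
    # Wordnet for French lemmas. SYNSET_TO_PICTO is a hypothetical placeholder.
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)  # provides French ('fra') lemmas

    SYNSET_TO_PICTO = {"dog.n.01": 2334, "eat.v.01": 6456}  # hypothetical IDs

    def word_to_pictograph(word_fr):
        """Return the first pictograph ID linked to a synset of the French word."""
        for synset in wn.synsets(word_fr, lang="fra"):
            picto_id = SYNSET_TO_PICTO.get(synset.name())
            if picto_id is not None:
                return picto_id
        return None  # the real system applies fallback strategies here

    print(word_to_pictograph("chien"))  # -> 2334 (via dog.n.01)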

FrenLyS: A Tool for the Automatic Simplification of French General Language Texts
Eva Rolin | Quentin Langlois | Patrick Watrin | Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Lexical simplification (LS) aims at replacing words considered complex in a sentence with simpler equivalents. In this paper, we present the first automatic LS service for French, FrenLys, which offers different techniques to generate, select and rank substitutes. The paper describes the different methods proposed by our tool, which include both classical approaches (e.g. generation of candidates from lexical resources, frequency filtering) and more innovative approaches, such as the exploitation of CamemBERT, a model for French based on the RoBERTa architecture. To evaluate the different methods, a new evaluation dataset for French is introduced.
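As an illustration of the CamemBERT-based generation step, a masked language model can be queried for in-context substitution candidates. This is a minimal sketch in the spirit of the approach, not the FrenLys code itself; a full LS pipeline would then filter and rank the candidates as described above.

    # Minimal sketch: generate in-context substitution candidates with CamemBERT
    # via Hugging Face's fill-mask pipeline (illustrative, not the FrenLys code).
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="camembert-base")

    # The complex word has been replaced by the model's mask token.
    sentence = "Le médecin a constaté une <mask> sur son bras."
    for candidate in fill_mask(sentence, top_k=5):
        print(candidate["token_str"], round(candidate["score"], 3))
    # A full system would filter these candidates (POS, frequency, sense)
    # and rank them by simplicity before substitution.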

2020

AMesure: A Web Platform to Assist the Clear Writing of Administrative Texts
Thomas François | Adeline Müller | Eva Rolin | Magali Norré
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations

This article presents the AMesure platform, which aims to assist writers of French administrative texts in simplifying their writing. The platform includes a readability formula specialized for administrative texts and also uses various natural language processing (NLP) tools to analyze texts and highlight a number of linguistic phenomena considered difficult to read. Finally, based on the difficulties identified, it offers users advice drawn from official plain-language guides. This paper describes the different components of the system and reports an evaluation of these components.

Text Simplification to Help Individuals with Low Vision Read More Fluently
Lauren Sauvan | Natacha Stolowy | Carlos Aguilar | Thomas François | Núria Gala | Frédéric Matonti | Eric Castet | Aurélie Calabrèse
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

The objective of this work is to introduce text simplification as a potential reading aid to help improve the poor reading performance experienced by visually impaired individuals. As a first step, we explore what makes a text especially complex when read with low vision, by assessing the individual effect of three word properties (frequency, orthographic similarity and length) on reading speed in the presence of Central visual Field Loss (CFL). Individuals with bilateral CFL induced by macular diseases read pairs of French sentences displayed with the self-paced reading method. For each sentence pair, sentence n contained a target word matched with a synonym of the same length included in sentence n+1. Reading time was recorded for each target word. Given the corpus we used, our results show that (1) word frequency has a significant effect on reading time (the more frequent the word, the faster the reading speed), with a larger amplitude (in the range of seconds) than in normal vision; (2) word neighborhood size has a significant effect on reading time (the more neighbors, the slower the reading speed), this effect being rather small in amplitude but, interestingly, reversed compared to normal vision; (3) word length has no significant effect on reading time. Supporting the development of new and more effective assistive technology to help low-vision readers is an important and timely issue, with massive potential implications for social and rehabilitation practices. The end goal of this project is to use our findings to customize text simplification for this specific population and use it as an optimal and efficient reading aid.

Combining Expert Knowledge with Frequency Information to Infer CEFR Levels for Words
Alice Pintard | Thomas François
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Traditional approaches to setting goals in second language (L2) vocabulary acquisition have relied either on word lists obtained from large L1 corpora or on the collective knowledge and experience of L2 experts, teachers, and examiners. Both approaches are known to offer some advantages, but also to have some limitations. In this paper, we try to combine both sources of information, namely the official Reference Level Descriptions (RLD) for the French language and the FLElex lexical database. Our aim is to train a statistical model on the French RLD that is able to turn the distributional information from FLElex into one of the six levels of the Common European Framework of Reference for Languages (CEFR). We show that such an approach yields a gain of 29% in accuracy compared to the method currently used in the CEFRLex project. In addition, our experiments offer deeper insights into the advantages and shortcomings of the two traditional sources of information (frequency vs. expert knowledge).
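The idea of mapping a FLElex-style frequency distribution onto a single CEFR level can be sketched as a small supervised model; the feature rows and labels below are invented for illustration and do not reflect the paper's data or exact model.

    # Hedged sketch: learn to turn a per-level frequency profile (A1..C2 columns,
    # as in FLElex) into one CEFR label, using expert (RLD) levels as targets.
    # All numbers and labels below are made up for illustration.
    from sklearn.linear_model import LogisticRegression

    X = [
        [120.0, 80.0, 40.0, 10.0, 5.0, 1.0],   # frequent from A1 onwards
        [0.0, 2.0, 10.0, 25.0, 30.0, 28.0],    # concentrated at higher levels
        [0.0, 0.0, 1.0, 3.0, 12.0, 20.0],      # rare before C1
    ]
    y = ["A1", "B2", "C1"]  # hypothetical expert labels from the RLD

    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(model.predict([[90.0, 60.0, 30.0, 8.0, 2.0, 0.0]]))  # likely ['A1']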

Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers
Núria Gala | Anaïs Tack | Ludivine Javourey-Drevet | Thomas François | Johannes C. Ziegler
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we present a new parallel corpus addressed to researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged between 7 and 9 years old. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to propose a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. The transcripts of reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population. By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and on demand for research purposes.

2019

PolylexFLE : une base de données d’expressions polylexicales pour le FLE (PolylexFLE : a database of multiword expressions for French L2 language learning)
Amalia Todirascu | Marion Cargill | Thomas Francois
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

We present the PolylexFLE database, which contains 4,295 multiword expressions. It is integrated into a platform for learning French as a foreign language (FLE), SimpleApprenant, dedicated to the learning of verbal multiword expressions (idioms, collocations, and fixed expressions). In order to offer exercises adapted to the levels of the Common European Framework of Reference for Languages (CEFR), we used a mixed (manual and automatic) procedure to annotate 1,098 expressions according to CEFR proficiency levels. The article focuses on the automatic procedure, which first identifies the expressions of the PolylexFLE database in a corpus using a regular-expression-based system. In a second step, their distribution within a corpus annotated along the CEFR scale is estimated and converted into a single CEFR level.
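The first, regex-based identification step can be illustrated with a small sketch; the pattern below is a hand-written, simplified stand-in for the PolylexFLE matcher and only covers a few inflected forms of one expression.

    # Illustrative sketch of regex-based MWE spotting (simplified stand-in for
    # the PolylexFLE matcher): match a few inflected forms of "prendre la fuite".
    import re

    # A real system would generate such patterns from lemmas and a morphological
    # lexicon; this hand-written alternation is only for illustration.
    pattern = re.compile(
        r"\b(prend|prends|prenons|prenez|prennent|prenait|pris|prendra)\s+la\s+fuite\b",
        re.IGNORECASE,
    )

    corpus_sentence = "Le suspect a pris la fuite avant l'arrivée de la police."
    match = pattern.search(corpus_sentence)
    print(match.group(0) if match else "no MWE found")  # -> "pris la fuite"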

2018

ReSyf: a French lexicon with ranked synonyms
Mokhtar B. Billami | Thomas François | Núria Gala
Proceedings of the 27th International Conference on Computational Linguistics

In this article, we present ReSyf, a lexical resource of monolingual synonyms ranked according to how difficult they are for native learners of French to read and understand. The synonyms come from an existing lexical network and have been semantically disambiguated and refined. A ranking algorithm, based on a wide range of linguistic features and validated through an evaluation campaign with human annotators, automatically sorts the synonyms corresponding to a given word sense by reading difficulty. ReSyf is freely available and will be integrated into a web platform for reading assistance. It can also be used to perform lexical simplification of French texts.

NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet
Anaïs Tack | Thomas François | Piet Desmet | Cédrick Fairon
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we introduce NT2Lex, a novel lexical resource for Dutch as a foreign language (NT2) which includes frequency distributions of 17,743 words and expressions attested in expert-written textbook texts and readers graded along the scale of the Common European Framework of Reference (CEFR). In essence, the lexicon informs us about what kind of vocabulary should be understood when reading Dutch as a non-native reader at a particular proficiency level. The main novelty of the resource with respect to the previously developed CEFR-graded lexicons concerns the introduction of corpus-based evidence for L2 word sense complexity through the linkage to Open Dutch WordNet (Postma et al., 2016). The resource thus contains, on top of the lemmatised and part-of-speech tagged lexical entries, a total of 11,999 unique word senses and 8,934 distinct synsets.

The Interface Between Readability and Automatic Text Simplification
Thomas François
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

Assisted Lexical Simplification for French Native Children with Reading Difficulties
Firas Hmida | Mokhtar B. Billami | Thomas François | Núria Gala
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

EFLLex: A Graded Lexical Resource for Learners of English as a Foreign Language
Luise Dürlich | Thomas François
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Human and Automated CEFR-based Grading of Short Answers
Anaïs Tack | Thomas François | Sophie Roekhaut | Cédrick Fairon
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, a task that has received little attention to date. The objective of our study is twofold: to examine the intricacies involved in both human and automated CEFR-based grading of short answers. On the one hand, we describe the compilation of a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. We mainly observe that, although the shortness of the answers is reported as undermining a clear-cut evaluation, answer length does not necessarily correlate with inter-examiner disagreement. On the other hand, we explore the development of a soft-voting system for the automated CEFR-based grading of short answers and draw tentative conclusions about its use in a computer-assisted testing (CAT) setting.
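A soft-voting ensemble of the kind mentioned above can be sketched with scikit-learn; the features, data and component classifiers here are placeholders rather than the paper's actual configuration.

    # Hedged sketch of a soft-voting classifier for CEFR grading (placeholder
    # features and data; the paper's actual feature set and learners differ).
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    # Toy features per answer: token count, mean word length, type/token ratio
    X = [[250, 4.9, 0.42], [40, 4.1, 0.55], [120, 4.5, 0.48],
         [260, 5.1, 0.40], [45, 4.0, 0.58], [130, 4.6, 0.47]]
    y = ["B2", "A2", "B1", "B2", "A2", "B1"]  # examiner-assigned CEFR labels

    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting="soft",  # average the predicted class probabilities
    ).fit(X, y)

    print(ensemble.predict([[200, 4.8, 0.45]]))  # e.g. ['B2']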

2016

Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

SweLLex: Second language learners’ productive vocabulary
Elena Volodina | Ildikó Pilán | Lorena Llozhi | Baptiste Degryse | Thomas François
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

Are Cohesive Features Relevant for Text Readability Evaluation?
Amalia Todirascu | Thomas François | Delphine Bernhard | Núria Gala | Anne-Laure Ligozat
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper investigates the effectiveness of 65 cohesion-based variables that are commonly used in the literature as predictive features to assess text readability. We evaluate the efficiency of these variables across narrative and informative texts intended for an audience of L2 French learners. In our experiments, we use a French corpus that has been both manually and automatically annotated with regard to coreference and anaphoric chains. The efficiency of the 65 variables for readability is analyzed through a correlational analysis and a set of modelling experiments.

Bleu, contusion, ecchymose : tri automatique de synonymes en fonction de leur difficulté de lecture et compréhension (Automatic ranking of synonyms according to their reading and comprehension difficulty)
Thomas Francois | Mokhtar B. Billami | Núria Gala | Delphine Bernhard
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

The readability of a text strongly depends on the difficulty of the lexical units it contains. Lexical simplification therefore aims to replace complex terms with semantic equivalents that are easier to understand: for example, BLEU ('the result of a blow', i.e. a bruise) is simpler than CONTUSION or ECCHYMOSE. This requires resources that list synonyms for given word senses and sort them by order of difficulty. This article describes a method for building such a resource for French. The synonym lists are extracted from BabelNet and JeuxDeMots, then sorted with a statistical ranking algorithm. The ranking results are evaluated against 36 synonym lists manually ordered by forty annotators.
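A much simplified version of the ranking idea, scoring synonyms with a crude difficulty proxy instead of the paper's statistical ranking model trained on many linguistic features, could look as follows; the frequency values are invented.

    # Toy sketch of synonym ranking by difficulty: combine (invented) corpus
    # frequency and word length into a crude score. The paper instead trains a
    # statistical ranking model over a wide range of linguistic features.
    import math

    FREQ = {"bleu": 5200.0, "contusion": 310.0, "ecchymose": 45.0}  # made-up counts

    def difficulty(word):
        # Rarer and longer words are assumed harder to read and understand.
        return -math.log(FREQ.get(word, 1.0)) + 0.2 * len(word)

    synonyms = ["ecchymose", "bleu", "contusion"]
    for word in sorted(synonyms, key=difficulty):
        print(word, round(difficulty(word), 2))
    # Expected order: bleu < contusion < ecchymose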

Modèles adaptatifs pour prédire automatiquement la compétence lexicale d’un apprenant de français langue étrangère (Adaptive models for automatically predicting the lexical competence of French as a foreign language learners)
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

This study examines the use of supervised incremental learning methods to predict the lexical competence of learners of French as a foreign language (FFL). The targeted learners are Dutch speakers with an A2/B1 level according to the Common European Framework of Reference for Languages (CEFR). Following recent work on predicting lexical proficiency with complexity features, we develop two types of models that adapt on the basis of feedback revealing the learner's knowledge. In particular, we define (i) a model that predicts the lexical competence of all learners of the same proficiency level and (ii) a model that predicts the lexical competence of an individual learner. The resulting models are then evaluated against a reference model that determines lexical competence from a specialised FFL lexicon, and they prove to gain significantly in accuracy (9%-17%).
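The incremental-adaptation idea can be sketched with a classifier that supports partial_fit, updating its estimate of which words the learner knows after each round of feedback; the features and data below are invented, not the study's.

    # Hedged sketch of incremental adaptation: update a word-knowledge model
    # after each batch of learner feedback with partial_fit (invented features:
    # word length and log frequency; labels: 1 = known, 0 = unknown).
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(loss="log_loss", random_state=0)

    # Initial fit on feedback from learners of the same CEFR level
    X_init = [[4, 5.2], [11, 1.3], [6, 4.0], [13, 0.8]]
    y_init = [1, 0, 1, 0]
    model.partial_fit(X_init, y_init, classes=[0, 1])

    # Later: adapt to one learner's own answers as they come in
    X_feedback = [[9, 2.1], [5, 4.8]]
    y_feedback = [0, 1]
    model.partial_fit(X_feedback, y_feedback)

    print(model.predict([[7, 3.5]]))  # updated prediction for a new word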

Classification automatique de dictées selon leur niveau de difficulté de compréhension et orthographique (Automatic classification of dictations according to their complexity for comprehension and writing production)
Adeline Müller | Thomas Francois | Sophie Roekhaut | Cedrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

This article presents an approach for automatically assessing the difficulty of dictations with a view to integrating them into a spelling-learning platform. The specificity of the dictation exercise is that it requires perceiving the oral code and transcribing it into the written code. We address this double level of difficulty with 375 variables measuring the comprehension difficulty of a text as well as the complex spelling and grammatical phenomena it contains. An optimal subset of these variables is combined using a support vector machine (SVM) model that correctly classifies 56% of the texts. The lexical variables based on Catach's (1984) spelling list prove to be the most informative for the model.

SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas François | Elena Volodina | Ildikó Pilán | Anaïs Tack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper introduces SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language, which describes the distribution of 15,681 words and expressions across the levels of the Common European Framework of Reference (CEFR). The resource is based on a corpus of coursebook texts, and thus describes the receptive vocabulary learners are exposed to during reading activities, as opposed to the productive vocabulary they use when speaking or writing. The paper describes the methodology applied to create the list and to estimate the frequency distribution. It also discusses some characteristics of the resulting resource and compares it to other lexical resources for Swedish. An interesting feature of this resource is the possibility of separating the wheat from the chaff, i.e. distinguishing the core vocabulary at each level (vocabulary shared by several coursebook writers) from the peripheral vocabulary used by only a minority of coursebook writers.
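The core-versus-peripheral distinction, based on how many coursebook series use a word at a given level, can be sketched as a simple dispersion count; the word lists below are invented for illustration.

    # Toy sketch of core vs. peripheral vocabulary at one CEFR level: a word is
    # "core" if it appears in most coursebook series at that level. The series
    # contents below are invented for illustration.
    A1_SERIES = {
        "series_1": {"hus", "bok", "läsa", "skola"},
        "series_2": {"hus", "bok", "äpple", "skola"},
        "series_3": {"hus", "bok", "läsa", "cykel"},
    }

    def core_vocabulary(series, min_share=0.66):
        threshold = min_share * len(series)
        all_words = set().union(*series.values())
        return {w for w in all_words
                if sum(w in words for words in series.values()) >= threshold}

    print(sorted(core_vocabulary(A1_SERIES)))  # -> ['bok', 'hus', 'läsa', 'skola']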

Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in French as a foreign language learning. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically address the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners’ lack of knowledge very well.
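One simple way to collapse a FLELex frequency distribution into a single word-difficulty level, in the spirit of the derivation described above, is to take the first CEFR level at which the word is attested; the sketch below uses invented frequency rows and is not necessarily the exact rule applied in the paper.

    # Hedged sketch: assign each word its first CEFR level of attestation in a
    # FLELex-style distribution (invented frequencies; the paper's exact rule
    # may differ).
    CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

    FLELEX_LIKE = {              # per-level frequencies per word (made up)
        "maison":     [310.2, 250.1, 180.0, 150.3, 120.7, 90.4],
        "toutefois":  [0.0, 0.0, 12.4, 30.8, 45.1, 50.9],
        "nonobstant": [0.0, 0.0, 0.0, 0.0, 1.2, 3.5],
    }

    def first_level(word):
        freqs = FLELEX_LIKE.get(word)
        if freqs is None:
            return None  # out-of-vocabulary word
        return next((lvl for lvl, f in zip(CEFR_LEVELS, freqs) if f > 0), None)

    for w in FLELEX_LIKE:
        print(w, first_level(w))  # maison A1, toutefois B1, nonobstant C1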

Combining Manual and Automatic Prosodic Annotation for Expressive Speech Synthesis
Sandrine Brognaux | Thomas François | Marco Saerens
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text-to-speech has long been centered on the production of an intelligible message of good quality. More recently, interest has shifted to the generation of more natural and expressive speech. A major issue of existing approaches is that they usually rely on a manual annotation in expressive styles, which tends to be rather subjective. A typical related issue is that the annotation is strongly influenced ― and possibly biased ― by the semantic content of the text (e.g. a shot or a fault may incite the annotator to tag that sequence as expressing a high degree of excitation, independently of its acoustic realization). This paper investigates the assumption that human annotation of basketball commentaries in excitation levels can be automatically improved on the basis of acoustic features. It presents two techniques for label correction exploiting a Gaussian mixture and a proportional-odds logistic regression. The automatically re-annotated corpus is then used to train HMM-based expressive speech synthesizers, the performance of which is assessed through subjective evaluations. The results indicate that the automatic correction of the annotation with Gaussian mixture helps to synthesize more contrasted excitation levels, while preserving naturalness.
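The Gaussian-mixture correction step can be sketched as fitting a mixture to acoustic features and relabelling each segment according to the component it most plausibly belongs to; the feature values are invented and the component-to-level mapping is a simplification of the paper's procedure.

    # Hedged sketch: re-annotate excitation levels from acoustic features with a
    # Gaussian mixture (invented F0/energy values; the paper's label-correction
    # procedure is more elaborate).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # One row per commentary segment: [mean F0 in Hz, mean energy in dB]
    features = np.array([[110.0, 58.0], [115.0, 60.0], [180.0, 72.0],
                         [185.0, 74.0], [250.0, 82.0], [255.0, 84.0]])

    gmm = GaussianMixture(n_components=3, random_state=0).fit(features)
    components = gmm.predict(features)

    # Order components by mean F0 so that 0 = calm ... 2 = highly excited
    order = np.argsort(gmm.means_[:, 0])
    relabel = {comp: rank for rank, comp in enumerate(order)}
    print([relabel[c] for c in components])  # e.g. [0, 0, 1, 1, 2, 2]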

2014

FLELex: a graded lexical resource for French foreign learners
Thomas François | Nùria Gala | Patrick Watrin | Cédrick Fairon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present FLELex, the first graded lexicon for French as a foreign language (FFL) that reports word frequencies by difficulty level (according to the CEFR scale). It has been obtained from a tagged corpus of 777,000 words drawn from available textbooks and simplified readers intended for FFL learners. Our goal is to provide this resource freely to the community, to be used for a variety of purposes ranging from the assessment of the lexical difficulty of a text, to the selection of simpler words within text simplification systems, to serving as a dictionary in assistive writing tools.

Multiple Choice Question Corpus Analysis for Distractor Characterization
Van-Minh Pho | Thibault André | Anne-Laure Ligozat | Brigitte Grau | Gabriel Illouz | Thomas François
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a study of multiple choice questions (MCQs) aiming to define criteria for the automatic selection of distractors. We aim to show that distractor writing follows rules such as syntactic and semantic homogeneity with the associated answer, and that this homogeneity can be identified automatically. Manual analysis shows that the homogeneity rule is indeed respected when distractors are written, and automatic analysis shows that these criteria can be reproduced. In future work, they can be combined with other criteria to automatically select distractors.

Syntactic Sentence Simplification for French
Laetitia Brouwers | Delphine Bernhard | Anne-Laure Ligozat | Thomas François
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

An analysis of a French as a Foreign Language Corpus for Readability Assessment
Thomas François
Proceedings of the third workshop on NLP for computer-assisted language learning

A model to predict lexical complexity and to grade words (Un modèle pour prédire la complexité lexicale et graduer les mots) [in French]
Núria Gala | Thomas François | Delphine Bernhard | Cédrick Fairon
Proceedings of TALN 2014 (Volume 1: Long Papers)

AMesure: a readability formula for administrative texts (AMESURE: une plateforme de lisibilité pour les textes administratifs) [in French]
Thomas François | Laetitia Brouwers | Hubert Naets | Cédrick Fairon
Proceedings of TALN 2014 (Volume 2: Short Papers)

2013

Automatic extraction of contextual valence shifters.
Noémi Boubel | Thomas François | Hubert Naets
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Simplification syntaxique de phrases pour le français (Syntactic Simplification for French Sentences) [in French]
Laetitia Brouwers | Delphine Bernhard | Anne-Laure Ligozat | Thomas François
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

An “AI readability” Formula for French as a Foreign Language
Thomas François | Cédrick Fairon
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Do NLP and machine learning improve traditional readability formulas?
Thomas François | Eleni Miltsakaki
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

2011

On the Contribution of MWE-based Features to a Readability Formula for French as a Foreign Language
Thomas François | Patrick Watrin
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Quel apport des unités polylexicales dans une formule de lisibilité pour le français langue étrangère (What is the contribution of multi-word expressions in a readability formula for the French as a foreign language)
Thomas François | Patrick Watrin
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This study considers the use of multiword expressions (MWEs) as predictors in a readability formula for French as a foreign language. Using an MWE extractor that combines a statistical approach with a linguistic filter, we define six variables that take into account the density and probability of nominal MWEs, as well as their internal structure. Our experiments conclude that these six variables have weak predictive power and reveal that a simple approach based on the average n-gram probability of the texts is more effective.
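The simple baseline mentioned at the end, the average n-gram probability of a text, can be sketched as follows; the bigram probabilities are invented, and a real implementation would estimate a language model on a large corpus with smoothing.

    # Toy sketch of the baseline feature: mean log-probability of a text's
    # bigrams under a language model (probabilities below are invented; a real
    # setup would estimate them from a large corpus, with smoothing).
    import math

    BIGRAM_PROB = {("la", "maison"): 0.012, ("maison", "est"): 0.020,
                   ("est", "grande"): 0.015}  # made-up values

    def mean_bigram_logprob(tokens, floor=1e-6):
        bigrams = list(zip(tokens, tokens[1:]))
        logps = [math.log(BIGRAM_PROB.get(bg, floor)) for bg in bigrams]
        return sum(logps) / len(logps)

    print(round(mean_bigram_logprob(["la", "maison", "est", "grande"]), 2))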

An N-gram Frequency Database Reference to Handle MWE Extraction in NLP Applications
Patrick Watrin | Thomas François
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

2009

Combining a Statistical Language Model with Logistic Regression to Predict the Lexical and Syntactic Difficulty of Texts for FFL
Thomas François
Proceedings of the Student Research Workshop at EACL 2009

Modèles statistiques pour l’estimation automatique de la difficulté de textes de FLE (Statistical models for the automatic estimation of the difficulty of FFL texts)
Thomas François
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Reading is one of the essential tasks in learning a foreign language. However, finding a text that deals with a specific topic and is suited to the level of each learner is time-consuming and could be automated. Experiments show that, for English, statistical classifiers can automatically estimate the difficulty of a text. In this article, we propose an original methodology comparing, for French as a foreign language (FFL), various classification techniques (logistic regression, bagging, and boosting) on two training corpora. This comparative analysis shows a slight advantage for multinomial logistic regression.
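The comparison of classification techniques can be sketched with cross-validated accuracy scores for logistic regression, bagging and boosting; the data below is random placeholder material, not the FFL corpora used in the paper.

    # Hedged sketch of the classifier comparison (random placeholder data, not
    # the FFL corpora): cross-validated accuracy for logistic regression,
    # bagging and boosting.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(120, 20)            # 120 texts x 20 readability features
    y = rng.randint(0, 6, size=120)  # six difficulty levels

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "bagging": BaggingClassifier(random_state=0),
        "boosting": AdaBoostClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.2f}")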