Cédrick Fairon

Also published as: Cedrick Fairon

2024

pdf abs
La subjectivité dans le journalisme québécois et belge : transfert de connaissance inter-médias et inter-cultures
Louis Escouflaire | Antonin Descampe | Antoine Venant | Cédrick Fairon
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

Cet article s’intéresse à la capacité de transfert des modèles de classification de texte dans le domaine journalistique, en particulier pour distinguer les articles d’opinion des articles d’information. A l’ère du numérique et des réseaux sociaux, les distinctions entre ces genres deviennent de plus en plus floues, augmentant l’importance de cette tâche de classification. Un corpus de 80 000 articles de presse provenant de huit médias, quatre québécois et quatre belges francophones, a été constitué. Pour identifier les thèmes des articles, une clusterisation a été appliquée sur les 10 000 articles issus de chaque média, assurant une distribution équilibrée des thèmes entre les deux genres opinion et information. Les données ont ensuite été utilisées pour entraîner (ou peaufiner) et évaluer deux types de modèles : CamemBERT (Martin et al., 2019), un modèle neuronal pré-entraîné, et un modèle de régression logistique basé sur des traits textuels. Dix versions différentes de chaque modèle sont entraînées : 8 versions mono-médias’, chacune peaufinée sur l’ensemble d’entraînement du sous-corpus correspondant à un média, et deux versions multi-médias’, l’une peaufinée sur 8000 articles québécois, l’autre sur les articles belges. Les résultats montrent que les modèles CamemBERT surpassent significativement les modèlesstatistiques en termes de capacité de transfert (voir Figures 1 et 2). Les modèles CamemBERT montrent une plus grande exactitude, notamment sur les ensembles de test du même média que celui utilisé pour l’entraînement. Cependant, les modèles entraînés sur Le Journal de Montréal(JDM) sont particulièrement performants même sur d’autres ensembles de test, suggérant une distinction plus claire entre les genres journalistiques dans ce média. Les modèles CamemBERT multi-médias affichent également de bonnes performances. Le modèle québécois notamment obtient les meilleurs résultats en moyenne, indiquant qu’une diversité de sources améliore la généricité du modèle. Les modèles statistiques (mono- et multi-médias) montrent des performances globalement inférieures, avec des variations significatives selon les médias. Les textes québécois sont plus difficiles à classer pour ces modèles, suggérant des différences culturelles dans les pratiques journalistiques entre le Québec et la Belgique. L’analyse des traits révèle que l’importance de certains éléments textuels, comme les points d’exclamation et les marqueurs de temps relatifs, varient considérablement entre les modèles entraînés sur différents médias. Par exemple, les éditoriaux du JDM utilisent fréquemment des points d’exclamation, reflétant un style plus affirmé et polarisant. En revanche, les articles de La Presse présentent des particularités qui compliquent la généralisation de la tâche. En sommme, cette étude démontre la supériorité des modèles neuronaux comme CamemBERT pour la classification de textes journalistiques, notamment grâce à leur capacité de transfert, bien que les modèles basés sur des traits se distinguent par la transparence de leur raisonnement’. Elle met également en lumière des différences significatives entre les cultures journalistiques québécoises et belges.

2019

pdf abs
Study of lexical aspect in the French medical language. Development of a lexical resource
Agathe Pierson | Cédrick Fairon
Proceedings of the 2nd Clinical Natural Language Processing Workshop

This paper details the development of a linguistic resource designed to improve temporal information extraction systems and to integrate aspectual values. After a brief review of recent works in temporal information extraction for the medical area, we discuss the linguistic notion of aspect and how it got a place in the NLP field. Then, we present our clinical data and describe the five-step approach adopted in this study. Finally, we represent the linguistic resource itself and explain how we elaborated it and which properties were selected for the creation of the tables.

2018

pdf abs
Investigating Productive and Receptive Knowledge: A Profile for Second Language Learning
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the 27th International Conference on Computational Linguistics

The literature frequently addresses the differences in receptive and productive vocabulary, but grammar is often left unacknowledged in second language acquisition studies. In this paper, we used two corpora to investigate the divergences in the behavior of pedagogically relevant grammatical structures in reception and production texts. We further improved the divergence scores observed in this investigation by setting a polarity to them that indicates whether there is overuse or underuse of a grammatical structure by language learners. This led to the compilation of a language profile that was later combined with vocabulary and readability features for classifying reception and production texts in three classes: beginner, intermediate, and advanced. The results of the automatic classification task in both production (0.872 of F-measure) and reception (0.942 of F-measure) were comparable to the current state of the art. We also attempted to automatically attribute a score to texts produced by learners, and the correlation results were encouraging, but there is still a good amount of room for improvement in this task. The developed language profile will serve as input for a system that helps language learners to activate more of their passive knowledge in writing texts.

pdf
SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning
Rodrigo Wilkens | Leonardo Zilio | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
An SLA Corpus Annotated with Pedagogically Relevant Grammatical Structures
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet
Anaïs Tack | Thomas François | Piet Desmet | Cédrick Fairon
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we introduce NT2Lex, a novel lexical resource for Dutch as a foreign language (NT2) which includes frequency distributions of 17,743 words and expressions attested in expert-written textbook texts and readers graded along the scale of the Common European Framework of Reference (CEFR). In essence, the lexicon informs us about what kind of vocabulary should be understood when reading Dutch as a non-native reader at a particular proficiency level. The main novelty of the resource with respect to the previously developed CEFR-graded lexicons concerns the introduction of corpus-based evidence for L2 word sense complexity through the linkage to Open Dutch WordNet (Postma et al., 2016). The resource thus contains, on top of the lemmatised and part-of-speech tagged lexical entries, a total of 11,999 unique word senses and 8,934 distinct synsets.

2017

pdf abs
Using NLP for Enhancing Second Language Acquisition
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This study presents SMILLE, a system that draws on the Noticing Hypothesis and on input enhancements, addressing the lack of salience of grammatical infor mation in online documents chosen by a given user. By means of input enhancements, the system can draw the user’s attention to grammar, which could possibly lead to a higher intake per input ratio for metalinguistic information. The system receives as input an online document and submits it to a combined processing of parser and hand-written rules for detecting its grammatical structures. The input text can be freely chosen by the user, providing a more engaging experience and reflecting the user’s interests. The system can enhance a total of 107 fine-grained types of grammatical structures that are based on the CEFR. An evaluation of some of those structures resulted in an overall precision of 87%.

pdf abs
Human and Automated CEFR-based Grading of Short Answers
Anaïs Tack | Thomas François | Sophie Roekhaut | Cédrick Fairon
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, which has not yet been subjected to numerous studies up to date. The object of our study is twofold: to examine the intricacy involved with both human and automated CEFR-based grading of short answers. On the one hand, we describe the compilation of a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. We mainly observe that, although the shortness of the answers is reported as undermining a clear-cut evaluation, the length of the answer does not necessarily correlate with inter-examiner disagreement. On the other hand, we explore the development of a soft-voting system for the automated CEFR-based grading of short answers and draw tentative conclusions about its use in a computer-assisted testing (CAT) setting.

2016

pdf
CENTAL at SemEval-2016 Task 12: a linguistically fed CRF model for medical and temporal information extraction
Charlotte Hansart | Damien De Meyere | Patrick Watrin | André Bittar | Cédrick Fairon
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling
Alexander Panchenko | Stefano Faralli | Eugen Ruppert | Steffen Remus | Hubert Naets | Cédrick Fairon | Simone Paolo Ponzetto | Chris Biemann
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf abs
Modèles adaptatifs pour prédire automatiquement la compétence lexicale d’un apprenant de français langue étrangère (Adaptive models for automatically predicting the lexical competence of French as a foreign language learners)
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Cette étude examine l’utilisation de méthodes d’apprentissage incrémental supervisé afin de prédire la compétence lexicale d’apprenants de français langue étrangère (FLE). Les apprenants ciblés sont des néerlandophones ayant un niveau A2/B1 selon le Cadre européen commun de référence pour les langues (CECR). À l’instar des travaux récents portant sur la prédiction de la maîtrise lexicale à l’aide d’indices de complexité, nous élaborons deux types de modèles qui s’adaptent en fonction d’un retour d’expérience, révélant les connaissances de l’apprenant. En particulier, nous définissons (i) un modèle qui prédit la compétence lexicale de tous les apprenants du même niveau de maîtrise et (ii) un modèle qui prédit la compétence lexicale d’un apprenant individuel. Les modèles obtenus sont ensuite évalués par rapport à un modèle de référence déterminant la compétence lexicale à partir d’un lexique spécialisé pour le FLE et s’avèrent gagner significativement en exactitude (9%-17%).

pdf abs
Classification automatique de dictées selon leur niveau de difficulté de compréhension et orthographique (Automatic classification of dictations according to their complexity for comprehension and writing production)
Adeline Müller | Thomas Francois | Sophie Roekhaut | Cedrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Cet article présente une approche visant à évaluer automatiquement la difficulté de dictées en vue de les intégrer dans une plateforme d’apprentissage de l’orthographe. La particularité de l’exercice de la dictée est de devoir percevoir du code oral et de le retranscrire via le code écrit. Nous envisageons ce double niveau de difficulté à l’aide de 375 variables mesurant la difficulté de compréhension d’un texte ainsi que les phénomènes orthographiques et grammaticaux complexes qu’il contient. Un sous-ensemble optimal de ces variables est combiné à l’aide d’un modèle par machines à vecteurs de support (SVM) qui classe correctement 56% des textes. Les variables lexicales basées sur la liste orthographique de Catach (1984) se révèlent les plus informatives pour le modèle.

pdf abs
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in French as a foreign language learning. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically address the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners’ lack of knowledge very well.

2014

pdf abs
FLELex: a graded lexical resource for French foreign learners
Thomas François | Nùria Gala | Patrick Watrin | Cédrick Fairon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present FLELex, the first graded lexicon for French as a foreign language (FFL) that reports word frequencies by difficulty level (according to the CEFR scale). It has been obtained from a tagged corpus of 777,000 words from available textbooks and simplified readers intended for FFL learners. Our goal is to freely provide this resource to the community to be used for a variety of purposes going from the assessment of the lexical difficulty of a text, to the selection of simpler words within text simplification systems, and also as a dictionary in assistive tools for writing.

pdf
A model to predict lexical complexity and to grade words (Un modèle pour prédire la complexité lexicale et graduer les mots) [in French]
Núria Gala | Thomas François | Delphine Bernhard | Cédrick Fairon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf
AMesure: a readability formula for administrative texts (AMESURE: une plateforme de lisibilité pour les textes administratifs) [in French]
Thomas François | Laetitia Brouwers | Hubert Naets | Cédrick Fairon
Proceedings of TALN 2014 (Volume 2: Short Papers)

Cet article présente une méthode hybride de normalisation des SMS, à mi-chemin entre correction orthographique et traduction automatique. La partie du système qui assure la normalisation utilise exclusivement des modèles entraînés sur corpus. Evalué en français par validation croisée, le système obtient un taux d’erreur au mot de 9.3% et un score BLEU de 0.83.

pdf abs
Expressive : Génération automatique de parole expressive à partir de données non linguistiques
Olivier Blanc | Noémi Boubel | Jean-Philippe Goldman | Sophie Roekhaut | Anne Catherine Simon | Cédrick Fairon | Richard Beaufort
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons Expressive, un système de génération de parole expressive à partir de données non linguistiques. Ce système est composé de deux outils distincts : Taittingen, un générateur automatique de textes d’une grande variété lexico-syntaxique produits à partir d’une représentation conceptuelle du discours, et StyloPhone, un système de synthèse vocale multi-styles qui s’attache à rendre le discours produit attractif et naturel en proposant différents styles vocaux.

pdf abs
Text-it /Voice-it Une application mobile de normalisation des SMS
Richard Beaufort | Kévin Macé | Cédrick Fairon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Cet article présente Text-it / Voice-it, une application de normalisation des SMS pour téléphone mobile. L’application permet d’envoyer et de recevoir des SMS normalisés, et offre le choix entre un résultat textuel (Text-it) et vocal (Voice-it).

pdf
Machine learning and features selection for semi-automatic ICD-9-CM encoding
Julia Medori | Cédrick Fairon
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

2009

pdf abs
Recto /Verso Un système de conversion automatique ancienne / nouvelle orthographe à visée linguistique et didactique
Richard Beaufort | Anne Dister | Hubert Naets | Kévin Macé | Cédrick Fairon
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente Recto /Verso, un système de traitement automatique du langage dédié à l’application des rectifications orthographiques de 1990. Ce système a été développé dans le cadre de la campagne de sensibilisation réalisée en mars dernier par le Service et le Conseil de la langue française et de la politique linguistique de la Communauté française de Belgique. Nous commençons par rappeler les motivations et le contenu de la réforme proposée, et faisons le point sur les principes didactiques retenus dans le cadre de la campagne. La plus grande partie de l’article est ensuite consacrée à l’implémentation du système. Nous terminons enfin par une première analyse de l’impact de la campagne sur les utilisateurs.

2006

pdf abs
A framework for real-time dictionary updating
Cédrick Fairon | Sébastien Paumier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present a framework that combines a web-based text acquisition tool, a term extractor and a two-level workflow management system tailored for facilitating dictionary updates. Our aim is to show that, thanks to such a methodology, it is possible to monitor data sources and rapidly review and code new dictionary entries. Once approved, these new entries can feed in real-time client dictionary-based applications that need to be continuously kept up to date.

pdf abs
A translated corpus of 30,000 French SMS
Cédrick Fairon | Sébastien Paumier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The development of communication technologies has contributed to the appearance of new forms in the written language that scientists have to study according to their peculiarities (typing or viewing constraints, synchronicity, etc). In the particular case of SMS (Short Message Service), studies are complicated by a lack of data, mainly due to technical constraints and privacy considerations. In this paper, we present a corpus of 30,000 French SMS collected through a project in Belgium named Faites don de vos SMS à la science (Give your SMS to Science). This corpus is unique in its quality, its size and the fact that the SMS have been manually translated into standard French. We will first describe the collection process and discuss the writers' profiles. Then we will explain in detail how the translation was carried out.

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Tutoriels
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Tutoriels

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

pdf bib
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (Posters)
Piet Mertens | Cédrick Fairon | Anne Dister | Patrick Watrin
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (Posters)

pdf
Corporator: A tool for creating RSS-based specialized corpora
Cédrick Fairon
Proceedings of the 2nd International Workshop on Web as Corpus

2002

pdf abs
Automatic Item Text Generation in Educational Assessment
Cédrick Fairon | David M. Williamson
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

We present an automatic text generation system (ATG) developed for the generation of natural language text for automatically produced test items. This ATG has been developed to work with an automatic item generation system for analytical reasoning items for use in tests with high-stakes outcomes (such as college admissions decisions). As such, the development and implementation of this ATG is couched in the context and goals of automated item generation for educational assessment.