Lionel Nicolas


About the Applicability of Combining Implicit Crowdsourcing and Language Learning for the Collection of NLP Datasets
Verena Lyding | Lionel Nicolas | Alexander König
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

In this article, we present a recent trend of approaches, hereafter referred to as Collect4NLP, and discuss its applicability. Collect4NLP-based approaches collect inputs from language learners through learning exercises and aggregate the collected data to derive linguistic knowledge of expert quality. The primary purpose of these approaches is to improve NLP resources, however sincere concern with the needs of learners is crucial for making Collect4NLP work. We discuss the applicability of Collect4NLP approaches in relation to two perspectives. On the one hand, we compare Collect4NLP approaches to the two crowdsourcing trends currently most prevalent in NLP, namely Crowdsourcing Platforms (CPs) and Games-With-A-Purpose (GWAPs), and identify strengths and weaknesses of each trend. By doing so we aim to highlight particularities of each trend and to identify in which kind of settings one trend should be favored over the other two. On the other hand, we analyze the applicability of Collect4NLP approaches to the production of different types of NLP resources. We first list the types of NLP resources most used within its community and second propose a set of blueprints for mapping these resources to well-established language learning exercises as found in standard language learning textbooks.


pdf bib
An Experiment on Implicitly Crowdsourcing Expert Knowledge about Romanian Synonyms from Language Learners
Lionel Nicolas | Lavinia Nicoleta Aparaschivei | Verena Lyding | Christos Rodosthenous | Federico Sangati | Alexander König | Corina Forascu
Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning


Creating Expert Knowledge by Relying on Language Learners: a Generic Approach for Mass-Producing Language Resources by Combining Implicit Crowdsourcing and Language Learning
Lionel Nicolas | Verena Lyding | Claudia Borg | Corina Forascu | Karën Fort | Katerina Zdravkova | Iztok Kosem | Jaka Čibej | Špela Arhar Holdt | Alice Millour | Alexander König | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Anisia Katinskaia | Anabela Barreiro | Lavinia Aparaschivei | Yaakov HaCohen-Kerner
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.

Using Crowdsourced Exercises for Vocabulary Training to Expand ConceptNet
Christos Rodosthenous | Verena Lyding | Federico Sangati | Alexander König | Umair ul Hassan | Lionel Nicolas | Jolita Horbacauskiene | Anisia Katinskaia | Lavinia Aparaschivei
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this work, we report on a crowdsourcing experiment conducted using the V-TREL vocabulary trainer which is accessed via a Telegram chatbot interface to gather knowledge on word relations suitable for expanding ConceptNet. V-TREL is built on top of a generic architecture implementing the implicit crowdsourding paradigm in order to offer vocabulary training exercises generated from the commonsense knowledge-base ConceptNet and – in the background – to collect and evaluate the learners’ answers to extend ConceptNet with new words. In the experiment about 90 university students learning English at C1 level, based on Common European Framework of Reference for Languages (CEFR), trained their vocabulary with V-TREL over a period of 16 calendar days. The experiment allowed to gather more than 12,000 answers from learners on different question types. In this paper we present in detail the experimental setup and the outcome of the experiment, which indicates the potential of our approach for both crowdsourcing data as well as fostering vocabulary skills.

pdf bib
Substituto – A Synchronous Educational Language Game for Simultaneous Teaching and Crowdsourcing
Marianne Grace Araneta | Gülşen Eryiğit | Alexander König | Ji-Ung Lee | Ana Luís | Verena Lyding | Lionel Nicolas | Christos Rodosthenous | Federico Sangati
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning


v-trel: Vocabulary Trainer for Tracing Word Relations - An Implicit Crowdsourcing Approach
Verena Lyding | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Lionel Nicolas | Alexander König | Jolita Horbacauskiene | Anisia Katinskaia
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we present our work on developing a vocabulary trainer that uses exercises generated from language resources such as ConceptNet and crowdsources the responses of the learners to enrich the language resource. We performed an empirical evaluation of our approach with 60 non-native speakers over two days, which shows that new entries to expand Concept-Net can efficiently be gathered through vocabulary exercises on word relations. We also report on the feedback gathered from the users and an expert from language teaching, and discuss the potential of the vocabulary trainer application from the user and language learner perspective. The feedback suggests that v-trel has educational potential, while in its current state some shortcomings could be identified.


Transc&Anno: A Graphical Tool for the Transcription and On-the-Fly Annotation of Handwritten Documents
Nadezda Okinina | Lionel Nicolas | Verena Lyding
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


‘interHist’ - an interactive visual interface for corpus exploration
Verena Lyding | Lionel Nicolas | Egon Stemle
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISA corpus of Italian web texts, interHist aims at facilitating the exploration of large results sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports the user-steered navigation by means of interactive filtering. It allows to dynamically switch between an overview on the data and a detailed view on results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist, by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.

The MERLIN corpus: Learner language and the CEFR
Adriane Boyd | Jirka Hana | Lionel Nicolas | Detmar Meurers | Katrin Wisniewski | Andrea Abel | Karin Schöne | Barbora Štindlová | Chiara Vettori
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The MERLIN corpus is a written learner corpus for Czech, German,and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provide a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform.

KoKo: an L1 Learner Corpus for German
Andrea Abel | Aivars Glaznieks | Lionel Nicolas | Egon Stemle
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80% as well as high accuracies of transcriptions (>99%), automatic tokenisation (>99%), sentence splitting (>96%) and POS-tagging (>94%). The KoKo corpus will be published at the end of 2014. It will be the first accessible linguistically annotated German L1 learner corpus and a valuable source for research on L1 learner language as well as for teachers of German as L1, in particular with regards to writing skills.


High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection
Lionel Nicolas | Egon W. Stemle | Klara Kranebitter | Verena Lyding
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013


Unsupervised acquisition of concatenative morphology
Lionel Nicolas | Jacques Farré | Cécile Darme
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Among the linguistic resources formalizing a language, morphological rules are among those that can be achieved in a reasonable time. Nevertheless, since the construction of such resource can require linguistic expertise, morphological rules are still lacking for many languages. The automatized acquisition of morphology is thus an open topic of interest within the NLP field. We present an approach that allows to automatically compute, from raw corpora, a data-representative description of the concatenative mechanisms of a morphology. Our approach takes advantage of phenomena that are observable for all languages using morphological inflection and derivation but are more easy to exploit when dealing with concatenative mechanisms. Since it has been developed toward the objective of being used on as many languages as possible, applying this approach to a varied set of languages needs very few expert work. The results obtained for our first participation in the 2010 edition of MorphoChallenge have confirmed both the practical interest and the potential of the method.


Building a morphological and syntactic lexicon by merging various linguistic resources
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Trouver et confondre les coupables : un processus sophistiqué de correction de lexique
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric Villemonte De La Clergerie
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

La couverture d’un analyseur syntaxique dépend avant tout de la grammaire et du lexique sur lequel il repose. Le développement d’un lexique complet et précis est une tâche ardue et de longue haleine, surtout lorsque le lexique atteint un certain niveau de qualité et de couverture. Dans cet article, nous présentons un processus capable de détecter automatiquement les entrées manquantes ou incomplètes d’un lexique, et de suggérer des corrections pour ces entrées. La détection se réalise au moyen de deux techniques reposant soit sur un modèle statistique, soit sur les informations fournies par un étiqueteur syntaxique. Les hypothèses de corrections pour les entrées lexicales détectées sont générées en étudiant les modifications qui permettent d’améliorer le taux d’analyse des phrases dans lesquelles ces entrées apparaissent. Le processus global met en oeuvre plusieurs techniques utilisant divers outils tels que des étiqueteurs et des analyseurs syntaxiques ou des classifieurs d’entropie. Son application au Lefff , un lexique morphologique et syntaxique à large couverture du français, nous a déjà permis de réaliser des améliorations notables.

A Morphological and Syntactic Wide-coverage Lexicon for Spanish: The Leffe
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the International Conference RANLP-2009

Towards Efficient Production of Linguistic Resources: the Victoria Project
Lionel Nicolas | Miguel A. Molinero | Benoît Sagot | Elena Trigo | Éric de La Clergerie | Miguel Alonso Pardo | Jacques Farré | Joan Miquel Vergés
Proceedings of the International Conference RANLP-2009


Computer Aided Correction and Extension of a Syntactic Wide-Coverage Lexicon
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric de la Clergerie
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)


Confondre le coupable : corrections d’un lexique suggérées par une grammaire
Lionel Nicolas | Jacques Farré | Éric Villemonte De La Clergerie
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Le succès de l’analyse syntaxique d’une phrase dépend de la qualité de la grammaire sous-jacente mais aussi de celle du lexique utilisé. Une première étape dans l’amélioration des lexiques consiste à identifier les entrées lexicales potentiellement erronées, par exemple en utilisant des techniques de fouilles d’erreurs sur corpus (Sagot & Villemonte de La Clergerie, 2006). Nous explorons ici l’étape suivante : la suggestion de corrections pour les entrées identifiées. Cet objectif est atteint au travers de réanalyses des phrases rejetées à l’étape précédente, après modification des informations portées par les entrées suspectées. Un calcul statistique sur les nouveaux résultats permet ensuite de mettre en valeur les corrections les plus pertinentes.