Rodrigo Wilkens


2020

Un corpus d’évaluation pour un système de simplification discursive (An Evaluation Corpus for Automatic Discourse Simplification)
Rodrigo Wilkens | Amalia Todirascu
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

We present a new simplified corpus, available in French, for evaluating a discourse simplification system. This system uses reference chains to simplify texts and to preserve textual cohesion after simplification. We present the corpus collection methodology (via a form that gathers manual simplifications made by expert participants), the rules presented in the guide, an analysis of the types of simplifications, and an evaluation of our corpus by comparison with the output of the automatic simplification system.

Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)
Núria Gala | Rodrigo Wilkens
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Coreference-Based Text Simplification
Rodrigo Wilkens | Bruno Oberle | Amalia Todirascu
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Text simplification aims at adapting documents to make them easier to read by a given audience. Usually, simplification systems consider only lexical and syntactic levels, and, moreover, are often evaluated at the sentence level. Thus, studies on the impact of simplification on text cohesion are lacking. Some works add coreference resolution to their pipeline to address this issue. In this paper, we move forward in this direction and present a rule-based system for automatic text simplification, aiming at adapting French texts for dyslexic children. The architecture of our system takes into account not only lexical and syntactic but also discourse information, based on coreference chains. Our system has been manually evaluated in terms of grammaticality and cohesion. We have also built and used an evaluation corpus containing multiple simplification references for each sentence. It has been annotated by experts following a set of simplification guidelines, and can be used to run automatic evaluation of other simplification systems. Both the system and the evaluation corpus are freely available.

French Coreference for Spoken and Written Language
Rodrigo Wilkens | Bruno Oberle | Frédéric Landragin | Amalia Todirascu
Proceedings of the 12th Language Resources and Evaluation Conference

Coreference resolution aims at identifying and grouping all mentions referring to the same entity. In French, most systems run different setups, making their comparison difficult. In this paper, we present an extensive comparison of several coreference resolution systems for French. The systems have been trained on two corpora (ANCOR for spoken language and Democrat for written language) annotated with coreference chains, and augmented with syntactic and semantic information. The models are compared with different configurations (e.g. with and without singletons). In addition, we evaluate mention detection and coreference resolution separately. We present a full-stack model that outperforms other approaches. This model allows us to study the impact of mention detection errors on coreference resolution. Our analysis shows that mention detection can be improved by focusing on boundary identification, while advances in pronoun-noun relation detection can help the coreference task. Another contribution of this work is the first end-to-end neural French coreference resolution model trained on Democrat (written texts), which is comparable to the state-of-the-art systems for oral French.

Simplifying Coreference Chains for Dyslexic Children
Rodrigo Wilkens | Amalia Todirascu
Proceedings of the 12th Language Resources and Evaluation Conference

We present work aimed at generating adapted content in French for dyslexic children, in the context of the ALECTOR project. To this end, we developed a system that transforms texts at the discourse level. This system uses rules to modify coreference chains, which are markers of text cohesion. These rules were designed following a careful study of coreference chains in both original texts and their simplified versions. Moreover, in order to define reliable transformation rules, we analysed several coreference properties as well as the concurrent simplification operations in the aligned texts. This information is encoded, together with a coreference resolution system and a text rewriting tool, in the proposed system, which comprises a coreference module specialised in written text and seven text transformation operations. The evaluation of the system first focused on checking the simplifications through manual validation by three judges. The errors they found were grouped into five classes that together explain 93% of the errors. The second evaluation step consisted of measuring how 23 judges perceived the simplifications, which allows us to measure the simplification impact of the proposed rules.

2018

Investigating Productive and Receptive Knowledge: A Profile for Second Language Learning
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the 27th International Conference on Computational Linguistics

The literature frequently addresses the differences in receptive and productive vocabulary, but grammar is often left unacknowledged in second language acquisition studies. In this paper, we used two corpora to investigate the divergences in the behavior of pedagogically relevant grammatical structures in reception and production texts. We further improved the divergence scores observed in this investigation by setting a polarity to them that indicates whether there is overuse or underuse of a grammatical structure by language learners. This led to the compilation of a language profile that was later combined with vocabulary and readability features for classifying reception and production texts in three classes: beginner, intermediate, and advanced. The results of the automatic classification task in both production (0.872 of F-measure) and reception (0.942 of F-measure) were comparable to the current state of the art. We also attempted to automatically attribute a score to texts produced by learners, and the correlation results were encouraging, but there is still a good amount of room for improvement in this task. The developed language profile will serve as input for a system that helps language learners to activate more of their passive knowledge in writing texts.

Similarity Measures for the Detection of Clinical Conditions with Verbal Fluency Tasks
Felipe Paula | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, like Dementia. In particular, given a sequence of semantically related words, a large number of switches from one semantic class to another has been linked to clinical conditions. In this work, we investigate three similarity measures for automatically identifying switches in semantic chains: semantic similarity from a manually constructed resource, and word association strength and semantic relatedness, both calculated from corpora. This information is used for building classifiers to distinguish healthy controls from clinical cases with early stages of Alzheimer’s Disease and Mild Cognitive Deficits. The overall results indicate that for clinical conditions the classifiers that use these similarity measures outperform those that use a gold standard taxonomy.
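The idea of identifying switches from pairwise similarity can be sketched as follows. This is only an illustration under stated assumptions, not the method of the paper: the thresholding rule, the function names `lookup_similarity` and `count_switches`, and the toy similarity values are all hypothetical.

```python
# Hypothetical similarity table for adjacent words in a fluency sequence.
# In practice the scores would come from a lexical resource or a corpus.
TOY_SIMILARITY = {
    frozenset(("dog", "cat")): 0.8,
    frozenset(("cat", "lion")): 0.7,
    frozenset(("lion", "shark")): 0.2,
    frozenset(("shark", "whale")): 0.6,
}


def lookup_similarity(a, b):
    """Return the toy similarity of two words (0.0 if the pair is unknown)."""
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def count_switches(words, similarity, threshold):
    """Count adjacent word pairs whose similarity falls below the threshold.

    Assumption: a low similarity between consecutive words is treated as a
    switch from one semantic class to another.
    """
    return sum(
        1 for a, b in zip(words, words[1:]) if similarity(a, b) < threshold
    )


words = ["dog", "cat", "lion", "shark", "whale"]
switches = count_switches(words, lookup_similarity, threshold=0.5)
```

With these toy values only the pair lion/shark falls below the threshold, so one switch is counted. A count like this could then serve as one input feature for a classifier.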

SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning
Rodrigo Wilkens | Leonardo Zilio | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

An SLA Corpus Annotated with Pedagogically Relevant Grammatical Structures
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Jorge A. Wagner Filho | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Using NLP for Enhancing Second Language Acquisition
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This study presents SMILLE, a system that draws on the Noticing Hypothesis and on input enhancements, addressing the lack of salience of grammatical information in online documents chosen by a given user. By means of input enhancements, the system can draw the user’s attention to grammar, which could possibly lead to a higher intake per input ratio for metalinguistic information. The system receives as input an online document and submits it to a combined processing of parser and hand-written rules for detecting its grammatical structures. The input text can be freely chosen by the user, providing a more engaging experience and reflecting the user’s interests. The system can enhance a total of 107 fine-grained types of grammatical structures that are based on the CEFR. An evaluation of some of those structures resulted in an overall precision of 87%.

LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds
Rodrigo Wilkens | Leonardo Zilio | Silvio Ricardo Cordeiro | Felipe Paula | Carlos Ramisch | Marco Idiart | Aline Villavicencio
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

2016

Automatic Construction of Large Readability Corpora
Jorge Alberto Wagner Filho | Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features yields even better results in most cases. For English, shallow features also perform well, as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the two-class and three-class results, especially regarding the intermediate class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.

Multiword Expressions in Child Language
Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of this work is to introduce CHILDES-MWE, which contains English CHILDES corpora automatically annotated with Multiword Expressions (MWEs) information. The result is a resource with almost 350,000 sentences annotated with more than 70,000 distinct MWEs of various types from both longitudinal and latitudinal corpora. This resource can be used for large scale language acquisition studies of how MWEs feature in child language. Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages. Moreover, using additional latitudinal data, we investigate if there are further differences in CN usage and in compositionality preferences. The results obtained for the child-produced sentences reflect CN distribution and compositionality in child-directed sentences.

B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources such as WordNet are useful for NLP applications, but their manual construction consumes time and personnel, and frequently results in low coverage. One alternative is the automatic construction of large resources from corpora like distributional thesauri, containing semantically associated words. However, as they may contain noise, there is a strong need for automatic ways of evaluating the quality of the resulting resource. This paper introduces a gold standard that can aid in this task. The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed based on BabelNet and partly evaluated by human judges. It consists of sets of tests that present one target word, one related word and three unrelated words. B2SG contains 2,875 validated relations: 800 for verbs and 2,075 for nouns; these relations are divided among synonymy, antonymy and hypernymy. They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them. As a case study two distributional thesauri were also developed: one using surface forms from a large (1.5 billion word) corpus and the other using lemmatized forms from a smaller (409 million word) corpus. Both distributional thesauri were then evaluated against B2SG, and the one using lemmatized forms performed slightly better.

2015

Distributional Thesauri for Portuguese: methodology evaluation
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Gabriel Gonçalves | Aline Villavicencio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2012

An annotated English child language database
Aline Villavicencio | Beracah Yankama | Rodrigo Wilkens | Marco Idiart | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

Searching the Annotated Portuguese Childes Corpora
Rodrigo Wilkens
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

I say have you say tem: profiling verbs in children data in English and Portuguese
Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

2010

COMUNICA - A Question Answering System for Brazilian Portuguese
Rodrigo Wilkens | Aline Villavicencio | Daniel Muller | Leandro Wives | Fabio Silva | Stanley Loh
Coling 2010: Demonstrations