Amalia Todirascu

Also published as: Amalia Todiraşcu


HECTOR: A Hybrid TExt SimplifiCation TOol for Raw Texts in French
Amalia Todirascu | Rodrigo Wilkens | Eva Rolin | Thomas François | Delphine Bernhard | Núria Gala
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Reducing the complexity of texts by applying an Automatic Text Simplification (ATS) system has been sparking interest inthe area of Natural Language Processing (NLP) for several years and a number of methods and evaluation campaigns haveemerged targeting lexical and syntactic transformations. In recent years, several studies exploit deep learning techniques basedon very large comparable corpora. Yet the lack of large amounts of corpora (original-simplified) for French has been hinderingthe development of an ATS tool for this language. In this paper, we present our system, which is based on a combination ofmethods relying on word embeddings for lexical simplification and rule-based strategies for syntax and discourse adaptations. We present an evaluation of the lexical, syntactic and discourse-level simplifications according to automatic and humanevaluations. We discuss the performances of our system at the lexical, syntactic, and discourse levels


Coreference-Based Text Simplification
Rodrigo Wilkens | Bruno Oberle | Amalia Todirascu
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Text simplification aims at adapting documents to make them easier to read by a given audience. Usually, simplification systems consider only lexical and syntactic levels, and, moreover, are often evaluated at the sentence level. Thus, studies on the impact of simplification in text cohesion are lacking. Some works add coreference resolution in their pipeline to address this issue. In this paper, we move forward in this direction and present a rule-based system for automatic text simplification, aiming at adapting French texts for dyslexic children. The architecture of our system takes into account not only lexical and syntactic but also discourse information, based on coreference chains. Our system has been manually evaluated in terms of grammaticality and cohesion. We have also built and used an evaluation corpus containing multiple simplification references for each sentence. It has been annotated by experts following a set of simplification guidelines, and can be used to run automatic evaluation of other simplification systems. Both the system and the evaluation corpus are freely available.

French Coreference for Spoken and Written Language
Rodrigo Wilkens | Bruno Oberle | Frédéric Landragin | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

Coreference resolution aims at identifying and grouping all mentions referring to the same entity. In French, most systems run different setups, making their comparison difficult. In this paper, we present an extensive comparison of several coreference resolution systems for French. The systems have been trained on two corpora (ANCOR for spoken language and Democrat for written language) annotated with coreference chains, and augmented with syntactic and semantic information. The models are compared with different configurations (e.g. with and without singletons). In addition, we evaluate mention detection and coreference resolution apart. We present a full-stack model that outperforms other approaches. This model allows us to study the impact of mention detection errors on coreference resolution. Our analysis shows that mention detection can be improved by focusing on boundary identification while advances in the pronoun-noun relation detection can help the coreference task. Another contribution of this work is the first end-to-end neural French coreference resolution model trained on Democrat (written texts), which compares to the state-of-the-art systems for oral French.

Simplifying Coreference Chains for Dyslexic Children
Rodrigo Wilkens | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a work aiming to generate adapted content for dyslexic children for French, in the context of the ALECTOR project. Thus, we developed a system to transform the texts at the discourse level. This system modifies the coreference chains, which are markers of text cohesion, by using rules. These rules were designed following a careful study of coreference chains in both original texts and its simplified versions. Moreover, in order to define reliable transformation rules, we analysed several coreference properties as well as the concurrent simplification operations in the aligned texts. This information is coded together with a coreference resolution system and a text rewritten tool in the proposed system, which comprise a coreference module specialised in written text and seven text transformation operations. The evaluation of the system firstly focused on check the simplification by manual validation of three judges. These errors were grouped into five classes that combined can explain 93% of the errors. The second evaluation step consisted of measuring the simplification perception by 23 judges, which allow us to measure the simplification impact of the proposed rules.

Un corpus d’évaluation pour un système de simplification discursive (An Evaluation Corpus for Automatic Discourse Simplification)
Rodrigo Wilkens | Amalia Todirascu
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Nous présentons un nouveau corpus simplifié, disponible en français pour l’évaluation d’un système de simplification discursive. Ce système utilise des chaînes de référence pour simplifier et pour préserver la cohésion textuelle après simplification. Nous présentons la méthodologie de collecte de corpus (via un formulaire, qui recueille les simplifications manuelles faites par des participants experts), les règles présentées dans le guide, une analyse des types de simplifications et une évaluation de notre corpus, par comparaison avec la sortie du système de simplification automatique.


PolylexFLE : une base de données d’expressions polylexicales pour le FLE (PolylexFLE : a database of multiword expressions for French L2 language learning)
Amalia Todirascu | Marion Cargill | Thomas Francois
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

Nous présentons la base PolylexFLE, contenant 4295 expressions polylexicales. Elle est integrée dans une plateforme d’apprentissage du FLE, SimpleApprenant, destinée à l’apprentissage des expressions polylexicales verbales (idiomatiques, collocations ou expressions figées). Afin de proposer des exercices adaptés au niveau du Cadre européen de référence pour les langues (CECR), nous avons utilisé une procédure mixte (manuelle et automatique) pour annoter 1098 expressions selon les niveaux de compétence du CECR. L’article se concentre sur la procédure automatique qui identifie, dans un premier temps, les expressions de la base PolylexFLE dans un corpus à l’aide d’un système à base d’expressions régulières. Dans un second temps, leur distribution au sein de corpus, annoté selon l’échelle du CECR, est estimée et transformée en un niveau CECR unique.


Survey: Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.


Are Cohesive Features Relevant for Text Readability Evaluation?
Amalia Todirascu | Thomas François | Delphine Bernhard | Núria Gala | Anne-Laure Ligozat
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper investigates the effectiveness of 65 cohesion-based variables that are commonly used in the literature as predictive features to assess text readability. We evaluate the efficiency of these variables across narrative and informative texts intended for an audience of L2 French learners. In our experiments, we use a French corpus that has been both manually and automatically annotated as regards to co-reference and anaphoric chains. The efficiency of the 65 variables for readability is analyzed through a correlational analysis and some modelling experiments.


Caractériser les discours académiques et de vulgarisation : quelles propriétés ?
Amalia Todirascu | Beatriz Sanchez Cardenas
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

L’article présente une étude des propriétés linguistiques (lexicales, morpho-syntaxiques, syntaxiques) permettant la classification automatique de documents selon leur genre (articles scientifiques et articles de vulgarisation), dans deux domaines différentes (médecine et informatique). Notre analyse, effectuée sur des corpus comparables en genre et en thèmes disponibles en français, permet de valider certaines propriétés identifiées dans la littérature comme caractéristiques des discours académiques ou de vulgarisation scientifique. Les premières expériences de classification évaluent l’influence de ces propriétés pour l’identification automatique du genre pour le cas spécifique des textes scientifiques ou de vulgarisation.


French and German Corpora for Audience-based Text Type Classification
Amalia Todirascu | Sebastian Padó | Jennifer Krisch | Max Kisselew | Ulrich Heid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents some of the results of the CLASSYN project which investigated the classification of text according to audience-related text types. We describe the design principles and the properties of the French and German linguistically annotated corpora that we have created. We report on tools used to collect the data and on the quality of the syntactic annotation. The CLASSYN corpora comprise two text collections to investigate general text types difference between scientific and popular science text on the two domains of medical and computer science.


Cognate Identification for a French - Romanian Lexical Alignment System: Empirical Study
Mirabela Navlea | Amalia Todiraşcu
Proceedings of the 15th Annual conference of the European Association for Machine Translation

Identification de cognats à partir de corpus parallèles français-roumain (Identification of cognates from French-Romanian parallel corpora)
Mirabela Navlea | Amalia Todiraşcu
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente une méthode hybride d’identification de cognats français - roumain. Cette méthode exploite des corpus parallèles alignés au niveau propositionnel, lemmatisés et étiquetés (avec des propriétés morphosyntaxiques). Notre méthode combine des techniques statistiques et des informations linguistiques pour améliorer les résultats obtenus. Nous évaluons le module d’identification de cognats et nous faisons une comparaison avec des méthodes statistiques pures, afin d’étudier l’impact des informations linguistiques utilisées sur la qualité des résultats obtenus. Nous montrons que l’utilisation des informations linguistiques augmente significativement la performance de la méthode.

RefGen, outil d’identification automatique des chaînes de référence en français (RefGen, an automatic identification tool of reference chains in French)
Laurence Longo | Amalia Todirascu
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Using Cognates in a French-Romanian Lexical Alignment System: A Comparative Study
Mirabela Navlea | Amalia Todiraşcu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011


RefGen : un module d’identification des chaînes de référence dépendant du genre textuel
Laurence Longo | Amalia Todiraşcu
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Dans cet article, nous présentons RefGen, un module d’identification des chaînes de référence pour le français. RefGen effectue une annotation automatique des expressions référentielles puis identifie les relations de coréférence établies entre ces expressions pour former des chaînes de référence. Le calcul de la référence utilise des propriétés des chaînes de référence dépendantes du genre textuel, l’échelle d’accessibilité d’(Ariel, 1990) et une série de filtres lexicaux, morphosyntaxiques et sémantiques. Nous évaluons les premiers résultats de RefGen sur un corpus issu de rapports publics.


A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions
Amalia Todiraşcu | Dan Tufiş | Ulrich Heid | Christopher Gledhill | Dan Ştefanescu | Marion Weller | François Rousselot
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-indepedent statistical techniques followed by a linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by the systemic functional grammar, proposing three level analysis: lexical, functional and semantic criteria. From tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties helping to filter the output of the statistical methods and to extract some potential interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.


Experiments on Building Language Resources for Multi-Modal Dialogue Systems
Laurent Romary | Amalia Todirascu | David Langlois
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Introduction to ROMAND 2004
Vincenzo Pallotta | Amalia Todirascu
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)


Towards Reusable NLP Components
Amalia Todirascu | Eric Kow | Laurent Romary
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)


Ontologies for Information Retrieval
Amalia Todiraşcu | François Rousselot
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The paper presents a system for querying (in natural language) a set of text documents from a limited domain. The domain knowledge, represented in description logics (DL), is used for filtering the documents returned as answer and it is extended dynamically (when new concepts are identified in the texts), as result of DL inference mechanisms. The conceptual hierarchy is built semi-automatically from the texts. Concept instances are identified using shallow natural language parsing techniques.