Ildikó Pilán

Also published as: Ildiko Pilan, Ildikó Pilan

2022

The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization
Ildikó Pilán | Pierre Lison | Lilja Øvrelid | Anthi Papadopoulou | David Sánchez | Montserrat Batet
Computational Linguistics, Volume 48, Issue 4 - December 2022

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models, is available at https://github.com/NorskRegnesentral/text-anonymization-benchmark.

Bootstrapping Text Anonymization Models with Distant Supervision
Anthi Papadopoulou | Pierre Lison | Lilja Øvrelid | Ildikó Pilán
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee k-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a RoBERTa-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise if the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions.

2021

Anonymisation Models for Text Data: State of the art, Challenges and Future Directions
Pierre Lison | Ildikó Pilán | David Sanchez | Montserrat Batet | Lilja Øvrelid
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This position paper investigates the problem of automated text anonymisation, which is a prerequisite for secure sharing of documents containing sensitive information about individuals. We summarise the key concepts behind text anonymisation and provide a review of current approaches. Anonymisation methods have so far been developed in two fields with little mutual interaction, namely natural language processing and privacy-preserving data publishing. Based on a case study, we outline the benefits and limitations of these approaches and discuss a number of open challenges, such as (1) how to account for multiple types of semantic inferences, (2) how to strike a balance between disclosure risk and data utility and (3) how to evaluate the quality of the resulting anonymisation. We lay out a case for moving beyond sequence labelling models and incorporating explicit measures of disclosure risk into the text anonymisation process.

Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning
David Alfter | Elena Volodina | Ildikó Pilan | Johannes Graën | Lars Borin
Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning

2020

Loss of consciousness, so-called syncope, is a commonly occurring symptom associated with worse prognosis for a number of heart-related diseases. We present a comparison of methods for a diagnosis classification task in Norwegian clinical notes, targeting syncope, i.e. fainting cases. We find that an often neglected baseline with keyword matching constitutes a rather strong basis, but more advanced methods do offer some improvement in classification performance, especially a convolutional neural network model. The developed pipeline is planned to be used for quantifying unregistered syncope cases in Norway.

A Dataset for Investigating the Impact of Feedback on Student Revision Outcome
Ildiko Pilan | John Lee | Chak Yan Yeung | Jonathan Webster
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present an annotation scheme and a dataset of teacher feedback provided for texts written by non-native speakers of English. The dataset consists of student-written sentences in their original and revised versions with teacher feedback provided for the errors. Feedback appears both in the form of open-ended comments and error category tags. We focus on a specific error type, namely linking adverbial (e.g. however, moreover) errors. The dataset has been annotated for two aspects: (i) revision outcome, establishing whether the rewritten student sentence was correct, and (ii) directness, indicating whether teachers explicitly provided the correction in their feedback. This dataset allows for studies of the characteristics of teacher feedback and how these influence students’ revision outcome. We describe the data preparation process and present initial statistical investigations regarding the effect of different feedback characteristics on revision outcome. These show that open-ended comments and mitigating expressions appear in a higher proportion of successful revisions than unsuccessful ones, while directness and metalinguistic terms have no effect. Given that the use of this type of data is relatively unexplored in natural language processing (NLP) applications, we also report some observations and challenges when working with feedback data.

Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning
David Alfter | Elena Volodina | Ildikó Pilan | Herbert Lange | Lars Borin
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

Building a Norwegian Lexical Resource for Medical Entity Recognition
Ildiko Pilan | Pål H. Brekke | Lilja Øvrelid
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

We present a large Norwegian lexical resource of categorized medical terms. The resource, which merges information from large medical databases, contains over 56,000 entries, including automatically mapped terms from a Norwegian medical dictionary. We describe the methodology behind this automatic dictionary entry mapping based on keywords and suffixes and further present the results of a manual evaluation performed on a subset by a domain expert. The evaluation indicated that ca. 80% of the mappings were correct.

Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

2019

Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning
David Alfter | Elena Volodina | Lars Borin | Ildikó Pilan | Herbert Lange
Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning

2018

SB@GU at the Complex Word Identification 2018 Shared Task
David Alfter | Ildikó Pilán
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we describe our experiments for the Shared Task on Complex Word Identification (CWI) 2018 (Yimam et al., 2018), hosted by the 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at NAACL 2018. Our system for English builds on previous work for Swedish concerning the classification of words into proficiency levels. We investigate different features for English and compare their usefulness using feature selection methods. For the German, Spanish and French data we use simple systems based on character n-gram models and show that sometimes simple models achieve comparable results to fully feature-engineered systems.

Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors
Ildikó Pilán | Elena Volodina
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The presence of misspellings and other errors or non-standard word forms poses a considerable challenge for NLP systems. Although several supervised approaches have been proposed previously to normalize these, annotated training data is scarce for many languages. We investigate, therefore, an unsupervised method where correction candidates for Swedish language learners’ errors are retrieved from word embeddings. Furthermore, we compare the usefulness of combining cosine similarity with orthographic and phonological similarity based on a neural grapheme-to-phoneme conversion system we train for this purpose. Although combinations of similarity measures have been explored for finding error correction candidates, it remains unclear how these measures relate to each other and how much they contribute individually to identifying the correct alternative. We experiment with different combinations of these and find that integrating phonological information is especially useful when the majority of learner errors are related to misspellings, but less so when errors are of a variety of types including, e.g. grammatical errors.

Investigating the importance of linguistic complexity features across different datasets related to language learning
Ildikó Pilán | Elena Volodina
Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing

We present the results of our investigations aiming at identifying the most informative linguistic complexity features for classifying language learning levels in three different datasets. The datasets vary across two dimensions: the size of the instances (texts vs. sentences) and the language learning skill they involve (reading comprehension texts vs. texts written by learners themselves). We present a subset of the most predictive features for each dataset, taking into consideration significant differences in their per-class mean values and show that these subsets lead not only to simpler models, but also to an improved classification performance. Furthermore, we pinpoint fourteen central features that are good predictors regardless of the size of the linguistic unit analyzed or the skills involved, which include both morpho-syntactic and lexical dimensions.

Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning
Ildikó Pilán | Elena Volodina | David Alfter | Lars Borin
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

2017

Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition
Elena Volodina | Gintarė Grigonytė | Ildikó Pilán | Kristina Nilsson Björkenstam | Lars Borin
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

2016

Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks
Ildikó Pilán | Elena Volodina | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowledge in the trained model from another domain consisting of input texts written by teaching professionals for learners. We compare different domain adaptation techniques and find that a weighted combination of the two types of data performs best, which can even rival systems based on considerably larger amounts of in-domain data. Moreover, we show that normalizing errors in learners’ texts can substantially improve classification when level-annotated in-domain data is not available.

Detecting Context Dependence in Exercise Item Candidates Selected from Corpora
Ildikó Pilán
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Coursebook Texts as a Helping Hand for Classifying Linguistic Complexity in Language Learners’ Writings
Ildikó Pilán | David Alfter | Elena Volodina
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

We bring together knowledge from two different types of language learning data, texts learners read and texts they write, to improve linguistic complexity classification in the latter. Linguistic complexity in the foreign and second language learning context can be expressed in terms of proficiency levels. We show that incorporating features capturing lexical complexity information from reading passages can significantly boost the machine-learning-based classification of learner-written texts into proficiency levels. With an F1 score of 0.8, our system rivals state-of-the-art results reported for other languages for this task. Finally, we present a freely available web-based tool for proficiency level classification and lexical complexity visualization for both learner writings and reading texts.

From distributions to labels: A lexical proficiency analysis using learner corpora
David Alfter | Yuri Bizzoni | Anders Agebjörn | Elena Volodina | Ildikó Pilán
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

SweLLex: Second language learners’ productive vocabulary
Elena Volodina | Ildikó Pilán | Lorena Llozhi | Baptiste Degryse | Thomas François
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies
Elena Volodina | Ildikó Pilán | Ingegerd Enström | Lorena Llozhi | Peter Lundkvist | Gunlög Sundberg | Monica Sandell
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new resource for Swedish, SweLL, a corpus of Swedish Learner essays linked to learners’ performance according to the Common European Framework of Reference (CEFR). SweLL consists of three subcorpora (SpIn, SW1203 and Tisus) collected from three different educational establishments. The common metadata for all subcorpora includes age, gender, native languages, time of residence in Sweden, and type of written task. Depending on the subcorpus, learner texts may contain additional information, such as text genres, topics, and grades. Five of the six CEFR levels are represented in the corpus: A1, A2, B1, B2 and C1, comprising in total 339 essays. The C2 level is not included since courses at C2 level are not offered. The workflow consists of collection of essays and permits, essay digitization and registration, metadata annotation, and automatic linguistic annotation. Inter-rater agreement is presented on the basis of the SW1203 subcorpus. The work on SweLL is still ongoing, with more than 100 essays waiting in the pipeline. This article describes both the resource and the “how-to” behind the compilation of SweLL.

SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas François | Elena Volodina | Ildikó Pilán | Anaïs Tack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper introduces SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language that describes the distribution of 15,681 words and expressions across the Common European Framework of Reference (CEFR). The resource is based on a corpus of coursebook texts, and thus describes receptive vocabulary learners are exposed to during reading activities, as opposed to productive vocabulary they use when speaking or writing. The paper describes the methodology applied to create the list and to estimate the frequency distribution. It also discusses some characteristics of the resulting resource and compares it to other lexical resources for Swedish. An interesting feature of this resource is the possibility to separate the wheat from the chaff, distinguishing the core vocabulary at each level, i.e. vocabulary shared by several coursebook writers at each level, from peripheral vocabulary, which is used by only a minority of the coursebook writers.

Candidate sentence selection for language learning exercises: from a comprehensive framework to an empirical evaluation
Ildikó Pilán | Elena Volodina | Lars Borin
Traitement Automatique des Langues, Volume 57, Numéro 3 : TALP et didactique [NLP for Learning and Teaching]

2015

Helping Swedish words come to their senses: word-sense disambiguation based on sense associations from the SALDO lexicon
Ildikó Pilán
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Proceedings of the fourth workshop on NLP for computer-assisted language learning
Elena Volodina | Lars Borin | Ildikó Pilán
Proceedings of the fourth workshop on NLP for computer-assisted language learning

2014

Reusing Swedish FrameNet for training semantic roles
Ildikó Pilán | Elena Volodina
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this article we present the first experiences of reusing the Swedish FrameNet (SweFN) as a resource for training semantic roles. We give an account of the procedure we used to adapt SweFN to the needs of students of Linguistics in the form of an automatically generated exercise. During this adaptation, the mapping of the fine-grained distinction of roles from SweFN into learner-friendlier coarse-grained roles presented a major challenge. Besides discussing the details of this mapping, we describe the resulting multiple-choice exercise and its graphical user interface. The exercise was made available through Lärka, an online platform for students of Linguistics and learners of Swedish as a second language. We also outline aspects underlying the selection of the incorrect answer options, which include semantic as well as frequency-based criteria. Finally, we present our own observations and initial user feedback about the applicability of such a resource in the pedagogical domain. Students’ answers indicated an overall positive experience; the majority found the exercise useful for learning semantic roles.

A flexible language learning platform based on language resources and web services
Elena Volodina | Ildikó Pilán | Lars Borin | Therese Lindström Tiedemann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present Lärka, the language learning platform of Språkbanken (the Swedish Language Bank). It consists of an exercise generator which reuses resources available through Språkbanken: mainly Korp, the corpus infrastructure, and Karp, the lexical infrastructure. Through Lärka we reach new user groups (students and teachers of Linguistics as well as second language learners and their teachers) and this way bring Språkbanken’s resources in a relevant format to them. Lärka can therefore be viewed as a case of real-life language resource evaluation with end users. In this article we describe Lärka’s architecture, its user interface, and the five exercise types that have been released for users so far. The first user evaluation, following in-class usage with students of linguistics, speech therapy and teacher candidates, is presented. The outline of future work concludes the paper.

Rule-based and machine learning approaches for second language sentence-level readability
Ildikó Pilán | Elena Volodina | Richard Johansson
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

Proceedings of the third workshop on NLP for computer-assisted language learning
Elena Volodina | Lars Borin | Ildikó Pilán
Proceedings of the third workshop on NLP for computer-assisted language learning

You Get what You Annotate: A Pedagogically Annotated Corpus of Coursebooks for Swedish as a Second Language
Elena Volodina | Ildikó Pilán | Stian Rødven Eide | Hannes Heidarsson
Proceedings of the third workshop on NLP for computer-assisted language learning