2024
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Elena Volodina | David Alfter | Simon Dobnik | Therese Lindström Tiedemann | Ricardo Muñoz Sánchez | Maria Irena Szawerna | Xuan-Son Vu
Detecting Personal Identifiable Information in Swedish Learner Essays
Maria Irena Szawerna | Simon Dobnik | Ricardo Muñoz Sánchez | Therese Lindström Tiedemann | Elena Volodina
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Linguistic data can, and often does, contain PII (Personal Identifiable Information). From both a legal and an ethical standpoint, the sharing of such data is not permissible. According to the GDPR, pseudonymization, i.e. the replacement of sensitive information with surrogates, is an acceptable strategy for privacy preservation. While research has been conducted on the detection and replacement of sensitive data in Swedish medical data using Large Language Models (LLMs), it is unclear whether these models handle PII in less structured and more thematically varied texts equally well. In this paper, we present and discuss the performance of an LLM-based PII-detection system for Swedish learner essays.
Did the Names I Used within My Essay Affect My Score? Diagnosing Name Biases in Automated Essay Scoring
Ricardo Muñoz Sánchez | Simon Dobnik | Maria Irena Szawerna | Therese Lindström Tiedemann | Elena Volodina
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Automated essay scoring (AES) of second-language learner essays is a high-stakes task, as it can affect the job and educational opportunities a student may have access to. Thus, it becomes imperative to make sure that the essays are graded based on the students’ language proficiency as opposed to other factors, such as personal names used in the text of the essay. Moreover, most of the research data for AES tends to contain personal identifiable information. Because of that, pseudonymization becomes an important tool to make sure that this data can be freely shared. Thus, our systems should not grade students based on which given names were used in the text of the essay, both for fairness and for privacy reasons. In this paper we explore how given names affect the CEFR level classification of essays written by second-language learners of Swedish. We use essays containing just one personal name and replace it with names from lists of given names from four different ethnic origins, namely Swedish, Finnish, Anglo-American, and Arabic. We find that changing the names within the essays has no apparent effect on the classification task, regardless of whether a feature-based or a transformer-based model is used.
Pseudonymization Categories across Domain Boundaries
Maria Irena Szawerna | Simon Dobnik | Therese Lindström Tiedemann | Ricardo Muñoz Sánchez | Xuan-Son Vu | Elena Volodina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Linguistic data, a component critical not only for research in a variety of fields but also for the development of various Natural Language Processing (NLP) applications, can contain personal information. As a result, its accessibility is limited, from both a legal and an ethical standpoint. One of the solutions is the pseudonymization of the data. Key stages of this process include the identification of sensitive elements and the generation of suitable surrogates in such a way that the data is still useful for the intended task. Within this paper, we conduct an analysis of tagsets that have previously been utilized in anonymization and pseudonymization. We also investigate what kinds of Personally Identifiable Information (PII) appear in various domains. These analyses reveal that none of the analyzed tagsets accounts for all of the PII types present across domains at the level of detail seemingly required for pseudonymization. We advocate for a universal system of tags for categorizing PII leading up to its replacement. Such categorization could facilitate the generation of grammatically, semantically, and sociolinguistically appropriate surrogates for the kinds of information that are considered sensitive in a given domain, resulting in a system that would enable dynamic pseudonymization while keeping the texts readable and useful for future research in various fields.
2021
Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts versus Non-Experts
David Alfter | Therese Lindström Tiedemann | Elena Volodina
Northern European Journal of Language Technology, Volume 7
CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish
Elena Volodina | Yousuf Ali Mohammed | Therese Lindström Tiedemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
The paper introduces a new resource, CoDeRooMor, for studying the morphology of modern Swedish word formation. The approximately 16,000 lexical items in the resource have been manually segmented into word-formation morphemes and labeled for their categories, such as prefixes, suffixes, roots, etc. Word-formation mechanisms, such as derivation and compounding, have been associated with each item on the list. The article describes the selection of items for manual annotation and the principles of annotation, reports on the reliability of the manual annotation, and presents tools, resources and some first statistics. Given the “gold” nature of the resource, it is possible to use it for empirical studies as well as to develop linguistically aware algorithms for morpheme segmentation and labeling (cf. the statistical subword approach). The resource will be made freely available.
2019
LEGATO: A flexible lexicographic annotation tool
David Alfter | Therese Lindström Tiedemann | Elena Volodina
Proceedings of the 22nd Nordic Conference on Computational Linguistics
This article is a report from an ongoing project aiming at analyzing lexical and grammatical competences in Swedish as a second language (L2). To facilitate lexical analysis, we need access to metalinguistic information about relevant vocabulary that L2 learners can use and understand. The focus of the current article is on the lexical annotation of the vocabulary scope for a range of lexicographical aspects, such as morphological analysis, valency, types of multi-word units, etc. We perform parts of the analysis automatically, and other parts manually. The rationale behind this is that where information cannot be added automatically, manual effort is needed. To facilitate the latter, a tool, LEGATO, has been designed and implemented and is currently undergoing active testing.
2014
A flexible language learning platform based on language resources and web services
Elena Volodina | Ildikó Pilán | Lars Borin | Therese Lindström Tiedemann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present Lärka, the language learning platform of Språkbanken (the Swedish Language Bank). It consists of an exercise generator which reuses resources available through Språkbanken: mainly Korp, the corpus infrastructure, and Karp, the lexical infrastructure. Through Lärka we reach new user groups (students and teachers of linguistics as well as second language learners and their teachers) and in this way bring Språkbanken’s resources to them in a relevant format. Lärka can therefore be viewed as a case of real-life language resource evaluation with end users. In this article we describe Lärka’s architecture, its user interface, and the five exercise types that have been released for users so far. The first user evaluation, following in-class usage with students of linguistics and speech therapy and with teacher candidates, is presented. The outline of future work concludes the paper.