The process of adapting and creating Easy-to-Read (E2R) texts is very expensive and time-consuming. Given the success of Large Language Models (LLMs) such as ChatGPT and their ability to generate written language, it is natural to ask whether such models can help in the adaptation or creation of E2R texts. In this paper, we explore the concept of E2R, its underlying principles and applications, and provide a preliminary study on the usefulness of ChatGPT-4 for E2R text adaptation. We focus on the Spanish language and its E2R variant, Lectura Fácil (LF). We consider a range of prompts that can be used and the differences in the output they produce. We then carry out a three-fold evaluation of 10 texts adapted by ChatGPT-4: (1) an automated evaluation of values related to the readability of the texts, (2) a checklist-based manual evaluation (for which we also propose three new capabilities), and (3) a user evaluation with people with cognitive disabilities. We show that it is difficult to choose the best prompt to make ChatGPT-4 adapt texts to LF. Furthermore, the generated output does not follow the E2R text rules, so it is often not suitable for the target audience.
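The following is a minimal sketch, not the paper's actual prompts or evaluation pipeline, of how a GPT-4 model can be asked to adapt a Spanish text to Lectura Fácil via the OpenAI chat API, with a rough automated readability check using the Fernández-Huerta index from the textstat package (the prompt wording and example sentence are invented for illustration):

```python
# Illustrative sketch only: prompt a GPT-4 model to adapt a Spanish text to
# Lectura Fácil, then compare Spanish readability scores before and after.
from openai import OpenAI
import textstat

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Adapta el siguiente texto a Lectura Fácil: usa frases cortas, "
    "vocabulario sencillo y una idea por frase.\n\n{text}"
)

def adapt_to_lf(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content

original = "La convocatoria de ayudas será publicada en el boletín oficial."
adapted = adapt_to_lf(original)

textstat.set_lang("es")
# Fernández-Huerta: higher values indicate easier Spanish text.
print(textstat.fernandez_huerta(original), textstat.fernandez_huerta(adapted))
```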
Disinformation has become increasingly relevant in recent years, both as a political issue and as an object of research. Datasets for training machine learning models, especially for languages other than English, are sparse, and their creation is costly. Annotated datasets often have only binary or multiclass labels, which provide little information about the grounds and system of such classifications. We propose GerDISDETECT, a novel textual dataset for German disinformation. To provide comprehensive analytical insights, a fine-grained, taxonomy-guided annotation scheme is required. Instead of providing a direct assessment of whether an article is true or false, the goal of this dataset is to provide wide-ranging semantic descriptors that allow for complex interpretation as well as inferred decision-making regarding the information and trustworthiness of potentially critical articles. This also allows the dataset to be used for other tasks. The dataset was collected in the first three months of 2022 and contains 39 multilabel classes with 5 top-level categories for a total of 1,890 articles: General View (3 labels), Offensive Language (11 labels), Reporting Style (15 labels), Writing Style (6 labels), and Extremism (4 labels). As a baseline, we further pre-trained a multilingual XLM-R model on around 200,000 unlabeled news articles and fine-tuned it for each category.
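As a hedged illustration of the kind of baseline described above (not the authors' exact setup), the sketch below fine-tunes XLM-RoBERTa for multilabel classification of one category, e.g. the 15 "Reporting Style" labels; the example text, label assignment, and layer configuration are assumptions:

```python
# Sketch: multilabel fine-tuning of XLM-R for one GerDISDETECT category.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 15  # e.g. the Reporting Style category
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

texts = ["Beispielartikel über ein aktuelles Thema ..."]
labels = torch.zeros((1, NUM_LABELS))
labels[0, 3] = 1.0  # article carries one (hypothetical) label of the category

batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # single step; use a Trainer/optimizer loop in practice
```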
Easy-to-Read (E2R) is an approach to content creation that emphasizes simplicity and clarity in language to make texts more accessible to readers with cognitive challenges or learning disabilities. The Spanish version of E2R is called Lectura Fácil (LF). E2R and its variants, such as LF, focus on straightforward language and structure to enhance readability. The manual production of such texts is both time-consuming and resource-intensive. In this work, we have developed LFWriteAssist, an authoring support tool that aligns with the guidelines of LF. It is underpinned by the functionalities of LanguageTool, a free and open-source grammar, style and spelling checker. Our tool assists in ensuring compliance with the LF standard, provides definitions for complex, polysemic, or infrequently used terms, and expands acronyms. The tool is primarily targeted at LF creators, as it serves as an authoring aid, identifying any rule infringements and assisting with language simplifications. However, it can be used by anyone who seeks to enhance text readability and inclusivity. The tool's code is made available as open source, thereby contributing to the wider effort of creating inclusive and comprehensible content.
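A minimal sketch of the kind of rule checking such a tool builds on, using the language_tool_python wrapper around LanguageTool for Spanish; the LF-specific heuristic shown (a sentence-length threshold) is an illustrative assumption, not one of LFWriteAssist's actual rules:

```python
# Sketch: combine LanguageTool grammar/style checks with a simple LF heuristic.
import language_tool_python

tool = language_tool_python.LanguageTool("es")

def check_text(text: str, max_words_per_sentence: int = 15):
    issues = [f"{m.ruleId}: {m.message}" for m in tool.check(text)]
    for sentence in text.split("."):
        words = sentence.split()
        if len(words) > max_words_per_sentence:
            issues.append(
                f"Sentence too long for LF ({len(words)} words): {sentence.strip()}"
            )
    return issues

sample = (
    "Los solicitantes deberán cumplimentar el formulario correspondiente antes "
    "de la fecha límite establecida por la administración competente."
)
for issue in check_text(sample):
    print(issue)
```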
The Open Multilingual Wordnet (OMW) is an open-source project launched with the goal of making it easy to use wordnets in multiple languages without having to pay expensive proprietary licensing costs. As OMW evolved, the interlingual index (ILI) was used to allow semantically equivalent synsets in different languages to be linked to each other. OdeNet is the German-language wordnet that forms part of the OMW project. This paper analyses the shortcomings of the initial ILI classification in OdeNet and the subsequent methods used to improve this classification.
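The sketch below illustrates, under the assumption that the "odenet" and "oewn" lexicons from the wn Python library's index are installed, how ILI links let semantically equivalent synsets be retrieved across wordnets; it is an example of the mechanism, not of the classification analysis itself:

```python
# Sketch: follow ILI links from OdeNet synsets to English synsets with the wn library.
import wn

# wn.download("odenet"); wn.download("oewn")  # one-time downloads

for synset in wn.synsets("Hund", lexicon="odenet"):
    ili = synset.ili
    if ili is None:
        continue  # synset not yet linked to the interlingual index
    english = synset.translate(lexicon="oewn")  # equivalent synsets via the ILI
    print(ili.id, synset.lemmas(), [s.lemmas() for s in english])
```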
In this work, we present a new publicly available offensive language dataset of 10,278 German social media comments collected in the first half of 2021 that were annotated by a total of six annotators. With twelve different annotation categories, it is far more comprehensive than other datasets and goes beyond mere hate speech detection. In particular, the labels also target toxicity, criminal relevance and discrimination types of comments. Furthermore, about half of the comments come from coherent parts of conversations, which opens the possibility of considering the comments' contexts and performing conversation analyses in order to research the contagion of offensive language in conversations.
The Princeton WordNet for the English language has been used worldwide in NLP projects for many years. With the OMW initiative, wordnets for different languages of the world are being linked via identifiers. The parallel development and linking open up new multilingual application perspectives. The development of a wordnet for the German language also takes place in this context. To save development time, existing resources were combined and recompiled. The result was then evaluated and improved. In a relatively short time, a resource was created that can be used in projects and continuously improved and extended.
In this work, we present our approaches to the toxic comment classification task (Subtask 1) of the GermEval 2021 Shared Task. For this binary task, we propose three models: a German BERT transformer model; a multilayer perceptron that was first trained in parallel on textual input and 14 additional linguistic features, which were then concatenated in an additional layer; and a multilayer perceptron with both feature types as input. We enhanced our pre-trained transformer model by re-training it with over 1 million tweets and fine-tuning it on two additional German datasets of similar tasks. The embeddings of the final fine-tuned German BERT were taken as the textual input features for our neural networks. Our best models on the validation data were both neural networks; however, our enhanced German BERT achieved a higher score on the test data, with an F1-score of 0.5895.
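A rough PyTorch sketch of the combined-input idea, a multilayer perceptron over a German BERT sentence embedding concatenated with 14 handcrafted linguistic features; the layer sizes, dropout, and batch shapes are illustrative assumptions, not the paper's configuration:

```python
# Sketch: MLP over concatenated BERT embeddings and linguistic features.
import torch
import torch.nn as nn

class BertFeatureMLP(nn.Module):
    def __init__(self, bert_dim: int = 768, n_features: int = 14):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bert_dim + n_features, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),  # binary toxic / non-toxic logit
        )

    def forward(self, bert_embedding: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        x = torch.cat([bert_embedding, features], dim=-1)
        return self.mlp(x)

model = BertFeatureMLP()
logits = model(torch.randn(8, 768), torch.randn(8, 14))  # batch of 8 comments
loss = nn.BCEWithLogitsLoss()(logits.squeeze(-1), torch.randint(0, 2, (8,)).float())
```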
We describe ongoing work on adding pronunciation information to wordnets, as such information can indicate specific senses of a word. Many wordnets associate only a lemma form and a part-of-speech tag with their senses. At the same time, we are aware that additional linguistic information can be useful for identifying a specific sense of a wordnet lemma when it is encountered in a corpus. While previous work already deals with the addition of grammatical number or grammatical gender information to wordnet lemmas, we are investigating the linking of wordnet lemmas to pronunciation information, thus adding a speech-related modality to wordnets.
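A toy illustration of why pronunciation can disambiguate senses: the English lemma "lead" maps to different IPA forms depending on the synset. The synset identifiers and IPA strings below are invented for the example and do not reflect any existing resource:

```python
# Toy sketch: pronunciation as a cue for sense identification.
pronunciations = {
    ("lead", "metal-synset"): "/lɛd/",   # the chemical element
    ("lead", "guide-synset"): "/liːd/",  # to guide, to be in front
}

def senses_for_pronunciation(lemma: str, ipa: str):
    """Return synset ids whose recorded pronunciation matches the observed IPA."""
    return [synset for (l, synset), p in pronunciations.items() if l == lemma and p == ipa]

print(senses_for_pronunciation("lead", "/lɛd/"))  # -> ['metal-synset']
```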
We describe work on porting two large German lexical resources into the OntoLex-Lemon model in order to establish complementary interlinkings between them. One resource is OdeNet (Open German WordNet) and the other is a further development of the German version of the MMORPH morphological analyzer. We show how the Multiword Expressions (MWEs) contained in OdeNet can be morphologically specified by using the lexical representation and linking features of OntoLex-Lemon, which also support the formulation of restrictions on the usage of such expressions.
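As a hedged illustration of what such a representation looks like, the sketch below encodes a German multiword expression as an OntoLex-Lemon lexical entry with rdflib; the URIs under the "ex" namespace are invented for the example and do not reflect actual OdeNet identifiers or the paper's full modeling:

```python
# Sketch: an OdeNet-style MWE as an OntoLex-Lemon entry, serialized as Turtle.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/odenet/")

g = Graph()
g.bind("ontolex", ONTOLEX)

entry = EX["ins_Leben_rufen"]          # MWE "ins Leben rufen" (to bring into being)
form = EX["ins_Leben_rufen-form"]
sense = EX["ins_Leben_rufen-sense"]

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("ins Leben rufen", lang="de")))
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, ONTOLEX.reference, EX["synset-create"]))  # link to a synset

print(g.serialize(format="turtle"))
```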
In this paper we describe our current work on representing a recently created German lexical semantics resource in OntoLex-Lemon and in conformance with WordNet specifications. Besides presenting the representation effort, we show how OntoLex-Lemon can be used to bridge from WordNet-like resources to full lexical descriptions and to extend the coverage of wordnets to other types of lexical data, such as decomposition results (exemplified for German data) and inflectional phenomena (outlined here for English data).
This paper investigates the usefulness of automatic machine translation metrics when analyzing the impact of source reformulations on the quality of machine-translated user-generated content. We propose a novel framework for quickly identifying rewriting rules that improve or degrade the quality of MT output by relying on automatic metrics rather than human judgments. We find that this approach allows us to quickly identify rules that overlap between two language pairs (English-French and English-German) and specific cases where the rules' precision could be improved.
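A minimal sketch of the metric-based comparison, assuming BLEU via the sacrebleu package as the automatic metric (the paper does not prescribe this particular metric, and the sentences are made up): score the MT output of the original and of the reformulated source against the same reference, and treat the score difference as the effect of the rewriting rule.

```python
# Sketch: compare MT quality before and after a source reformulation with sacreBLEU.
import sacrebleu

reference = ["The device does not switch on anymore."]

mt_original = ["The device do not light up more."]         # MT of the raw user sentence
mt_rewritten = ["The device does not switch on anymore."]  # MT after source rewriting

score_before = sacrebleu.corpus_bleu(mt_original, [reference]).score
score_after = sacrebleu.corpus_bleu(mt_rewritten, [reference]).score

print(f"BLEU before: {score_before:.1f}, after: {score_after:.1f}")
if score_after > score_before:
    print("Rewriting rule judged as improving MT quality for this segment.")
```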
In this paper we describe SOBA, a sub-component of the SmartWeb multi-modal dialog system. SOBA is a component for ontology-based information extraction from soccer web pages for the automatic population of a knowledge base that can be used for domain-specific question answering. SOBA realizes a tight connection between the ontology, the knowledge base and the information extraction component. The originality of SOBA lies in the fact that it extracts information from heterogeneous sources such as tabular structures, text and image captions in a semantically integrated way. In particular, it stores extracted information in a knowledge base and, in turn, uses the knowledge base to interpret and link newly extracted information with respect to already existing entities.