Lucie-Aimée Kaffee

Also published as: Lucie-aimée Kaffee


2024

Wiki-VEL: Visual Entity Linking for Structured Data on Wikimedia Commons
Philipp Bielefeld | Jasmin Geppert | Necdet Güven | Melna John | Adrian Ziupka | Lucie-Aimée Kaffee | Russa Biswas | Gerard De Melo
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Describing Wikimedia Commons images using Wikidata’s structured data enables a wide range of automation tasks, such as search and organization, as well as downstream tasks, such as labeling images or training machine learning models. However, there is currently a lack of structured data-labelled images on Wikimedia Commons. To close this gap, we propose the task of Visual Entity Linking (VEL) for Wikimedia Commons, in which we create new labels for Wikimedia Commons images from Wikidata items. VEL is a crucial tool for improving information retrieval, search, content understanding, cross-modal applications, and various machine-learning tasks. In this paper, we propose a method to create new labels for Wikimedia Commons images from Wikidata items. To this end, we create a novel dataset leveraging community-created structured data on Wikimedia Commons and fine-tuning pre-trained models based on the CLIP architecture. Although the best-performing models show promising results, the study also identifies key challenges of the data and the task.

Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)
Russa Biswas | Lucie-Aimée Kaffee | Oshin Agarwal | Pasquale Minervini | Sameer Singh | Gerard de Melo
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
Isaac Johnson | Lucie-Aimée Kaffee | Miriam Redi
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia

Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.

Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit’s Showerthoughts
Tolga Buz | Benjamin Frost | Nikola Genchev | Moritz Schneider | Lucie-Aimée Kaffee | Gerard de Melo
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa-based classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide the dataset for creative, witty text generation based on Reddit Showerthoughts posts.

2023

Thorny Roses: Investigating the Dual Use Dilemma in Natural Language Processing
Lucie-Aimée Kaffee | Arnav Arora | Zeerak Talat | Isabelle Augenstein
Findings of the Association for Computational Linguistics: EMNLP 2023

Dual use, the intentional, harmful reuse of technology and scientific artefacts, is an ill-defined problem within the context of Natural Language Processing (NLP). As large language models (LLMs) have advanced in their capabilities and become more accessible, the risk of their intentional misuse becomes more prevalent. To prevent such intentional malicious use, it is necessary for NLP researchers and practitioners to understand and mitigate the risks of their research. Hence, we present an NLP-specific definition of dual use informed by researchers and practitioners in the field. Further, we propose a checklist focusing on dual-use in NLP, that can be integrated into existing conference ethics-frameworks. The definition and checklist are created based on a survey of NLP researchers and practitioners.

Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
Arnav Arora | Lucie-aimée Kaffee | Isabelle Augenstein
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-trained Language Models (PLMs). However, there has been no systematic study investigating how values embedded in these models vary across cultures. In this paper, we introduce probes to study which cross-cultural values are embedded in these models, and whether they align with existing theories and cross-cultural values surveys. We find that PLMs capture differences in values across cultures, but those only weakly align with established values surveys. We discuss implications of using mis-aligned models in cross-cultural settings, as well as ways of aligning PLMs with values surveys.

Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions
Lucie-Aimée Kaffee | Arnav Arora | Isabelle Augenstein
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The moderation of content on online platforms is usually non-transparent. On Wikipedia, however, this discussion is carried out publicly and editors are encouraged to use the content moderation policies as explanations for making moderation decisions. Currently, only a few comments explicitly mention those policies – 20% of the English ones, but as few as 2% of the German and Turkish comments. To aid in this process of understanding how content is moderated, we construct a novel multilingual dataset of Wikipedia editor discussions along with their reasoning in three languages. The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision. We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process. We release both our joint prediction models and the multilingual content moderation dataset for further research on automated transparent content moderation.

2018

Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata
Lucie-Aimée Kaffee | Hady Elsahar | Pavlos Vougiouklis | Christophe Gravier | Frédérique Laforest | Jonathon Hare | Elena Simperl
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence and comprehensible textual summaries from Wikidata triples. We demonstrate the effectiveness of the proposed approach by evaluating it against a set of baselines on two languages of different natures: Arabic, a morphologically rich language with a larger vocabulary than English, and Esperanto, a constructed language known for its easy acquisition.