Hannah Sterz


2025

pdf bib
ReCoVeR the Target Language: Language Steering without Sacrificing Task Performance
Hannah Sterz | Fabian David Schmidt | Goran Glavaš | Ivan Vulić
Findings of the Association for Computational Linguistics: EMNLP 2025

As they become increasingly multilingual, Large Language Models (LLMs) exhibit more language confusion, i.e., they tend to generate answers in a language different from the language of the prompt or the answer language explicitly requested by the user. In this work, we propose ReCoVeR (REducing language COnfusion in VEctor Representations), a novel lightweight approach for reducing language confusion based on language-specific steering vectors. We first isolate language vectors with the help of multi-parallel corpus and then effectively leverage those vectors for effective LLM steering via fixed (i.e., unsupervised) as well as trainable steering functions. Our extensive evaluation, encompassing three benchmarks and 18 languages, shows that ReCoVeR effectively mitigates language confusion in both monolingual and cross-lingual setups while at the same time—and in contrast to prior language steering methods—retaining task performance. Our data code is available at https://github.com/hSterz/recover.

2024

pdf bib
M2QA: Multi-domain Multilingual Question Answering
Leon Engländer | Hannah Sterz | Clifton A Poth | Jonas Pfeiffer | Ilia Kuznetsov | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2024

Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark.M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation.We witness **1)** considerable performance _variations_ across domain-language combinations within model classes and **2)** considerable performance _drops_ between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary.

2023

pdf bib
Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
Clifton Poth | Hannah Sterz | Indraneil Paul | Sukannya Purkayastha | Leon Engländer | Timo Imhof | Ivan Vulić | Sebastian Ruder | Iryna Gurevych | Jonas Pfeiffer
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce Adapters, an open-source library that unifies parameter-efficient and modular transfer learning in large language models. By integrating 10 diverse adapter methods into a unified interface, Adapters offers ease of use and flexible configuration. Our library allows researchers and practitioners to leverage adapter modularity through composition blocks, enabling the design of complex adapter setups. We demonstrate the library’s efficacy by evaluating its performance against full fine-tuning on various NLP tasks. Adapters provides a powerful tool for addressing the challenges of conventional fine-tuning paradigms and promoting more efficient and modular transfer learning. The library is available via https://adapterhub.ml/adapters.

pdf bib
ML Mob at SemEval-2023 Task 1: Probing CLIP on Visual Word-Sense Disambiguation
Clifton Poth | Martin Hentschel | Tobias Werner | Hannah Sterz | Leonard Bongard
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Successful word sense disambiguation (WSD)is a fundamental element of natural languageunderstanding. As part of SemEval-2023 Task1, we investigate WSD in a multimodal setting,where ambiguous words are to be matched withcandidate images representing word senses. Wecompare multiple systems based on pre-trainedCLIP models. In our experiments, we findCLIP to have solid zero-shot performance onmonolingual and multilingual data. By em-ploying different fine-tuning techniques, we areable to further enhance performance. However,transferring knowledge between data distribu-tions proves to be more challenging.

pdf bib
ML Mob at SemEval-2023 Task 5: “Breaking News: Our Semi-Supervised and Multi-Task Learning Approach Spoils Clickbait”
Hannah Sterz | Leonard Bongard | Tobias Werner | Clifton Poth | Martin Hentschel
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Online articles using striking headlines that promise intriguing information are often used to attract readers. Most of the time, the information provided in the text is disappointing to the reader after the headline promised exciting news. As part of the SemEval-2023 challenge, we propose a system to generate a spoiler for these headlines. The spoiler provides the information promised by the headline and eliminates the need to read the full article. We consider Multi-Task Learning and generating more data using a distillation approach in our system. With this, we achieve an F1 score up to 51.48% on extracting the spoiler from the articles.

2022

pdf bib
UKP-SQUARE: An Online Platform for Question Answering Research
Tim Baumgärtner | Kexin Wang | Rachneet Sachdeva | Gregor Geigle | Max Eichler | Clifton Poth | Hannah Sterz | Haritz Puerto | Leonardo F. R. Ribeiro | Jonas Pfeiffer | Nils Reimers | Gözde Şahin | Iryna Gurevych
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Recent advances in NLP and information retrieval have given rise to a diverse set of question answering tasks that are of different formats (e.g., extractive, abstractive), require different model architectures (e.g., generative, discriminative), and setups (e.g., with or without retrieval). Despite having a large number of powerful, specialized QA pipelines (which we refer to as Skills) that consider a single domain, model or setup, there exists no framework where users can easily explore and compare such pipelines and can extend them according to their needs. To address this issue, we present UKP-SQuARE, an extensible online QA platform for researchers which allows users to query and analyze a large collection of modern Skills via a user-friendly web interface and integrated behavioural tests. In addition, QA researchers can develop, manage, and share their custom Skills using our microservices that support a wide range of models (Transformers, Adapters, ONNX), datastores and retrieval techniques (e.g., sparse and dense). UKP-SQuARE is available on https://square.ukp-lab.de