Michelle Wastl

2026

It’s Not What You Say, It’s How You Say It: Evaluating LLM Responses to Expressions of Belief
Kevin Du | Clara K\"umpel | Michelle Wastl | Alex Warstadt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Users frequently express their beliefs to large language models (LLMs). In some situations, the LLM should accept these contextual beliefs as true. In others, they should stick to their prior knowledge. Notably, users’ expressions of belief (EoBs) can take linguistically diverse forms—using presuppositions, evidential and certainty markers, or varied tones—each of which may have a different persuasiveness over the LLMs. We introduce a typology to systematically evaluate how different EoBs affect whether models follow context versus prior knowledge. The typology is grounded in four linguistically motivated dimensions: form, evidentiality, epistemic stance, and tone, spanning 17 fine-grained types. By pairing these EoBs with world knowledge facts, we generate controlled EoB–query pairs that isolate the effect of linguistic variation. Using this benchmark, we evaluate 16 LLMs that differ in architecture (Llama3, Qwen3, Gemma3), scale (1B-30B parameters), and training stages (base vs instruct). We identify meaningful variations in response behavior across these axes, e.g., that bigger models and instruction models tend to be less context–following than smaller models and base models. We further identify specific EoBs that statistically significantly persuade LMs more consistently than others. Our work reveals systematic patterns in how linguistic framing affects LLM context integration, with implications for prompt engineering and model robustness.

pdf bib abs

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents
Michelle Wastl | Jannis Vamvas | Rico Sennrich
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recognizing semantic differences across documents is crucial for text generation evaluation and content alignment, especially in cross-lingual settings. However, as a standalone task, it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English–German, English–French, and English–Italian with token-level difference annotations by human annotators.We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models.

2025

pdf bib

20min-XD: A Comparable Corpus of Swiss News Articles
Michelle Wastl | Jannis Vamvas | Selena Calleri | Rico Sennrich
Proceedings of the 10th edition of the Swiss Text Analytics Conference

pdf bib abs

Machine Translation Models are Zero-Shot Detectors of Translation Direction
Michelle Wastl | Jannis Vamvas | Rico Sennrich
Findings of the Association for Computational Linguistics: ACL 2025

Detecting the translation direction of parallel text has applications for machine translation training and evaluation, but also has forensic applications, such as resolving plagiarism or forgery allegations. In this work, we explore an unsupervised approach to translation direction detection based on the simple hypothesis that p(translation|original)>p(original|translation), motivated by the well-known simplification effect in translationese or machine-translationese. In experiments with multilingual machine translation models across 20 translation directions, we confirm the effectiveness of the approach for high-resource language pairs, achieving document-level accuracies of 82–96% for NMT-produced translations, and 60–81% for human translations, depending on the model used.

pdf bib abs

UZH at SemEval-2025 Task 3: Token-Level Self-Consistency for Hallucination Detection
Michelle Wastl | Jannis Vamvas | Rico Sennrich
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents our system developed for the SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The objective of this task is to identify spans of hallucinated text in the output of large language models across 14 high- and low- resource languages. To address this challenge, we propose two consistency-based approaches: (a) token-level consistency with a superior LLM and (b) token-level self-consistency with the underlying model of the sequence that is to be evaluated. Our results show effectiveness when compared to simple mark-all baselines, competitiveness to other submissions of the shared task and for some languages to GPT4o- mini prompt-based approaches.

Michelle Wastl

2026

2025

2024

Co-authors

Venues