Timm Dill


2025

pdf bib
SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models
Margaret Mitchell | Giuseppe Attanasio | Ioana Baldini | Miruna Clinciu | Jordan Clive | Pieter Delobelle | Manan Dey | Sil Hamilton | Timm Dill | Jad Doughman | Ritam Dutt | Avijit Ghosh | Jessica Zosa Forde | Carolin Holtermann | Lucie-Aimée Kaffee | Tanmay Laud | Anne Lauscher | Roberto L Lopez-Davila | Maraim Masoud | Nikita Nangia | Anaelia Ovalle | Giada Pistilli | Dragomir Radev | Beatrice Savoldi | Vipul Raheja | Jeremy Qin | Esther Ploeger | Arjun Subramonian | Kaustubh Dhole | Kaiser Sun | Amirbek Djanibekov | Jonibek Mansurov | Kayo Yin | Emilio Villa Cueva | Sagnik Mukherjee | Jerry Huang | Xudong Shen | Jay Gala | Hamdan Al-Ali | Tair Djanibekov | Nurdaulet Mukhituly | Shangrui Nie | Shanya Sharma | Karolina Stanczak | Eliza Szczechla | Tiago Timponi Torrent | Deepak Tunuguntla | Marcelo Viridiano | Oskar Van Der Wal | Adina Yakefu | Aurélie Névéol | Mike Zhang | Sydney Zink | Zeerak Talat
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have been concentrated around English, lagging the rapid advancement of LLMs in multilingual settings. In this paper, we introduce a new multilingual parallel dataset SHADES to help address this issue, designed for examining culturally-specific stereotypes that may be learned by LLMs. The dataset includes stereotypes from 20 regions around the world and 16 languages, spanning multiple identity categories subject to discrimination worldwide. We demonstrate its utility in a series of exploratory evaluations for both “base” and “instruction-tuned” language models. Our results suggest that stereotypes are consistently reflected across models and languages, with some languages and models indicating much stronger stereotype biases than others.

2024

pdf bib
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann | Paul Röttger | Timm Dill | Anne Lauscher
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages.Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use.For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.

pdf bib
Sparks of Fairness: Preliminary Evidence of Commercial Machine Translation as English-to-German Gender-Fair Dictionaries
Manuel Lardelli | Timm Dill | Giuseppe Attanasio | Anne Lauscher
Proceedings of the 2nd International Workshop on Gender-Inclusive Translation Technologies

Bilingual dictionaries are bedrock components for several language tasks, including translation. However, dictionaries are traditionally fixed in time, thus excluding those neologisms and neo-morphemes that challenge the language’s nominal morphology. The need for a more dynamic, mutable alternative makes machine translation (MT) systems become an extremely valuable avenue. This paper investigates whether commercial MT can be used as bilingual dictionaries for gender-neutral translation. We focus on the English-to-German pair, where notional gender in the source requires gender inflection in the target. We translated 115 person-referring terms using Google Translate, Microsoft Bing, and DeepL and discovered that while each system is heavily biased towards the masculine gender, DeepL often provides gender-fair alternatives to users, especially with plurals.