Pieter Delobelle


2024

Metrics for What, Metrics for Whom: Assessing Actionability of Bias Evaluation Metrics in NLP
Pieter Delobelle | Giuseppe Attanasio | Debora Nozza | Su Lin Blodgett | Zeerak Talat
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper introduces the concept of actionability in the context of bias measures in natural language processing (NLP). We define actionability as the degree to which a measure’s results enable informed action and propose a set of desiderata for assessing it. Building on existing frameworks such as measurement modeling, we argue that actionability is a crucial aspect of bias measures that has been largely overlooked in the literature. We conduct a comprehensive review of 146 papers proposing bias measures in NLP, examining whether and how they provide the information required for actionable results. Our findings reveal that many key elements of actionability, including a measure’s intended use and reliability assessment, are often unclear or entirely absent. This study highlights a significant gap in the current approach to developing and reporting bias measures in NLP. We argue that this lack of clarity may impede the effective implementation and utilization of these measures. To address this issue, we offer recommendations for more comprehensive and actionable metric development and reporting practices in NLP bias research.

OneLove beyond the field - A few-shot pipeline for topic and sentiment analysis during the FIFA World Cup in Qatar
Christoph Rauchegger | Sonja Mei Wang | Pieter Delobelle
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

BPE-knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision
Thomas Bauwens | Pieter Delobelle
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Byte-pair encoding (BPE) has become the default subword tokeniser in language models (LMs), allowing the representation of an infinite space of text with a finite set of units. Yet, BPE training is unsupervised, receiving no explicit information about a language’s morphology. This results in a subword vocabulary wherein many units are concatenations of partial morphemes, preventing full morphemes from being formed as tokens. This, in turn, causes consistent intra-word patterns to be displayed inconsistently to downstream models, and bloats the vocabulary, hence requiring unnecessary embedding storage. In this paper, we address this issue by identifying blameworthy BPE merges and removing the resulting subwords from the BPE vocabulary, without impeding further use of merges that relied on them. We find that our method, BPE-knockout, is effective at making BPE’s segmentation positions adhere better to derivational and compound boundaries in English, Dutch and German, and improves token-based tasks in Dutch RoBERTa models, indicating that a tokeniser’s adherence to morphology impacts downstream models. We demonstrate the latter not only by training LMs from scratch, but also by continuing the pre-training of existing LMs. This proves promising, showing that suboptimal tokenisers can be remedied whilst salvaging the training cost of downstream LMs.
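
To make the knockout mechanism concrete, here is a minimal, illustrative Python sketch (not the authors' released implementation) under a simplifying assumption: a blameworthy merge has its resulting unit marked as knocked out, later merges that built on that unit still apply, and any knocked-out unit remaining in the final segmentation is re-expanded into its two parents.

```python
# Illustrative sketch of the BPE-knockout idea; `merges` is an ordered list of
# (left, right) pairs as in a standard BPE merges file, and `knocked_out` maps a
# pruned unit to the pair it was merged from. Both inputs below are hypothetical.
from typing import Dict, List, Tuple

def segment(chars: List[str], merges: List[Tuple[str, str]],
            knocked_out: Dict[str, Tuple[str, str]]) -> List[str]:
    """Apply BPE merges in priority order, then expand knocked-out units."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(chars)
    while True:
        # Find the highest-priority adjacent pair that can still be merged.
        best, best_rank = None, None
        for i in range(len(tokens) - 1):
            r = ranks.get((tokens[i], tokens[i + 1]))
            if r is not None and (best_rank is None or r < best_rank):
                best, best_rank = i, r
        if best is None:
            break
        tokens[best:best + 2] = [tokens[best] + tokens[best + 1]]

    def expand(tok: str) -> List[str]:
        # A knocked-out unit is no longer a valid token: split it back into its parents.
        if tok in knocked_out:
            left, right = knocked_out[tok]
            return expand(left) + expand(right)
        return [tok]

    return [t for tok in tokens for t in expand(tok)]

if __name__ == "__main__":
    merges = [("a", "b"), ("ab", "c")]              # BPE learned "ab", then "abc"
    knocked = {"ab": ("a", "b")}                    # "ab" deemed blameworthy (hypothetical)
    print(segment(list("abd"), merges, knocked))    # ['a', 'b', 'd']: "ab" is expanded
    print(segment(list("abc"), merges, knocked))    # ['abc']: the later merge still applies
```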

2023

How Far Can It Go? On Intrinsic Gender Bias Mitigation for Text Classification
Ewoenam Kwaku Tokpo | Pieter Delobelle | Bettina Berendt | Toon Calders
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

To mitigate gender bias in contextualized language models, different intrinsic mitigation strategies have been proposed, alongside many bias metrics. Considering that the end use of these language models is for downstream tasks like text classification, it is important to understand how, and to what extent, these intrinsic bias mitigation strategies actually translate to fairness in downstream tasks. In this work, we design a probe to investigate the effects that some of the major intrinsic gender bias mitigation strategies have on downstream text classification tasks. We discover that instead of resolving gender bias, intrinsic mitigation techniques and metrics are able to hide it in such a way that significant gender information is retained in the embeddings. Furthermore, we show that each mitigation technique is able to hide the bias from some of the intrinsic bias measures but not all, and each intrinsic bias measure can be fooled by some mitigation techniques, but not all. We confirm experimentally that none of the intrinsic mitigation techniques, when used without any other fairness intervention, is able to consistently impact extrinsic bias. We recommend that intrinsic bias mitigation techniques should be combined with other fairness interventions for downstream tasks.
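
As an illustration of the style of probe described above (not the paper's exact setup), the sketch below trains a linear classifier to recover gender labels from sentence embeddings produced by a mitigated model; accuracy well above chance would suggest that gender information is still linearly recoverable. All names and data here are placeholders.

```python
# Illustrative probing sketch: `embeddings` is an (n, d) array of sentence
# representations from a "debiased" model and `gender_labels` the corresponding
# binary labels; both are placeholders standing in for real experimental data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def probe_gender_information(embeddings: np.ndarray, gender_labels: np.ndarray) -> float:
    """Return held-out accuracy of a linear probe predicting gender from embeddings."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, gender_labels, test_size=0.3, random_state=0, stratify=gender_labels)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return accuracy_score(y_test, probe.predict(X_test))

if __name__ == "__main__":
    # Random data as a stand-in; a real run would pass embeddings from the mitigated model.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 32))
    y = rng.integers(0, 2, size=200)
    print(f"probe accuracy: {probe_gender_information(X, y):.2f}")  # ~0.5 for random data
```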

2022

Measuring Fairness with Biased Rulers: A Comparative Study on Bias Metrics for Pre-trained Language Models
Pieter Delobelle | Ewoenam Tokpo | Toon Calders | Bettina Berendt
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

An increasing awareness of biased patterns in natural language processing resources such as BERT has motivated many metrics to quantify ‘bias’ and ‘fairness’ in these resources. However, comparing the results of different metrics and the works that evaluate with such metrics remains difficult, if not outright impossible. We survey the literature on fairness metrics for pre-trained language models and experimentally evaluate their compatibility, including both biases in language models and in their downstream tasks. We do this by combining a traditional literature survey, correlation analysis and empirical evaluations. We find that many metrics are not compatible with each other and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings. We also see no tangible evidence of intrinsic bias relating to extrinsic bias. These results indicate that fairness or bias evaluation remains challenging for contextualized language models, among other reasons because these choices remain subjective. To improve future comparisons and fairness evaluations, we recommend avoiding embedding-based metrics and focusing on fairness evaluations in downstream tasks.
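
The sketch below illustrates the flavour of the correlation analysis mentioned above, not the paper's actual pipeline: pairwise rank correlations between the scores that different bias metrics assign to the same set of models. The metric names and scores are invented placeholders.

```python
# Pairwise rank correlation between hypothetical bias-metric scores for the same
# five models; low or unstable correlations indicate the metrics disagree.
from itertools import combinations
from scipy.stats import spearmanr

metric_scores = {
    "metric_A": [0.12, 0.30, 0.25, 0.40, 0.18],  # placeholder scores for 5 models
    "metric_B": [0.50, 0.10, 0.45, 0.20, 0.35],
    "metric_C": [0.11, 0.29, 0.27, 0.41, 0.20],
}

for (name_a, a), (name_b, b) in combinations(metric_scores.items(), 2):
    rho, p = spearmanr(a, b)
    print(f"{name_a} vs {name_b}: Spearman rho = {rho:.2f} (p = {p:.2f})")
```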

2020

RobBERT: a Dutch RoBERTa-based Language Model
Pieter Delobelle | Thomas Winters | Bettina Berendt
Findings of the Association for Computational Linguistics: EMNLP 2020

Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released in both an English and a multilingual version. Although multilingual BERT performs well on many tasks, recent studies show that BERT models trained on a single language significantly outperform the multilingual version. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train a Dutch version of BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We measured its performance on various tasks as well as the importance of the fine-tuning dataset size. We also evaluated the importance of language-specific tokenizers and the model’s fairness. We found that RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets. These results indicate that it is a powerful pre-trained model for a large variety of Dutch language tasks. The pre-trained and fine-tuned models are publicly available to support further downstream Dutch NLP applications.
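
A minimal usage sketch with the Hugging Face transformers library follows; it assumes the released checkpoint is available under the pdelobelle/robbert-v2-dutch-base identifier and that the tokeniser uses RoBERTa's <mask> convention.

```python
# Fill in a masked Dutch token with RobBERT; the model identifier is assumed to
# be the publicly released checkpoint on the Hugging Face hub.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

inputs = tokenizer("Er staat een <mask> in mijn tuin.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the five most likely replacement tokens.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```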

2019

Computational Ad Hominem Detection
Pieter Delobelle | Murilo Cunha | Eric Massip Cano | Jeroen Peperkamp | Bettina Berendt
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Fallacies like the personal attack (also known as the ad hominem attack) are introduced in debates as an easy win, even though they provide no rhetorical contribution. Although their importance in argumentation mining is acknowledged, automated mining and analysis is still lacking. We show that TF-IDF approaches are insufficient for detecting ad hominem attacks. We therefore present a machine learning approach for information extraction, which achieves a recall of 80% on a social media data source. We also demonstrate our approach with an application that uses online learning.
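
For context, the sketch below shows a TF-IDF baseline of the kind the paper reports as insufficient (it is not the authors' pipeline); the example texts and labels are invented placeholders, and a real evaluation would use annotated debate or social-media data.

```python
# TF-IDF + logistic regression baseline for ad hominem detection; the texts and
# labels are small invented placeholders used purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

texts = [
    "Your argument fails because you clearly never finished school.",  # ad hominem
    "The proposal ignores the budget constraints outlined earlier.",   # not ad hominem
    "Only an idiot would defend this policy.",                         # ad hominem
    "The cited study uses a sample that is too small to generalise.",  # not ad hominem
]
labels = [1, 0, 1, 0]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
# In-sample report only, to show the interface; real recall numbers require held-out data.
print(classification_report(labels, baseline.predict(texts)))
```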