Timothee Mickus

2025

Your Model is Overconfident, and Other Lies We Tell Ourselves
Timothee Mickus | Aman Sinha | Raúl Vázquez
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.

Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)
Timothée Bernard | Timothee Mickus
Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)

Adapting Definition Modeling for New Languages: A Case Study on Belarusian
Daniela Kazakouskaya | Timothee Mickus | Janine Siewert
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

Definition modeling, the task of generating new definitions for words in context, holds great promise as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done in order to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting a definition modeling system requires minimal amounts of data, but that there currently are gaps in what automatic metrics capture.

Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Marek Kadlčík | Michal Štefánik | Timothee Mickus | Josef Kuchař | Michal Spiegel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models’ representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that, after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the embeddings’ precision, as judged by our probe’s accuracy, explains a large portion of LMs’ errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.

Can Out-of-Distribution Evaluations Uncover Reliance on Prediction Shortcuts? A Case Study in Question Answering
Michal Štefánik | Timothee Mickus | Michal Spiegel | Marek Kadlčík | Josef Kuchař
Findings of the Association for Computational Linguistics: EMNLP 2025

A large body of recent work assesses models’ generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models’ robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset’s quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.

We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.

2024

Isotropy, Clusters, and Classifiers
Timothee Mickus | Stig-Arne Grönroos | Joseph Attieh
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Whether embedding spaces use all their dimensions equally, i.e., whether they are isotropic, has been a recent subject of discussion. Evidence has been accrued both for and against enforcing isotropy in embedding spaces. In the present paper, we stress that isotropy imposes requirements on the embedding space that are not compatible with the presence of clusters—which also negatively impacts linear classification objectives. We demonstrate this fact both empirically and mathematically and use it to shed light on previous results from the literature.

Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?
Aman Sinha | Timothee Mickus | Marianne Clausel | Mathieu Constant | Xavier Coubez
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission-critical settings such as biomedical applications, other aspects also factor in—chief among which is a model’s ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model’s output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.

NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend is toward modularization, a necessary step in the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information. The toolkit is publicly available online at https://github.com/Helsinki-NLP/mammoth.

A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
Zihao Li | Shaoxiong Ji | Timothee Mickus | Vincent Segonne | Jörg Tiedemann
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at https://github.com/Helsinki-NLP/lm-vs-mt.

I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures
Timothee Mickus | Raul Vazquez | Joseph Attieh
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

Modularity is a paradigm of machine translation with the potential of bringing forth models that are large at training time and small during inference. Within this field of study, modular approaches, and in particular attention bridges, have been argued to improve the generalization capabilities of models by fostering language-independent representations. In the present paper, we study whether modularity affects translation quality; as well as how well modular architectures generalize across different evaluation scenarios. For a given computational budget, we find non-modular architectures to be always comparable or preferable to all modular designs we study.

AXOLOTL’24 Shared Task on Multilingual Explainable Semantic Change Modeling
Mariia Fedorova | Timothee Mickus | Niko Partanen | Janine Siewert | Elena Spaziani | Andrey Kutuzov
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?
Shaoxiong Ji | Timothee Mickus | Vincent Segonne | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multilingual pretraining and fine-tuning have remarkably succeeded in various natural language processing tasks. Transferring representations from one language to another is especially crucial for cross-lingual learning. One can expect machine translation objectives to be well suited to fostering such capabilities, as they involve the explicit alignment of semantically equivalent sentences from different languages. This paper investigates the potential benefits of employing machine translation as a continued training objective to enhance language representation learning, bridging multilingual pretraining and cross-lingual applications. We study this question through two lenses: a quantitative evaluation of the performance of existing models and an analysis of their latent representations. Our results show that, contrary to expectations, machine translation as a continued training objective fails to enhance cross-lingual representation learning in multiple cross-lingual natural language understanding tasks. We conclude that explicit sentence-level alignment in the cross-lingual scenario is detrimental to cross-lingual transfer pretraining, which has important implications for future cross-lingual transfer studies. We furthermore provide evidence through similarity measures and investigation of parameters that this lack of positive influence is due to output separability—which we argue is of use for machine translation but detrimental elsewhere.

Proceedings of the 1st Workshop on Modular and Open Multilingual NLP (MOOMIN 2024)
Raúl Vázquez | Timothee Mickus | Jörg Tiedemann | Ivan Vulić | Ahmet Üstün
Proceedings of the 1st Workshop on Modular and Open Multilingual NLP (MOOMIN 2024)

Language Models and the Paradigmatic Axis
Timothee Mickus
Proceedings of the Society for Computation in Linguistics 2024

Stranger than Paradigms Word Embedding Benchmarks Don’t Align With Morphology
Timothee Mickus | Maria Copot
Proceedings of the Society for Computation in Linguistics 2024

This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 26 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled—many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.

The Emergence of High-Level Semantics in a Signaling Game
Timothée Bernard | Timothee Mickus | Hiroya Takamura
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

The symbol grounding problem—how to connect a symbolic system to the outer world—is a longstanding question in AI that has recently gained prominence with the progress made in NLP in general and surrounding large language models in particular. In this article, we study the emergence of semantic categories in the communication protocol developed by neural agents involved in a well-established type of signaling game. In its basic form, the game requires one agent to retrieve an image based on a message produced by a second agent. We first show that the agents are able to, and do, learn to communicate high-level semantic concepts rather than low-level features of the images, even from a very indirect training signal to that end. Second, we demonstrate that the introduction of an adversarial agent in the game fosters the emergence of semantics by producing an appropriate training signal when no other method is available.

2023

Why Bother with Geometry? On the Relevance of Linear Decompositions of Transformer Embeddings
Timothee Mickus | Raúl Vázquez
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

A recent body of work has demonstrated that Transformer embeddings can be linearly decomposed into well-defined sums of factors that can in turn be related to specific network inputs or components. There is however still a dearth of work studying whether these mathematical reformulations are empirically meaningful. In the present work, we study representations from machine-translation decoders using two such embedding decomposition methods. Our results indicate that, while decomposition-derived indicators effectively correlate with model performance, variation across different runs suggests a more nuanced take on this question. The high variability of our measurements indicates that geometry reflects model-specific characteristics more than it does sentence-specific computations, and that similar training conditions do not guarantee similar vector spaces.

So many design choices: Improving and interpreting neural agent communication in signaling games
Timothée Bernard | Timothee Mickus
Findings of the Association for Computational Linguistics: ACL 2023

Emergent language games are experimental protocols designed to model how communication may arise among a group of agents. In this paper, we focus on how to improve performances of neural agents playing a signaling game: a sender is exposed to an image and generates a sequence of symbols that is transmitted to a receiver, which uses it to distinguish between two images, one that is semantically related to the original image, and one that is not. We consider multiple design choices, such as pretraining the visual components of the agents, introducing regularization terms, how to sample training items from the dataset, and we study how these different choices impact the behavior and performances of the agents. To that end, we introduce a number of automated metrics to measure the properties of the emergent language. We find that some implementation choices are always beneficial, and that the information that is conveyed by the agents’ messages is shaped not only by the game, but also by the overall design of the agents as well as seemingly unrelated implementation choices.

Grounded and well-rounded: a methodological approach to the study of cross-modal and cross-lingual grounding
Timothee Mickus | Elaine Zosa | Denis Paperno
Findings of the Association for Computational Linguistics: EMNLP 2023

Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying what the effects are—if any—of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.

Definition Modeling : To model definitions. Generating Definitions With Little to No Semantics
Vincent Segonne | Timothee Mickus
Proceedings of the 15th International Conference on Computational Semantics

Definition Modeling, the task of generating definitions, was first proposed as a means to evaluate the semantic quality of word embeddings—a coherent lexical semantic representation of a word in context should contain all the information necessary to generate its definition. The relative novelty of this task entails that we do not know which factors are actually relied upon by a Definition Modeling system. In this paper, we present evidence that the task may not involve as much semantics as one might expect: we show how an earlier model from the literature is both rather insensitive to semantic aspects such as explicit polysemy, as well as reliant on formal similarities between headwords and words occurring in its glosses, casting doubt on the validity of the task as a means to evaluate embeddings.

Dozens of Translation Directions or Millions of Shared Parameters? Comparing Two Types of Multilinguality in Modular Machine Translation
Michele Boggia | Stig-Arne Grönroos | Niki Loppi | Timothee Mickus | Alessandro Raganato | Jörg Tiedemann | Raúl Vázquez
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

There are several ways of implementing multilingual NLP systems but little consensus as to whether different approaches exhibit similar effects. Are the trends that we observe when adding more languages the same as those we observe when sharing more parameters? We focus on encoder representations drawn from modular multilingual machine translation systems in an English-centric scenario, and study their quality from multiple aspects: how adequate they are for machine translation, how independent of the source language they are, and what semantic information they convey. Adding translation directions in English-centric scenarios does not conclusively lead to an increase in translation quality. Shared layers increase performance on zero-shot translation pairs and lead to more language-independent representations, but these improvements do not systematically align with more semantically accurate representations, from a monolingual standpoint.

„Mann“ is to “Donna” as「国王」is to « Reine » Adapting the Analogy Task for Multilingual and Contextual Embeddings
Timothee Mickus | Eduardo Calò | Léo Jacqmin | Denis Paperno | Mathieu Constant
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

How does the word analogy task fit in the modern NLP landscape? Given the rarity of comparable multilingual benchmarks and the lack of a consensual evaluation protocol for contextual models, this remains an open question. In this paper, we introduce MATS: a multilingual analogy dataset, covering forty analogical relations in six languages, and evaluate human as well as static and contextual embedding performances on the task. We find that not all analogical relations are equally straightforward for humans, static models remain competitive with contextual embeddings, and optimal settings vary across languages and analogical relations. Several key challenges remain, including creating benchmarks that align with human reasoning and understanding what drives differences across methodologies.

2022

Semeval-2022 Task 1: CODWOE – Comparing Dictionaries and Word Embeddings
Timothee Mickus | Kees Van Deemter | Mathieu Constant | Denis Paperno
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
Timothee Mickus | Denis Paperno | Mathieu Constant
Transactions of the Association for Computational Linguistics, Volume 10

Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.

2020

What Meaning-Form Correlation Has to Compose With: A Study of MFC on Artificial and Natural Language
Timothee Mickus | Timothée Bernard | Denis Paperno
Proceedings of the 28th International Conference on Computational Linguistics

Compositionality is a widely discussed property of natural languages, although its exact definition has been elusive. We focus on the proposal that compositionality can be assessed by measuring meaning-form correlation. We analyze meaning-form correlation on three sets of languages: (i) artificial toy languages tailored to be compositional, (ii) a set of English dictionary definitions, and (iii) a set of English sentences drawn from literature. We find that linguistic phenomena such as synonymy and ungrounded stop-words weigh on MFC measurements, and that straightforward methods to mitigate their effects have widely varying results depending on the dataset they are applied to. Data and code are made publicly available.

Génération automatique de définitions pour le français (Definition Modeling in French)
Timothee Mickus | Mathieu Constant | Denis Paperno
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Definition generation is a recent task that aims to produce lexicographic definitions from word embeddings. We note two gaps: (i) the current state of the art has only considered English and Chinese, and (ii) the intended use as an evaluation method for word embeddings has yet to be verified. To address these, we propose a dataset for definition generation in French, along with an evaluation of the performance of a simple definition generation model depending on the word embeddings provided as input.

What do you mean, BERT?
Timothee Mickus | Denis Paperno | Mathieu Constant | Kees van Deemter
Proceedings of the Society for Computation in Linguistics 2020

2019

Distributional Effects of Gender Contrasts Across Categories
Timothee Mickus | Olivier Bonami | Denis Paperno
Proceedings of the Society for Computation in Linguistics (SCiL) 2019

Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling
Timothee Mickus | Denis Paperno | Matthieu Constant
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows contextualization and definition generation to be trained in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results both in contextual and non-contextual definition modeling.