Thomas Bailleux

2026

Credal Concept Bottleneck Models for Epistemic–Aleatoric Uncertainty Decomposition
Tanmoy Mukherjee | Thomas Bailleux | Pierre Marquis | Zied Bouraoui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Concept Bottleneck Models (CBMs) predict through human-interpretable concepts, but they typically output point concept probabilities that conflate epistemic uncertainty (reducible model underspecification) with aleatoric uncertainty (irreducible input ambiguity). This makes concept-level uncertainty hard to interpret and, more importantly, hard to act upon. We introduce (Credal Ensemble Concept Estimation), a CBM framework that decomposes concept uncertainty by construction. represents each concept as a credal prediction (a probability interval), derives epistemic uncertainty from disagreement across diverse concept heads, and estimates aleatoric uncertainty via a dedicated ambiguity output trained to match annotator disagreement when available. The resulting signals support prescriptive decisions: automate low-uncertainty cases, prioritize data collection for high-epistemic cases, route high-aleatoric cases to human review, and abstain when both are high. Across several tasks, we show that epistemic uncertainty is positively associated with prediction errors, whereas aleatoric uncertainty closely tracks annotator disagreement, providing guidance beyond error correlation. Our implementation is available at the following link: https://github.com/Tankiit/Credal_Sets/tree/ensemble-credal-cbm

pdf bib abs

Explanation Quality Assessment as Ranking with Listwise Rewards
Thomas Bailleux | Tanmoy Mukherjee | Emmanuel Lonca | Pierre Marquis | Zied Bouraoui
Findings of the Association for Computational Linguistics: ACL 2026

We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single “best” explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank

2025

pdf bib abs

Grouping Entities with Shared Properties using Multi-Facet Prompting and Property Embeddings
Amit Gajbhiye | Thomas Bailleux | Zied Bouraoui | Luis Espinosa-Anke | Steven Schockaert
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Methods for learning taxonomies from data have been widely studied. We study a specific version of this task, called commonality identification, where only the set of entities is given and we need to find meaningful ways to group those entities. While LLMs should intuitively excel at this task, it is difficult to directly use such models in large domains. In this paper, we instead use LLMs to describe the different properties that are satisfied by each of the entities individually. We then use pre-trained embeddings to cluster these properties, and finally group entities that have properties which belong to the same cluster. To achieve good results, it is paramount that the properties predicted by the LLM are sufficiently diverse. We find that this diversity can be improved by prompting the LLM to structure the predicted properties into different facets of knowledge.

pdf bib abs

Connecting Concept Layers and Rationales to Enhance Language Model Interpretability
Thomas Bailleux | Tanmoy Mukherjee | Pierre Marquis | Zied Bouraoui
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

With the introduction of large language models, NLP has undergone a paradigm shift where these models now serve as the backbone of most developed systems. However, while highly effective, they remain opaque and difficult to interpret, which limits their adoption in critical applications that require transparency and trust. Two major approaches aim to address this: rationale extraction, which highlights input spans that justify predictions, and concept bottleneck models, which make decisions through human-interpretable concepts. Yet each has limitations. Crucially, current models lack a unified framework that connects where a model looks (rationales) with why it makes a decision (concepts). We introduce CLARITY, a model that first selects key input spans, maps them to interpretable concepts, and then predicts using only those concepts. This design supports faithful, multi-level explanations and allows users to intervene at both the rationale and concept levels. CLARITY, achieves competitive accuracy while offering improved transparency and controllability.

2024

pdf bib abs

CONTOR: Benchmarking Strategies for Completing Ontologies with Plausible Missing Rules
Na Li | Thomas Bailleux | Zied Bouraoui | Steven Schockaert
Findings of the Association for Computational Linguistics: EMNLP 2024

We consider the problem of finding plausible rules that are missing from a given ontology. A number of strategies for this problem have already been considered in the literature. Little is known about the relative performance of these strategies, however, as they have thus far been evaluated on different ontologies. Moreover, existing evaluations have focused on distinguishing held-out ontology rules from randomly corrupted ones, which often makes the task unrealistically easy and leads to the presence of incorrectly labelled negative examples. To address these concerns, we introduce a benchmark with manually annotated hard negatives and use this benchmark to evaluate ontology completion models. In addition to previously proposed models, we test the effectiveness of several approaches that have not yet been considered for this task, including LLMs and simple but effective hybrid strategies.

Co-authors

Amit Gajbhiye 1

Na Li 1

Emmanuel Lonca 1

Venues

Fix author