Loris Schoenegger

2026

Compact Example-Based Explanations for Language Models
Loris Schoenegger | Benjamin Roth
Findings of the Association for Computational Linguistics: ACL 2026

Training data influence estimation methods quantify the contribution of training documents to a model’s output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel *selection relevance score*, a retraining-free metric that quantifies how useful a set of examples is for explaining a model’s output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model’s predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.

2025

Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger | Lukas Thoma | Terra Blevins | Benjamin Roth
Proceedings of the First BabyLM Workshop

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model’s output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.

RecombiText: Compositional Data Augmentation for Enhancing LLM Pre-Training Datasets in Low-Resource Scenarios
Alexander Tampier | Lukas Thoma | Loris Schoenegger | Benjamin Roth
Proceedings of the First BabyLM Workshop

We introduce RecombiText Augmentation (RTA), a novel purely statistical NLP method for compositional data augmentation for data-efficient LLM pre-training in low-resource scenarios. RTA identifies lexically and semantically similar sentences within the corpus and generates synthetic sentence pairs from them while preserving underlying patterns from the corpus. We pre-train GPT-2 and RoBERTa language models on a domain-specific, low-resource corpus of 10 million words, with different proportions of augmented data. We compare our RTA-augmented model variants to a baseline model trained on the full original dataset. Zero-shot results show that the language models pre-trained on synthetic data improve in entity tracking, self-paced reading, and morphological generalization benchmarks. In other tasks, the performance is comparable to the baseline model. We demonstrate that it is possible to expand low-resource datasets by two- to four-fold without compromising benchmark performance, solely through statistical processing of the available data.

Venues

BabyLM (2)
Findings (1)