2025
Mixtures of In-Context Learners
Giwon Hong | Emile Van Krieken | Edoardo Ponti | Nikolay Malkin | Pasquale Minervini
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it is highly sensitive to the choice of in-context demonstrations, and processing many demonstrations can be computationally demanding. We propose Mixtures of In-Context Learners (MoICL), a novel approach that uses subsets of demonstrations to train a set of experts via ICL and learns a weighting function to merge their output distributions via gradient-based optimisation. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (e.g., up to +13% compared to ICL and LENS). Moreover, we improve the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%), and perturbed (up to +38%) demonstrations.
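To make the weighting step concrete, here is a minimal sketch of the mixture idea described in the abstract, assuming each expert's label distribution has already been obtained by prompting the same LLM with a different demonstration subset. The function name and tensor layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the MoICL mixture step (hypothetical names and shapes).
# Each "expert" k is the same LLM prompted with a different subset of
# demonstrations; its output distribution over class labels is precomputed.
import torch

def train_mixture_weights(expert_probs, labels, steps=200, lr=0.1):
    """expert_probs: [num_experts, num_examples, num_classes] tensor of
    per-expert label distributions; labels: [num_examples] gold label ids."""
    logits = torch.zeros(expert_probs.shape[0], requires_grad=True)  # one scalar per expert
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits, dim=0)                  # mixture weights, sum to 1
        mix = torch.einsum("k,kec->ec", w, expert_probs)  # weighted sum of expert distributions
        loss = torch.nn.functional.nll_loss(torch.log(mix + 1e-9), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=0)
```

Because only the per-expert weights are optimised, this step is cheap and the LLM itself stays frozen, consistent with the abstract's framing of ICL without parameter fine-tuning.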
Self-Training Large Language Models for Tool-Use Without Demonstrations
Ne Luo | Aryo Pradipta Gema | Xuanli He | Emile Van Krieken | Pietro Lesci | Pasquale Minervini
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work has augmented LLMs with tools to mitigate these shortcomings, but such approaches often require curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed from existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool use enhances performance on a long-tail knowledge task, with a 3.7% improvement on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
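As an illustration of the self-training recipe, the following hedged sketch synthesises tool-use traces and keeps only those whose final answer matches the gold label. The function names (generate, extract_answer) are hypothetical placeholders; the paper's exact sampling, filtering, and fine-tuning pipeline may differ.

```python
# Hedged sketch of self-training for tool use (hypothetical helper names).
# Idea: let the LLM generate tool-use traces for QA pairs, keep only traces
# whose final answer matches the gold answer, then fine-tune on the survivors.

def synthesise_tool_traces(model, qa_pairs, generate, extract_answer):
    """model: any callable LLM wrapper; qa_pairs: list of (question, gold_answer);
    generate(model, q) -> a candidate trace containing tool calls;
    extract_answer(trace) -> the trace's final answer string."""
    accepted = []
    for question, gold in qa_pairs:
        trace = generate(model, question)      # sample a tool-use trajectory
        if extract_answer(trace) == gold:      # self-filter: keep correct traces only
            accepted.append({"prompt": question, "completion": trace})
    return accepted  # training data for supervised or preference fine-tuning
```

The accepted traces can then serve as targets for supervised fine-tuning, or as the preferred side of pairs for preference fine-tuning, matching the two techniques compared in the abstract.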
Are We Done with MMLU?
Aryo Pradipta Gema | Joshua Ong Jun Leang | Giwon Hong | Alessio Devoto | Alberto Carlo Maria Mancino | Rohit Saxena | Xuanli He | Yu Zhao | Xiaotang Du | Mohammad Reza Ghasemi Madani | Claire Barale | Robert McHardy | Joshua Harris | Jean Kaddour | Emile Van Krieken | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground-truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. We then create MMLU-Redux, a subset of 5,700 manually re-annotated questions across all 57 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly support revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark; to that end, we open MMLU-Redux for additional annotation.
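As a rough illustration of how re-annotation changes reported numbers, the sketch below re-scores a model on MMLU-Redux after discarding questions flagged by the error annotation protocol. The field names (id, error_type, answer) are assumptions for illustration, not the released dataset's documented schema.

```python
# Illustrative re-scoring on MMLU-Redux (hypothetical field names).
# Each re-annotated item carries an error label; items without errors are
# kept and scored against the (possibly corrected) gold answer.

def redux_accuracy(items, predictions):
    """items: list of dicts with 'id', 'error_type', and 'answer' keys;
    predictions: dict mapping a question id to the model's chosen option."""
    clean = [it for it in items if it["error_type"] == "ok"]  # drop flagged questions
    correct = sum(predictions[it["id"]] == it["answer"] for it in clean)
    return correct / len(clean) if clean else 0.0
```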