Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour

Ian Berlot-Attwell; Tobias Sesterhenn; Frank Rudzicz; Xujie Si

Abstract

The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools), all without finetuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model; improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Our findings suggest that a serious re-examination of the effectiveness of ICL LLM-based library learning is required, as are much stronger standards for evaluation. An equal computational budget must be used for baselines, alongside behavioural analysis.

Anthology ID: 2026.eacl-long.163
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 3534–3568
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.163/
Cite (ACL): Ian Berlot-Attwell, Tobias Sesterhenn, Frank Rudzicz, and Xujie Si. 2026. Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3534–3568, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour (Berlot-Attwell et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.163.pdf
