How Many Samples Do We Need? A Toolkit for Power-Aware Evaluation Design

Angelo Basile, Areg Mikael Sarvazyan, José Ángel González


Abstract
If datasets are the telescopes of our field, then statistical power is their resolution, i.e., their ability to reveal a true difference in model performance when one exists. Many NLP evaluations are underpowered, leading to overstated claims of improvement. This paper introduces sk-power, an open-source Python library that helps researchers and practitioners design well-powered evaluations. Built with familiar scikit-learn-style abstractions, sk-power enables users to simulate evaluation scenarios, estimate minimum detectable effects, and assess the reliability of reported gains. We also illustrate what can go wrong when power analysis is not carried out. Our goal is to position power analysis as a first-class, practical step in evaluation planning.
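The library itself is not documented on this page, so the following is only a minimal sketch of the kind of analysis the abstract describes: a Monte Carlo estimate of statistical power for a paired comparison of two classifiers, written in plain NumPy/SciPy. It does not use sk-power's API; every name and default below (simulate_power, acc_a, delta, and so on) is hypothetical.

# A minimal sketch of simulation-based power analysis for a paired
# comparison of two classifiers. This is NOT the sk-power API; it only
# illustrates the underlying idea. All names and defaults are hypothetical.
import numpy as np
from scipy import stats

def simulate_power(n, acc_a=0.80, delta=0.02, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the probability that a paired test on n examples
    detects a true accuracy gain of `delta` for system B over A."""
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_sims):
        # Per-example correctness of two systems on the same test set.
        # Independence between systems is a simplifying assumption; real
        # systems tend to be positively correlated, which raises power.
        correct_a = rng.random(n) < acc_a
        correct_b = rng.random(n) < acc_a + delta
        # McNemar-style paired comparison: an exact one-sided binomial
        # test on the discordant pairs.
        b_only = np.sum(~correct_a & correct_b)  # B right, A wrong
        a_only = np.sum(correct_a & ~correct_b)  # A right, B wrong
        n_disc = a_only + b_only
        if n_disc == 0:
            continue
        p = stats.binomtest(b_only, n_disc, 0.5, alternative="greater").pvalue
        detections += p < alpha
    return detections / n_sims

for n in (500, 2000, 8000):
    print(f"n={n}: power = {simulate_power(n):.2f}")

Under these assumptions, power grows steadily with n: a 2-point accuracy gain that is reliably detectable on several thousand examples is easily missed on a few hundred, which is precisely the underpowered setting the abstract warns about.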
Anthology ID: 2026.lrec-main.353
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4507–4513
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.353/
Cite (ACL): Angelo Basile, Areg Mikael Sarvazyan, and José Ángel González. 2026. How Many Samples Do We Need? A Toolkit for Power-Aware Evaluation Design. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 4507–4513, Palma de Mallorca, Spain. European Language Resources Association (ELRA).
Cite (Informal): How Many Samples Do We Need? A Toolkit for Power-Aware Evaluation Design (Basile et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.353.pdf