Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

Elena Alvarez-Mellado, Julio Gonzalo


Abstract
Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little information on how to improve performance and, what is worse, it should come as no surprise if B ends up being better than A on external data. We propose an evaluation methodology for sequence labeling tasks, grounded in error analysis, that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world, in-distribution scraped data, but instead consist of a small set of handcrafted, linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.
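To make the idea of attribute-based diagnosis concrete, the following is a minimal sketch, not the authors' implementation, of how span-level recall could be broken down by span attribute. The span format `(sentence_id, start, end, text)`, the helper names `casing` and `recall_by_attribute`, the bucket labels, and the toy gold and predicted spans are all assumptions made for illustration; the abstract does not specify the matching criterion, so exact-match recall is assumed here.

```python
from collections import defaultdict


def casing(text):
    """Coarse casing bucket for a span's surface text (hypothetical scheme)."""
    if text.islower():
        return "lower"
    if text.isupper():
        return "upper"
    if text.istitle():
        return "title"
    return "mixed"


def recall_by_attribute(gold_spans, predicted_spans, attribute):
    """Exact-match recall for each value of a span attribute.

    gold_spans / predicted_spans: iterables of (sentence_id, start, end, text)
    tuples with token offsets (end is exclusive).
    attribute: function mapping a span tuple to a bucket label.
    """
    predicted = set(predicted_spans)
    hits, totals = defaultdict(int), defaultdict(int)
    for span in gold_spans:
        bucket = attribute(span)
        totals[bucket] += 1
        if span in predicted:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy data, invented purely for illustration.
gold = [
    (0, 3, 4, "streaming"),
    (1, 0, 1, "Podcast"),
    (2, 5, 7, "fake news"),
    (3, 2, 3, "LOOK"),
]
pred = [
    (0, 3, 4, "streaming"),
    (2, 5, 7, "fake news"),
    (3, 1, 3, "el LOOK"),  # boundary error, so not an exact match
]

# Break recall down by three of the attributes named in the abstract.
by_length = recall_by_attribute(gold, pred, lambda s: s[2] - s[1])
by_casing = recall_by_attribute(gold, pred, lambda s: casing(s[3]))
by_position = recall_by_attribute(
    gold, pred, lambda s: "initial" if s[1] == 0 else "non-initial"
)

print(by_length)
print(by_casing)
print(by_position)
```

A breakdown of this kind shows which attribute values a system handles poorly (in the toy data, every non-lowercase span is missed), which is the sort of systematic weakness that a single averaged score hides.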
Anthology ID:
2026.lrec-main.472
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
5938–5959
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.472/
Cite (ACL):
Elena Alvarez-Mellado and Julio Gonzalo. 2026. Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks. International Conference on Language Resources and Evaluation, main:5938–5959.
Cite (Informal):
Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks (Alvarez-Mellado & Gonzalo, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.472.pdf