Ryan A. Cook


2025

No Simple Answer to Data Complexity: An Examination of Instance-Level Complexity Metrics for Classification Tasks
Ryan A. Cook | John P. Lalor | Ahmed Abbasi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Natural Language Processing research has become increasingly concerned with understanding data quality and complexity at the instance level. Instance-level complexity scores can be used for tasks such as filtering out noisy observations and subsampling informative examples. However, there exists a diverse taxonomy of complexity metrics that can be applied to a classification task, making metric selection itself difficult. We empirically examine the relationships between these metrics and find that simply storing training loss yields complexity rankings similar to those of other, more computationally intensive techniques. This similarity allows us to subsample data with higher aggregate complexity across several metrics using a single meta-feature that is available a priori. Further, the choice of complexity metric does not impact demographic fairness, even in downstream predictions. Researchers should consider metric availability and similarity, as using the wrong metric or sampling strategy may hurt performance.
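
The abstract's core observation, that recorded training loss ranks instances much like costlier complexity metrics, can be sketched roughly as follows. This is a minimal illustration, not the paper's code: it assumes a PyTorch classifier, a DataLoader whose dataset also yields each example's index, and a hypothetical dictionary of scores from some other complexity metric.

```python
# Sketch: per-example training loss as a cheap instance-level complexity score,
# compared against another metric by Spearman rank correlation.
# Assumptions (not from the paper): PyTorch model; loader yields (inputs, labels, idx);
# `other_metric_scores` is a hypothetical {example_id: score} dict.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def per_example_losses(model, loader, device="cpu"):
    """Record each training instance's loss (no reduction over the batch)."""
    model.eval()
    losses = {}
    with torch.no_grad():
        for inputs, labels, idx in loader:
            logits = model(inputs.to(device))
            batch_loss = F.cross_entropy(
                logits, labels.to(device), reduction="none"
            )
            for i, loss in zip(idx.tolist(), batch_loss.tolist()):
                losses[i] = loss
    return losses

def rank_agreement(loss_scores, other_metric_scores):
    """Spearman correlation between two complexity rankings over the same ids."""
    ids = sorted(loss_scores)
    rho, _ = spearmanr(
        [loss_scores[i] for i in ids],
        [other_metric_scores[i] for i in ids],
    )
    return rho

def subsample_by_complexity(loss_scores, k):
    """Keep the k highest-loss (highest-complexity) example ids."""
    return sorted(loss_scores, key=loss_scores.get, reverse=True)[:k]
```

Under this reading, a high rank correlation would let a practitioner subsample high-complexity data from the loss dictionary alone, without computing the more expensive metrics.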