Abstract
Computational linguistics models commonly target the prediction of discrete (categorical) labels. When assessing how well-calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate the true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignments. In this paper, we show that although both approaches address the binning problem by evaluating instance-level calibration, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal that dataset creators adopt scalar annotation protocols to enable a higher-quality assessment of model calibration.
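The binning sensitivity described above can be illustrated with a minimal, self-contained sketch (not taken from the paper): a standard equal-width-bin expected calibration error (ECE) computed on the same simulated predictions shifts as the bin count alone changes. The simulated data, bin counts, and the `ece_equal_width` helper are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch: binned ECE under different equal-width binning schemes.
# The simulated confidences/outcomes below are assumptions for illustration only.
import numpy as np

def ece_equal_width(confidences, correct, n_bins):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)                   # simulated model confidences
correct = (rng.uniform(size=2000) < conf).astype(float)   # simulated correctness outcomes

# The reported calibration error moves with the bin count, even though the
# underlying predictions are unchanged.
for n_bins in (5, 10, 15, 20):
    print(f"{n_bins:2d} bins -> ECE {ece_equal_width(conf, correct, n_bins):.4f}")
```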
- Anthology ID: 2024.tacl-1.7
- Volume: Transactions of the Association for Computational Linguistics, Volume 12
- Year: 2024
- Address: Cambridge, MA
- Venue: TACL
- Publisher: MIT Press
- Pages: 120–136
- URL: https://aclanthology.org/2024.tacl-1.7
- DOI: 10.1162/tacl_a_00636
- Cite (ACL): Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. 2024. Addressing the Binning Problem in Calibration Assessment through Scalar Annotations. Transactions of the Association for Computational Linguistics, 12:120–136.
- Cite (Informal): Addressing the Binning Problem in Calibration Assessment through Scalar Annotations (Jiang et al., TACL 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.tacl-1.7.pdf