Troy Chen


2025

In hybrid scoring systems, confidence thresholds determine which responses receive human review. This study evaluates a relative (within-batch) thresholding method against an absolute-threshold benchmark across ten items. The two methods show near-perfect agreement, with only modest distributional differences, which supports the relative method’s validity as a scalable, operationally viable approach for flagging low-confidence responses.
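As a rough illustration of the comparison described above, the following sketch contrasts the two flagging rules on a single batch. It is not the study's implementation: the function names, the cutoff of 0.60, the flagged fraction of 0.30, and the confidence scores are all hypothetical, and the relative rule is assumed here to mean "flag the lowest-confidence share of each batch", which the abstract does not specify.

```python
from typing import List


def flag_absolute(confidences: List[float], cutoff: float) -> List[bool]:
    """Flag every response whose confidence falls below a fixed cutoff."""
    return [c < cutoff for c in confidences]


def flag_relative(confidences: List[float], fraction: float) -> List[bool]:
    """Flag the lowest-confidence `fraction` of responses within one batch.

    Assumption: "relative" is read as a within-batch rank rule.
    """
    n = len(confidences)
    n_flag = round(n * fraction)
    # Rank response indices from least to most confident.
    ranked = sorted(range(n), key=lambda i: confidences[i])
    flagged = [False] * n
    for i in ranked[:n_flag]:
        flagged[i] = True
    return flagged


def agreement(a: List[bool], b: List[bool]) -> float:
    """Share of responses on which two flagging rules make the same decision."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical confidence scores for one batch (illustrative values only).
batch = [0.95, 0.40, 0.88, 0.62, 0.99, 0.55, 0.91, 0.73, 0.35, 0.81]

abs_flags = flag_absolute(batch, cutoff=0.60)
rel_flags = flag_relative(batch, fraction=0.30)
print(agreement(abs_flags, rel_flags))
```

Under this reading, the relative rule always flags the same share of a batch, whereas the absolute rule flags a number that depends on the scores themselves, so the agreement between them can vary with batch composition.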