BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer


Abstract
Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N (BoN) sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the BoN method does not. Our experiments suggest that the BoN method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.
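The three methods named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each LLM sample is either a probability distribution over labels (for Model Averaging) or a discrete label (for Majority Voting), and the `score` callable passed to `best_of_n` is a hypothetical stand-in for whatever judge or reward model ranks the samples. All function names are illustrative.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def model_averaging(samples: Sequence[Sequence[float]]) -> list[float]:
    """Average N sampled label distributions element-wise (assumed soft-label setup)."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def majority_voting(samples: Sequence[str]) -> str:
    """Return the most frequent discrete prediction; ties go to the first one seen."""
    return Counter(samples).most_common(1)[0][0]


def best_of_n(samples: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the single sample that the (hypothetical) scorer rates highest."""
    return max(samples, key=score)


if __name__ == "__main__":
    # Two sampled distributions over two labels, averaged into one prediction.
    print(model_averaging([[0.25, 0.75], [0.75, 0.25]]))
    # Three sampled hard labels, reduced by vote.
    print(majority_voting(["hate", "not_hate", "hate"]))
    # Toy scorer (string length) standing in for a learned judge.
    print(best_of_n(["x", "yyy", "zz"], score=len))
```

The first two functions aggregate all N samples, while `best_of_n` discards all but one, so its quality depends entirely on the scorer. This is consistent with the abstract's observation that the benchmark methods help while BoN does not, though the sketch does not reproduce the paper's analysis.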
Anthology ID:
2025.nlperspectives-1.14
Volume:
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Gavin Abercrombie, Valerio Basile, Simona Frenda, Sara Tonelli, Shiran Dudy
Venues:
NLPerspectives | WS
Publisher:
Association for Computational Linguistics
Pages:
153–170
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.nlperspectives-1.14/
Cite (ACL):
Tomas Ruiz, Siyao Peng, Barbara Plank, and Carsten Schwemmer. 2025. BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet). In Proceedings of the 4th Workshop on Perspectivist Approaches to NLP, pages 153–170, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet) (Ruiz et al., NLPerspectives 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.nlperspectives-1.14.pdf