Confidence as a Tie-Breaker: Reassessing Multilingual Hedging Bias in LLM-as-a-Judge Evaluation

Rajashik Datta, Sanjan Baitalik


Abstract
LLM judges are often used to score generated answers, but their decisions may be affected by surface style rather than semantic correctness. We introduce PolyJudge-Uncertain, a controlled benchmark for studying multilingual hedging effects in LLM-as-a-judge evaluation. The benchmark contains 5,120 short factual QA instances across English, Hindi, Hinglish, and Bengali, balancing assertive versus hedged style and correct versus incorrect answers. A small pilot suggested a large pointwise penalty against hedged answers. After repairing multilingual templates and adding quality-control checks, this pointwise effect largely disappears: final pointwise accuracy is 99.8%, with no meaningful assertive-hedged gap. The robust remaining effect is pairwise: when two answers are equally correct and differ only in style, the judge prefers the assertive answer in 1,276 of 1,280 cases. We interpret this as a protocol- and task-specific assertiveness preference, not as a universal bias against hedging. Our findings highlight benchmark auditing as a central requirement for multilingual judge-bias research.
Anthology ID:
2026.acl-srw.33
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
393–402
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.33/
DOI:
Bibkey:
Cite (ACL):
Rajashik Datta and Sanjan Baitalik. 2026. Confidence as a Tie-Breaker: Reassessing Multilingual Hedging Bias in LLM-as-a-Judge Evaluation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 393–402, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Confidence as a Tie-Breaker: Reassessing Multilingual Hedging Bias in LLM-as-a-Judge Evaluation (Datta & Baitalik, ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.33.pdf