@inproceedings{fu-liu-2025-reliable,
  title     = {How Reliable is Multilingual {LLM}-as-a-Judge?},
  author    = {Fu, Xiyan and
               Liu, Wei},
  editor    = {Christodoulopoulos, Christos and
               Chakraborty, Tanmoy and
               Rose, Carolyn and
               Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: {EMNLP} 2025},
  month     = nov,
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.findings-emnlp.587/},
  doi       = {10.18653/v1/2025.findings-emnlp.587},
  pages     = {11040--11053},
  isbn      = {979-8-89176-335-7},
  abstract  = {LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. Our work provides valuable insights into the limitations of multilingual LLM-as-a-Judge, and sheds light on future research.},
}
Markdown (Informal)
[How Reliable is Multilingual LLM-as-a-Judge?](https://aclanthology.org/2025.findings-emnlp.587/) (Fu & Liu, Findings 2025)
ACL
- Xiyan Fu and Wei Liu. 2025. How Reliable is Multilingual LLM-as-a-Judge? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11040–11053, Suzhou, China. Association for Computational Linguistics.