Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Abstract
Charts are ubiquitous, as they help people understand and reason with data. Recently, various downstream tasks have emerged, such as chart question answering, chart-to-text generation, and fact-checking. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but evaluating them is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation, challenges such as proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria such as factual correctness, informativeness, and relevancy. We additionally analyze LVLM judges in terms of format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (under 10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the accuracy of each LVLM judge. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
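As an illustrative aside (not part of the published record): the abstract's two headline metrics, agreement with GPT-4 verdicts and positional consistency under swapped candidate order, reduce to simple counting over pairwise judgments. The sketch below is a minimal, hypothetical Python illustration of that counting; the `Judge` type, the `evaluate_judge` function, and the "A"/"B" verdict format are assumptions made for this note, not the authors' released code (see the PDF for the actual protocol and rubric).

```python
# Hypothetical sketch of the pairwise-judging metrics named in the abstract:
# agreement with a reference judge (e.g., GPT-4) and positional consistency
# (does the verdict survive swapping the candidate order?). All names here
# are invented for illustration; none come from the paper's codebase.
from typing import Callable, List, Tuple

# A judge takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]

def evaluate_judge(
    judge: Judge,
    items: List[Tuple[str, str, str]],   # (question, answer_a, answer_b)
    reference_verdicts: List[str],       # GPT-4 verdicts ("A"/"B"), in item order
) -> Tuple[float, float]:
    """Return (agreement with reference, positional consistency)."""
    if not items:
        return 0.0, 0.0
    agree = consistent = 0
    for (q, a, b), ref in zip(items, reference_verdicts):
        v1 = judge(q, a, b)              # verdict with the original order
        v2 = judge(q, b, a)              # verdict with the candidates swapped
        # Map the swapped verdict back: picking slot "B" after the swap
        # means the judge still prefers the original answer_a.
        v2_unswapped = "A" if v2 == "B" else "B"
        if v1 == v2_unswapped:           # verdict unaffected by position
            consistent += 1
        if v1 == ref:                    # verdict matches the reference judge
            agree += 1
    n = len(items)
    return agree / n, consistent / n
```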
- Anthology ID:
- 2025.acl-industry.83
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Georg Rehm, Yunyao Li
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1203–1216
- URL:
- https://preview.aclanthology.org/landing_page/2025.acl-industry.83/
- Cite (ACL):
- Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, and Jimmy Huang. 2025. Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 1203–1216, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? (Laskar et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.acl-industry.83.pdf