Amran Bhuiyan
2026
Lost in Translation: Do LVLM Judges Generalize Across Languages?
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Mir Tafseer Nayeem | Amran Bhuiyan | Mizanur Rahman | Shafiq Joty | Enamul Hoque | Jimmy Huang
Findings of the Association for Computational Linguistics: ACL 2026
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Mir Tafseer Nayeem | Amran Bhuiyan | Mizanur Rahman | Shafiq Joty | Enamul Hoque | Jimmy Huang
Findings of the Association for Computational Linguistics: ACL 2026
Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision–language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision–language preference evaluation subset extending VL-RewardBench, and a chart-centric visual–text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
2025
Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Ridwan Mahbub | Mizanur Rahman | Amran Bhuiyan | Israt Jahan | Mir Tafseer Nayeem | Shafiq Joty | Enamul Hoque | Jimmy Xiangji Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Ridwan Mahbub | Mizanur Rahman | Amran Bhuiyan | Israt Jahan | Mir Tafseer Nayeem | Shafiq Joty | Enamul Hoque | Jimmy Xiangji Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost‐efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain‐adaptive transfer learning, in which we fine‐tune a 2B‐parameter VLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA‐Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Ridwan Mahbub | Ahmed Masry | Mizanur Rahman | Amran Bhuiyan | Mir Tafseer Nayeem | Shafiq Joty | Enamul Hoque | Jimmy Xiangji Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Md Tahmid Rahman Laskar | Mohammed Saidul Islam | Ridwan Mahbub | Ahmed Masry | Mizanur Rahman | Amran Bhuiyan | Mir Tafseer Nayeem | Shafiq Joty | Enamul Hoque | Jimmy Xiangji Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
2024
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar | Sawsan Alqahtani | M Saiful Bari | Mizanur Rahman | Mohammad Abdullah Matin Khan | Haidar Khan | Israt Jahan | Amran Bhuiyan | Chee Wei Tan | Md Rizwan Parvez | Enamul Hoque | Shafiq Joty | Jimmy Xiangji Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Md Tahmid Rahman Laskar | Sawsan Alqahtani | M Saiful Bari | Mizanur Rahman | Mohammad Abdullah Matin Khan | Haidar Khan | Israt Jahan | Amran Bhuiyan | Chee Wei Tan | Md Rizwan Parvez | Enamul Hoque | Shafiq Joty | Jimmy Xiangji Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.