Masanari Oi
2026
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Masanari Oi | Masahiro Kaneko | Naoaki Okazaki | Nakamasa Inoue
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion. Our code and data will be made publicly available.
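A minimal sketch of the bottom-up aggregation described in the abstract: score each criterion first, then combine into an overall score. The criterion names and the harmonic-mean combination rule here are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of bottom-up evaluation: per-criterion scores from a
# VLM judge are aggregated into one overall score. The criteria and the
# harmonic-mean rule are assumptions, not the paper's exact method.
from statistics import harmonic_mean

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (e.g., 1-5 Likert) into one overall score.

    A harmonic mean penalizes text that fails badly on any single criterion.
    """
    return harmonic_mean(criterion_scores.values())

# Hypothetical criterion-wise scores for one generated caption.
scores = {"fluency": 4.0, "relevance": 5.0, "descriptiveness": 3.0}
print(f"overall: {overall_score(scores):.2f}")  # overall: 3.83
```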
2024
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
Masanari Oi | Masahiro Kaneko | Ryuto Koike | Mengsay Loem | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL 2024
Large Language Models (LLMs) are widely used as automated metrics to evaluate natural language generation tasks. However, the likelihood, a measure of an LLM’s plausibility for a sentence, can vary with superficial differences between sentences, such as word order and sentence structure. It is therefore possible that LLMs used for evaluation exhibit a likelihood bias: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate this bias, which utilizes highly biased instances as few-shot examples for in-context learning. Our experiments on evaluating the data-to-text and grammatical error correction tasks reveal that several of the LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias while also significantly improving evaluation performance (in terms of correlation with human scores).
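A hedged sketch of the selection step implied by the abstract: pairs of meaning-equivalent outputs whose evaluator scores diverge most across surface forms are the "highly biased instances" reused in-context. The `Pair` structure and function names are hypothetical stand-ins; the paper's actual bias measure and interface may differ.

```python
# Hedged sketch of the mitigation idea: find instances where the evaluator
# favors the high-likelihood surface form of meaning-equivalent text, then
# reuse the most biased ones as few-shot examples with their gold scores.
# `Pair`, `bias_gap`, and `llm_score` are illustrative, not the paper's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pair:
    high_lik_text: str   # variant the evaluator assigns higher likelihood
    low_lik_text: str    # superficially different, same meaning and quality
    gold_score: float    # shared human score for both variants

def bias_gap(pair: Pair, llm_score: Callable[[str], float]) -> float:
    """How strongly the evaluator favors the high-likelihood surface form."""
    return llm_score(pair.high_lik_text) - llm_score(pair.low_lik_text)

def select_fewshot(pairs: list[Pair], llm_score: Callable[[str], float],
                   k: int = 4) -> list[Pair]:
    """Pick the k most biased pairs to show in-context with gold scores."""
    return sorted(pairs, key=lambda p: bias_gap(p, llm_score), reverse=True)[:k]
```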