Large-scale Vision-Language Models (LVLMs) process both images and text and excel at multimodal tasks such as image captioning and description generation. However, while these models are strong at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each paired with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, indicating the need for further advancement. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.
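As an illustrative sketch only (the correlation metric and the rankings below are assumptions, not details taken from the abstract above), alignment between an LVLM's ranking of the five critic review texts and the annotated human ranking could be measured with a rank correlation:

```python
from scipy.stats import spearmanr

# Hypothetical rankings of five critic review texts for one image.
human_rank = [1, 2, 3, 4, 5]   # gold ranking from human annotators
model_rank = [2, 1, 3, 5, 4]   # ranking produced by an LVLM

# Spearman's rho measures how closely the model's ordering matches the human one.
rho, p_value = spearmanr(human_rank, model_rank)
print(f"Correlation with human annotation: {rho:.2f} (p={p_value:.3f})")
```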
Large language models (LLMs) achieve high performance through instruction tuning, in which they learn various tasks via instruction templates. However, these templates often contain task-specific expressions: words that frequently appear in certain contexts but do not always convey the actual meaning of those contexts, even if they seem closely related to the target task. Biases inherent in such instruction templates may be learned by LLMs during training, potentially degrading performance when the models encounter such superficial expressions. In this study, we propose a method that adds extra instructions to FLAN templates without altering the base instruction, producing “superfluous instructions”. This allows us to investigate the vulnerabilities of LLMs caused by overfitting to task-specific expressions embedded in instruction templates. The experimental results reveal that including superficial words strongly related to each task in the instruction text can alter the output, regardless of the intended meaning.
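A minimal, hypothetical sketch of appending a superfluous instruction to a FLAN-style template without touching the base instruction; the template wording and the added sentence are invented for illustration and are not the exact prompts used in the study:

```python
# Base instruction stays unchanged; a superfluous sentence containing
# task-related surface words is appended after it.
base_template = (
    "Does the premise entail the hypothesis?\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:"
)
superfluous = " Note: the word 'entailment' often appears in unrelated sentences."

prompt = base_template.format(
    premise="A man is playing a guitar on stage.",
    hypothesis="A man is performing music.",
) + superfluous

print(prompt)
```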
The proportion of responses across a question's answer options, known as the response distribution, enables detailed analysis of human society. Recent studies highlight the use of Large Language Models (LLMs) to predict response distributions as a cost-effective survey method. However, the reliability of these predictions remains unclear, because LLMs often generate answers by blindly following instructions rather than applying rational reasoning based on pretraining-acquired knowledge. This study investigates whether LLMs can rationally estimate distributions when presented with explanations of “artificially generated distributions” that contradict common sense. Specifically, we assess whether LLMs recognize counterintuitive explanations and adjust their predictions, or simply follow these inconsistent explanations. Results indicate that smaller or less human-optimized LLMs tend to follow explanations uncritically, whereas larger or more optimized models are better at resisting counterintuitive explanations by leveraging their pretraining-acquired knowledge. These findings shed light on the factors influencing distribution prediction performance in LLMs and are crucial for developing reliable distribution prediction with language models.
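As a hedged sketch (the distance metric and all numbers are assumptions introduced for illustration), one way to check whether a model's predicted distribution follows the counterintuitive explanation or stays close to common sense is to compare its distance to each:

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

commonsense      = [0.70, 0.20, 0.10]  # plausible distribution over 3 options
counterintuitive = [0.05, 0.15, 0.80]  # artificially generated explanation
llm_prediction   = [0.60, 0.25, 0.15]  # distribution predicted by the model

if total_variation(llm_prediction, counterintuitive) < total_variation(llm_prediction, commonsense):
    print("Model followed the counterintuitive explanation.")
else:
    print("Model resisted it, relying on pretraining-acquired knowledge.")
```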