Vision-Language Models under Cultural and Inclusive Considerations

Antonia Karamolegkou, Phillip Rust, Ruixiang Cui, Yong Cao, Anders Søgaard, Daniel Hershcovich


Abstract
Large vision-language models can assist visually impaired individuals by describing images they capture in their daily lives. Current evaluation datasets may not reflect the diverse cultural backgrounds of users or the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset of images taken by people who are blind. We then evaluate several models and prompts, investigating their reliability as visual assistants. While the evaluation results for state-of-the-art models are promising, we identify weak spots such as hallucinations and problems with conventional evaluation metrics. Our survey, data, code, and model outputs will be publicly available.
Anthology ID:
2024.hucllm-1.5
Volume:
Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Month:
August
Year:
2024
Address:
TBD
Editors:
Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, H. Andrew Schwartz
Venues:
HuCLLM | WS
Publisher:
ACL
Pages:
53–66
URL:
https://aclanthology.org/2024.hucllm-1.5
Cite (ACL):
Antonia Karamolegkou, Phillip Rust, Ruixiang Cui, Yong Cao, Anders Søgaard, and Daniel Hershcovich. 2024. Vision-Language Models under Cultural and Inclusive Considerations. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, pages 53–66, TBD. ACL.
Cite (Informal):
Vision-Language Models under Cultural and Inclusive Considerations (Karamolegkou et al., HuCLLM-WS 2024)
PDF:
https://aclanthology.org/2024.hucllm-1.5.pdf