WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification

Yiwen Jiang; Deval Mehta; Siyuan Yan; Yaling Shen; Zimu Wang; Zongyuan Ge

Abstract

Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.

Anthology ID: 2025.emnlp-main.741
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 14685–14696
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.741/
Cite (ACL): Yiwen Jiang, Deval Mehta, Siyuan Yan, Yaling Shen, Zimu Wang, and Zongyuan Ge. 2025. WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14685–14696, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification (Jiang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.741.pdf
Checklist: 2025.emnlp-main.741.checklist.pdf
