Jianghong Ma

2026

AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Xiping Li | Jianghong Ma
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Interleaved-Modal Chain-of-Thought (I-MCoT) advances vision-language reasoning tasks such as Visual Question Answering (VQA). This paradigm integrates specially selected visual evidence from the input image into the context of Vision-Language Models (VLMs), enabling them to ground their reasoning logic in these details. Accordingly, the efficacy of an I-MCoT framework relies on identifying *what* to see (evidence selection) and *when* to see it (triggering of insertions). However, existing methods fall short in both respects. First, for selection, they rely on attention signals, which are unreliable, particularly under severe granularity imbalance between the brief textual query and the informative image. Second, for triggering, they adopt static triggers, which fail to capture the VLMs’ dynamic needs for visual evidence. To this end, we propose a novel I-MCoT framework, **A**ctive **I**nformation-driven **M**ulti-modal **C**hain-**o**f-**T**hought (**AIM-CoT**), which aims to improve both evidence selection and insertion triggering via: (1) **Context-enhanced Attention-map Generation (CAG)** to mitigate granularity imbalance via textual context enhancement; (2) **Active Visual Probing (AVP)** to proactively select the most informative evidence via an information foraging process; and (3) **Dynamic Attention-shift Trigger (DAT)** to precisely activate insertions when the VLM’s attention shifts from text to visual context. Experiments across three benchmarks and four backbones demonstrate AIM-CoT’s consistent superiority. Our code is available at https://anonymous.4open.science/r/AIMCoT.

2023

Asymmetric feature interaction for interpreting model predictions
Xiaolei Lu | Jianghong Ma | Haode Zhang
Findings of the Association for Computational Linguistics: ACL 2023

In natural language processing (NLP), deep neural networks (DNNs) can model complex interactions within context and have achieved impressive results on a range of NLP tasks. Prior work on feature interaction attribution mainly studies symmetric interaction, which only explains the additional influence of a set of words in combination and fails to capture the asymmetric influence that contributes to model prediction. In this work, we propose an asymmetric feature interaction attribution explanation model that aims to explore asymmetric higher-order feature interactions in the inference of deep neural NLP models. By representing our explanation with a directed interaction graph, we experimentally demonstrate the interpretability of the graph in discovering asymmetric feature interactions. Experimental results on two sentiment classification datasets show the superiority of our model over state-of-the-art feature interaction attribution methods in identifying influential features for model predictions.
