Wan Zhang


2021

pdf
LRRA:A Transparent Neural-Symbolic Reasoning Framework for Real-World Visual Question Answering
Wan Zhang | Chen Keming | Zhang Yujie | Xu Jinan | Chen Yufeng
Proceedings of the 20th Chinese National Conference on Computational Linguistics

The predominant approach of visual question answering (VQA) relies on encoding the imageand question with a ”black box” neural encoder and decoding a single token into answers suchas ”yes” or ”no”. Despite this approach’s strong quantitative results it struggles to come up withhuman-readable forms of justification for the prediction process. To address this insufficiency we propose LRRA[LookReadReasoningAnswer]a transparent neural-symbolic framework forvisual question answering that solves the complicated problem in the real world step-by-steplike humans and provides human-readable form of justification at each step. Specifically LRRAlearns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scenegraph using a recurrent neural-symbolic execution module. Finally it generates answers to the given questions and makes corresponding marks on the image. Furthermore we believe that the relations between objects in the question is of great significance for obtaining the correct answerso we create a perturbed GQA test set by removing linguistic cues (attributes and relations) in the questions to analyze which part of the question contributes more to the answer. Our experimentson the GQA dataset show that LRRA is significantly better than the existing representative model(57.12% vs. 56.39%). Our experiments on the perturbed GQA test set show that the relations between objects is more important for answering complicated questions than the attributes ofobjects.Keywords:Visual Question Answering Relations Between Objects Neural-Symbolic Reason-ing.