Jusang Oh
2026
MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
Sieun Hyeon | Jusang Oh | Sunghwan Steve Cho | Jaeyoung Do
Findings of the Association for Computational Linguistics: ACL 2026
Sieun Hyeon | Jusang Oh | Sunghwan Steve Cho | Jaeyoung Do
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDASLab/MATA.
2024
Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
Sungkyung Kim | Adam Lee | Junyoung Park | Andrew Chung | Jusang Oh | Jay-Yoon Lee
Findings of the Association for Computational Linguistics: EMNLP 2024
Sungkyung Kim | Adam Lee | Junyoung Park | Andrew Chung | Jusang Oh | Jay-Yoon Lee
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent advancements in large language models have demonstrated enhanced capabilities in visual reasoning tasks by employing additional encoders for aligning different modalities. While the Q-Former has been widely used as a general encoder for aligning several modalities including image, video, audio, and 3D with large language models, previous works on its efficient training and the analysis of its individual components have been limited. In this work, we investigate the effectiveness of parameter efficient fine-tuning (PEFT) the Q-Former using InstructBLIP with visual reasoning benchmarks ScienceQA and IconQA. We observe that applying PEFT to the Q-Former achieves comparable performance to full fine-tuning using under 2% of the trainable parameters. Additionally, we employ AdaLoRA for dynamic parameter budget reallocation to examine the relative importance of the Q-Former’s sublayers with 4 different benchmarks. Our findings reveal that the self-attention layers are noticeably more important in perceptual visual-language reasoning tasks, and relative importance of FFN layers depends on the complexity of visual-language patterns involved in tasks. The code is available at https://github.com/AttentionX/InstructBLIP_PEFT.