RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering

Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich


Abstract
Tables can be represented either as text or as images. Previous work on table question answering (TQA) typically relies on only one representation, neglecting the potential benefits of combining both. In this work, we explore integrating textual and visual table representations using multi-modal large language models (MLLMs) for TQA. Specifically, we propose RITT, a retrieval-assisted framework that first identifies the most relevant part of a table for a given question, then dynamically selects the optimal table representations based on the question type. Experiments demonstrate that our framework significantly outperforms the baseline MLLMs by an average of 13 Exact Match points and surpasses two text-only state-of-the-art TQA methods on four TQA benchmarks, highlighting the benefits of leveraging both textual and visual table representations.
Anthology ID:
2025.trl-1.8
Volume:
Proceedings of the 4th Table Representation Learning Workshop
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Shuaichen Chang, Madelon Hulsebos, Qian Liu, Wenhu Chen, Huan Sun
Venues:
TRL | WS
Publisher:
Association for Computational Linguistics
Pages:
86–97
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.trl-1.8/
Cite (ACL):
Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025. RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering. In Proceedings of the 4th Table Representation Learning Workshop, pages 86–97, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering (Zhou et al., TRL 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.trl-1.8.pdf