MatViX: Multimodal Information Extraction from Visually Rich Articles

Ghazal Khalighinejad, Sharon Scott, Ollie Liu, Kelly L. Anderson, Rickard Stureborg, Aman Tyagi, Bhuwan Dhingra


Abstract
Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data is often spread across text, figures, and tables. In materials science, extracting structured information from research articles can accelerate the discovery of new materials. However, the multimodal nature and complex interconnections of scientific content present challenges for traditional text-based methods. We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured JSON files, carefully curated by domain experts in polymer nanocomposites and biodegradation. These JSON files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE. We introduce a novel evaluation method to assess the accuracy of curve similarity and the alignment of hierarchical structures. Additionally, we benchmark, in a zero-shot manner, vision-language models (VLMs) capable of processing long contexts and multimodal inputs. Our results demonstrate significant room for improvement in current models.
Anthology ID:
2025.naacl-long.185
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3636–3655
URL:
https://preview.aclanthology.org/moar-dois/2025.naacl-long.185/
DOI:
10.18653/v1/2025.naacl-long.185
Cite (ACL):
Ghazal Khalighinejad, Sharon Scott, Ollie Liu, Kelly L. Anderson, Rickard Stureborg, Aman Tyagi, and Bhuwan Dhingra. 2025. MatViX: Multimodal Information Extraction from Visually Rich Articles. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3636–3655, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
MatViX: Multimodal Information Extraction from Visually Rich Articles (Khalighinejad et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.naacl-long.185.pdf