A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature

Van-Thuy Phi, Dinh-Truong Do, Hoang-An Trieu, Yuji Matsumoto


Abstract
Extracting structured information from tables in scientific literature is a critical yet challenging task for building domain-specific knowledge bases. This paper addresses extraction of 5-ary polymer property tuples: (POLYMER, PROP_NAME, PROP_VALUE, CONDITION, CHAR_METHOD). We introduce and systematically compare two distinct methodologies: (1) a novel two-stage Hybrid Pipeline that first utilizes Large Language Models (LLMs) for table-to-text conversion, which is then processed by specialized text-based extraction models; and (2) an end-to-end Direct LLM Extraction approach. To evaluate these methods, we employ a systematic, domain-aligned evaluation setup based on the expert-curated PoLyInfo database. Our results demonstrate the clear superiority of the hybrid pipeline. When using Claude Sonnet 4.5 for the linearization stage, the pipeline achieves a score of 67.92% F1@PoLyInfo, significantly outperforming the best direct LLM extraction approach (Claude Sonnet 4.5 at 56.66%). This work establishes the effectiveness of a hybrid architecture that combines the generative strengths of LLMs with the precision of specialized supervised models for structured data extraction.
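As a rough illustration of the task described above, the sketch below models the 5-ary tuple as a Python type, shows a template-based stand-in for the table-to-text (linearization) stage, and computes a plain set-based F1. All names here (PolymerProperty, linearize_row, f1_exact) are hypothetical. The paper uses LLMs for linearization and reports F1@PoLyInfo, whose exact definition is not given in the abstract, so the exact-match F1 below is only an assumed simplification and not the authors' metric.

```python
from typing import NamedTuple, Optional, Sequence


class PolymerProperty(NamedTuple):
    """One 5-ary tuple; field names mirror the abstract's slots (hypothetical type)."""
    polymer: str
    prop_name: str
    prop_value: str
    condition: Optional[str] = None
    char_method: Optional[str] = None


def linearize_row(header: Sequence[str], row: Sequence[str], caption: str = "") -> str:
    """Template-based stand-in for the LLM table-to-text stage (assumption).

    Assumes the first column names the polymer and the remaining
    columns hold property values.
    """
    subject = row[0]
    clauses = [
        f"{name} is {value}"
        for name, value in zip(header[1:], row[1:])
        if value.strip()  # skip empty cells
    ]
    sentence = f"For {subject}, " + "; ".join(clauses) + "."
    return f"{caption} {sentence}".strip()


def f1_exact(predicted, gold) -> float:
    """Set-based exact-match F1 over tuples (assumed simplification)."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In a hybrid setup of this kind, the text returned by the linearization step would be passed to a text-based extraction model, and its predicted tuples compared against reference tuples with a metric such as the one sketched here.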
Anthology ID:
2025.wasp-main.11
Volume:
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Month:
December
Year:
2025
Address:
Mumbai, India and virtual
Editors:
Alberto Accomazzi, Tirthankar Ghosal, Felix Grezes, Kelly Lockhart
Venues:
WASP | WS
Publisher:
Association for Computational Linguistics
Pages:
94–102
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wasp-main.11/
Cite (ACL):
Van-Thuy Phi, Dinh-Truong Do, Hoang-An Trieu, and Yuji Matsumoto. 2025. A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature. In Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications, pages 94–102, Mumbai, India and virtual. Association for Computational Linguistics.
Cite (Informal):
A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature (Phi et al., WASP 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wasp-main.11.pdf