A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature
Van-Thuy Phi, Dinh-Truong Do, Hoang-An Trieu, Yuji Matsumoto
Abstract
Extracting structured information from tables in scientific literature is a critical yet challenging task for building domain-specific knowledge bases. This paper addresses extraction of 5-ary polymer property tuples: (POLYMER, PROP_NAME, PROP_VALUE, CONDITION, CHAR_METHOD). We introduce and systematically compare two distinct methodologies: (1) a novel two-stage Hybrid Pipeline that first utilizes Large Language Models (LLMs) for table-to-text conversion, which is then processed by specialized text-based extraction models; and (2) an end-to-end Direct LLM Extraction approach. To evaluate these methods, we employ a systematic, domain-aligned evaluation setup based on the expert-curated PoLyInfo database. Our results demonstrate the clear superiority of the hybrid pipeline. When using Claude Sonnet 4.5 for the linearization stage, the pipeline achieves a score of 67.92% F1@PoLyInfo, significantly outperforming the best direct LLM extraction approach (Claude Sonnet 4.5 at 56.66%). This work establishes the effectiveness of a hybrid architecture that combines the generative strengths of LLMs with the precision of specialized supervised models for structured data extraction.
- Anthology ID:
- 2025.wasp-main.11
- Volume:
- Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India and virtual
- Editors:
- Alberto Accomazzi, Tirthankar Ghosal, Felix Grezes, Kelly Lockhart
- Venues:
- WASP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 94–102
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wasp-main.11/
- Cite (ACL):
- Van-Thuy Phi, Dinh-Truong Do, Hoang-An Trieu, and Yuji Matsumoto. 2025. A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature. In Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications, pages 94–102, Mumbai, India and virtual. Association for Computational Linguistics.
- Cite (Informal):
- A Hybrid LLM and Supervised Model Pipeline for Polymer Property Extraction from Tables in Scientific Literature (Phi et al., WASP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wasp-main.11.pdf
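Illustrative sketch:
The two-stage design described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the names `PolymerProperty`, `hybrid_extract`, and `exact_match_f1` are hypothetical, the two stages are passed in as plain callables standing in for the LLM linearizer and the supervised extractor, and the scoring function is a generic exact-match set F1 that is only assumed to resemble F1@PoLyInfo, whose precise definition the abstract does not give.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Set


class PolymerProperty(NamedTuple):
    """One 5-ary tuple; field names mirror the labels used in the abstract."""
    polymer: str
    prop_name: str
    prop_value: str
    condition: Optional[str] = None    # assumed optional for illustration
    char_method: Optional[str] = None  # assumed optional for illustration


def hybrid_extract(
    table: str,
    linearize: Callable[[str], str],
    extract: Callable[[str], Iterable[PolymerProperty]],
) -> Set[PolymerProperty]:
    """Stage 1 converts the table to text; stage 2 extracts tuples from that text."""
    return set(extract(linearize(table)))


def exact_match_f1(predicted: Set[PolymerProperty], gold: Set[PolymerProperty]) -> float:
    """Set-based F1 over exact tuple matches (an assumed, simplified metric)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Keeping the stages as injected callables means the linearizer can be swapped (for example, between different LLMs) without touching the extractor, which is the kind of comparison the abstract reports.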