Extracting Material Property Measurement Data from Scientific Articles
Gihan Panapitiya, Fred Parks, Jonathan Sepulveda, Emily Saldanha
Abstract
Machine learning-based prediction of material properties is often hampered by the lack of sufficiently large training data sets. The majority of such measurement data is embedded in scientific literature and the ability to automatically extract these data is essential to support the development of reliable property prediction methods. In this work, we describe a methodology for developing an automatic property extraction framework using material solubility as the target property. We create a training and evaluation data set containing tags for solubility-related entities using a combination of regular expressions and manual tagging. We then compare five entity recognition models leveraging both token-level and span-level architectures on the task of classifying solute names, solubility values, and solubility units. Additionally, we explore a novel pretraining approach that leverages automated chemical name and quantity extraction tools to generate large datasets that do not rely on intensive manual tagging. Finally, we perform an analysis to identify the causes of classification errors.
- Anthology ID:
- 2021.emnlp-main.438
- Volume:
- Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Online and Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5393–5402
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.438/
- DOI:
- 10.18653/v1/2021.emnlp-main.438
- Cite (ACL):
- Gihan Panapitiya, Fred Parks, Jonathan Sepulveda, and Emily Saldanha. 2021. Extracting Material Property Measurement Data from Scientific Articles. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5393–5402, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Extracting Material Property Measurement Data from Scientific Articles (Panapitiya et al., EMNLP 2021)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.438.pdf
- Data:
- S2ORC