Extracting Material Property Measurement Data from Scientific Articles

Gihan Panapitiya, Fred Parks, Jonathan Sepulveda, Emily Saldanha


Abstract
Machine learning-based prediction of material properties is often hampered by the lack of sufficiently large training datasets. Most such measurement data are embedded in the scientific literature, and the ability to extract these data automatically is essential to support the development of reliable property prediction methods. In this work, we describe a methodology for developing an automatic property extraction framework using material solubility as the target property. We create a training and evaluation dataset containing tags for solubility-related entities using a combination of regular expressions and manual tagging. We then compare five entity recognition models leveraging both token-level and span-level architectures on the task of classifying solute names, solubility values, and solubility units. Additionally, we explore a novel pretraining approach that leverages automated chemical name and quantity extraction tools to generate large datasets that do not rely on intensive manual tagging. Finally, we perform an analysis to identify the causes of classification errors.
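As an illustration of the regular-expression tagging step mentioned in the abstract, the following is a minimal sketch of how solubility values and units might be pre-tagged before manual review. The unit list, the pattern, and the function name `tag_solubility` are hypothetical and are not taken from the paper; the abstract does not specify the authors' actual expressions.

```python
import re

# Hypothetical unit list for illustration only; the paper's actual
# patterns are not given in the abstract.
UNIT = r"(?:mg/mL|g/L|mol/L|mM|M)"

# A number followed by one of the units above. The lookahead rejects
# units that run into a longer word (e.g. "5 Min").
PATTERN = re.compile(
    rf"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>{UNIT})(?![A-Za-z])"
)


def tag_solubility(text):
    """Return (label, start, end, surface) spans for value/unit pairs."""
    spans = []
    for match in PATTERN.finditer(text):
        for label in ("value", "unit"):
            spans.append(
                (label.upper(), match.start(label), match.end(label), match.group(label))
            )
    return spans


sentence = "The solubility of caffeine in water is 21.6 mg/mL at 25 C."
spans = tag_solubility(sentence)
for label, start, end, surface in spans:
    print(label, start, end, surface)
```

Character-offset spans of this kind could then be corrected by hand and converted to token-level or span-level labels, depending on the model architecture being trained.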
Anthology ID:
2021.emnlp-main.438
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5393–5402
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.438/
DOI:
10.18653/v1/2021.emnlp-main.438
Cite (ACL):
Gihan Panapitiya, Fred Parks, Jonathan Sepulveda, and Emily Saldanha. 2021. Extracting Material Property Measurement Data from Scientific Articles. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5393–5402, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Extracting Material Property Measurement Data from Scientific Articles (Panapitiya et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.438.pdf
Video:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.438.mp4
Data
S2ORC