Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned
Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma, Scott Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, Svitlana Volkova
Abstract
Foundation models pre-trained on large corpora demonstrate significant gains across many natural language processing tasks and domains, e.g., law, healthcare, and education. However, only limited efforts have investigated the opportunities and limitations of applying these powerful models to science and security applications. In this work, we develop foundation models of scientific knowledge for chemistry to augment scientists with the ability to perceive and reason at a previously unimagined scale. Specifically, we build large-scale (1.47B-parameter) general-purpose models for chemistry that can be used effectively to perform a wide range of in-domain and out-of-domain tasks. Evaluating these models in a zero-shot setting, we analyze the effect of model and data scaling, knowledge depth, and temporality on model performance in the context of model training efficiency. Our novel findings demonstrate that (1) model size contributes significantly to task performance when evaluated in a zero-shot setting; (2) data quality (i.e., diversity) affects model performance more than data quantity; (3) unlike in previous work, the temporal order of the documents in the corpus boosts model performance only for specific tasks, e.g., SciQ; and (4) models pre-trained from scratch perform better on in-domain tasks than those tuned from general-purpose models such as OpenAI's GPT-2.
- Anthology ID:
- 2022.bigscience-1.12
- Volume:
- Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
- Month:
- May
- Year:
- 2022
- Address:
- virtual+Dublin
- Editors:
- Angela Fan, Suzana Ilic, Thomas Wolf, Matthias Gallé
- Venue:
- BigScience
- Publisher:
- Association for Computational Linguistics
- Pages:
- 160–172
- URL:
- https://aclanthology.org/2022.bigscience-1.12
- DOI:
- 10.18653/v1/2022.bigscience-1.12
- Cite (ACL):
- Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma, Scott Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, and Svitlana Volkova. 2022. Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned. In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, pages 160–172, virtual+Dublin. Association for Computational Linguistics.
- Cite (Informal):
- Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned (Horawalavithana et al., BigScience 2022)
- PDF:
- https://aclanthology.org/2022.bigscience-1.12.pdf
- Code
- eleutherai/gpt-neox
- Data
- BLUE, BoolQ, CORD-19, LAMBADA, MathQA, OpenBookQA, PIQA, PubMedQA, S2ORC, SciQ, The Pile, WSC, WebText, WiC
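The zero-shot evaluation the abstract describes applies the pre-trained model to tasks such as SciQ without task-specific fine-tuning, typically by scoring each answer choice with the language model's likelihood and picking the best-scoring one. Below is a minimal, hypothetical sketch of that scoring loop; `sequence_log_prob` is a stand-in for a real model scorer (the paper's models are built with `eleutherai/gpt-neox`), and the toy implementation here exists only so the sketch runs end to end.

```python
def sequence_log_prob(prompt: str, continuation: str) -> float:
    """Stand-in for a real LM scorer: should return the log-probability of
    `continuation` given `prompt` under the model. This toy version simply
    penalizes longer continuations so the example is self-contained."""
    return -float(len(continuation))

def zero_shot_choice(question: str, choices: list) -> int:
    """Pick the answer choice with the highest length-normalized
    log-likelihood under the model -- no task-specific training."""
    prompt = f"Question: {question}\nAnswer:"
    scores = []
    for choice in choices:
        lp = sequence_log_prob(prompt, f" {choice}")
        # Normalize by choice length so long answers are not unfairly penalized.
        scores.append(lp / max(len(choice), 1))
    return max(range(len(choices)), key=scores.__getitem__)
```

In a real evaluation, `sequence_log_prob` would sum per-token log-probabilities from the trained model over the continuation, and accuracy would be averaged over the benchmark's questions.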