Abstract
Knowledge graphs (KGs) have emerged as a powerful tool for organizing and integrating complex information, making them a suitable format for scientific knowledge. However, translating scientific knowledge into KGs is challenging because scientific documents use a wide variety of styles and elements to present data and ideas. Although efforts for KG extraction (KGE) from scientific documents exist, evaluation remains challenging and field-dependent, and existing benchmarks do not focus on scientific information. Furthermore, establishing a general benchmark for this task is challenging as not all scientific knowledge has a ground-truth KG representation, making any benchmark prone to ambiguity. Here we propose the Graph of Organic Synthesis Benchmark (GOSyBench), a benchmark for KG extraction from scientific documents in chemistry that leverages the native KG-like structure of synthetic routes in organic chemistry. We develop KG-extraction algorithms based on LLMs (GPT-4, Claude, Mistral) and VLMs (GPT-4o), the best of which reaches 73% recovery accuracy and 59% precision, leaving substantial room for improvement. We expect GOSyBench can serve as a valuable resource for evaluating and advancing KGE methods in the scientific domain, ultimately facilitating better organization, integration, and discovery of scientific knowledge.
- Anthology ID:
- 2024.langmol-1.9
- Volume:
- Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
- Venues:
- LangMol | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 74–84
- URL:
- https://aclanthology.org/2024.langmol-1.9
- DOI:
- 10.18653/v1/2024.langmol-1.9
- Cite (ACL):
- Andres M Bran, Zlatko Jončev, and Philippe Schwaller. 2024. Knowledge Graph Extraction from Total Synthesis Documents. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 74–84, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Knowledge Graph Extraction from Total Synthesis Documents (M Bran et al., LangMol-WS 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.langmol-1.9.pdf