Improving Chemical Understanding of LLMs via SMILES Parsing

Yunhui Jang, Jaehyung Kim, Sungsoo Ahn


Abstract
Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, failing even at basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best performance or is competitive with the baseline on the Mol-Instructions benchmark.
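To make concrete what a "clean and deterministic" SMILES parsing task can look like, the ring-counting example mentioned in the abstract can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation or the CLEANMOL task definition: the helper name `count_ring_closures` is hypothetical, and it assumes a syntactically valid SMILES string in which no ring closure spans a "."-separated fragment. It counts completed ring-closure label pairs, which under those assumptions equals the number of independent cycles in the molecular graph.

```python
DIGITS = "0123456789"


def count_ring_closures(smiles: str) -> int:
    """Count completed ring-closure pairs in a SMILES string.

    Hypothetical helper for illustration; it does not validate chemistry.
    """
    open_labels = set()  # ring-closure labels opened but not yet closed
    closed = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts, or charges.
            end = smiles.find("]", i)
            if end == -1:
                raise ValueError("unclosed bracket atom")
            i = end + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label, written as %nn.
            pair = smiles[i + 1 : i + 3]
            if len(pair) != 2 or any(c not in DIGITS for c in pair):
                raise ValueError("malformed %nn ring-closure label")
            label = int(pair)
            i += 3
        elif ch in DIGITS:
            # Single-digit ring-closure label.
            label = int(ch)
            i += 1
        else:
            # Atom symbols, bonds, branches, and dots carry no ring labels.
            i += 1
            continue
        if label in open_labels:
            open_labels.remove(label)  # second occurrence closes the ring
            closed += 1
        else:
            open_labels.add(label)  # first occurrence opens the ring
    if open_labels:
        raise ValueError("unmatched ring-closure label(s)")
    return closed


print(count_ring_closures("c1ccccc1"))        # benzene: 1
print(count_ring_closures("c1ccc2ccccc2c1"))  # naphthalene: 2
print(count_ring_closures("CCO"))             # ethanol: 0
```

Because the answer follows mechanically from the string, a task of this kind has a single verifiable label per molecule, which is the property the abstract attributes to its parsing tasks. Cheminformatics toolkits may report a different ring count for heavily fused or caged systems, depending on the ring-perception convention they use.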
Anthology ID:
2025.emnlp-main.791
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15694–15709
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.791/
Cite (ACL):
Yunhui Jang, Jaehyung Kim, and Sungsoo Ahn. 2025. Improving Chemical Understanding of LLMs via SMILES Parsing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15694–15709, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Improving Chemical Understanding of LLMs via SMILES Parsing (Jang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.791.pdf
Checklist:
2025.emnlp-main.791.checklist.pdf