Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction

George Arthur Baker, Mario Sanz-Guerrero, Katharina von der Wense


Abstract
Large Language Models (LLMs) have demonstrated capabilities on natural language formulations of molecular property prediction tasks, but little is known about how performance depends on the way input molecules are represented. The status quo is to use SMILES strings, although alternative chemical notations convey molecular information differently, each with its own strengths and weaknesses. To learn more about molecular string representation preferences in LLMs, we compare the performance of four recent models (GPT-4o, Gemini 1.5 Pro, Llama 3.1 405b, and Mistral Large 2) on molecular property prediction tasks from the MoleculeNet benchmark across five molecular string representations: SMILES, DeepSMILES, SELFIES, InChI, and IUPAC names. We find statistically significant zero- and few-shot preferences for InChI and IUPAC names, potentially due to representation granularity, favorable tokenization, and prevalence in pretraining corpora. This contradicts the previous assumption that molecules should be presented to LLMs as SMILES strings. When these preferences are exploited, few-shot performance rivals or surpasses many conventional approaches to property prediction, with the added benefit, not shared by task-specific models, of explainable predictions through chain-of-thought reasoning.
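For illustration only (not from the paper): a minimal Python sketch of how a single molecule renders under four of the five compared string representations, assuming the rdkit, selfies, and deepsmiles packages are installed. IUPAC names have no local converter and are typically retrieved from a lookup service such as PubChem.

from rdkit import Chem
import selfies as sf
import deepsmiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)

print("SMILES:    ", smiles)
# DeepSMILES rewrites rings and branches into a more linear form
print("DeepSMILES:", deepsmiles.Converter(rings=True, branches=True).encode(smiles))
# SELFIES guarantees syntactic validity of every token string
print("SELFIES:   ", sf.encoder(smiles))
# InChI is a layered, canonical identifier
print("InChI:     ", Chem.MolToInchi(mol))
# IUPAC name: no local RDKit function; would require an external service (hypothetical step)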
Anthology ID:
2025.emnlp-main.56
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1071–1085
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.56/
Cite (ACL):
George Arthur Baker, Mario Sanz-Guerrero, and Katharina von der Wense. 2025. Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1071–1085, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction (Baker et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.56.pdf
Checklist:
2025.emnlp-main.56.checklist.pdf