Rethinking NLP for Chemistry: A Critical Look at the USPTO Benchmark
Derin Ozer, Nicolas Gutowski, Benoit Da Mota, Thomas Cauchy, Sylvain Lamprier
Abstract
Natural Language Processing (NLP) has catalyzed a paradigm shift in Computer-Aided Synthesis Planning (CASP), reframing chemical synthesis prediction as a sequence-to-sequence modeling problem over molecular string representations like SMILES. This framing has enabled the direct application of language models to chemistry, yielding impressive benchmark scores on the USPTO dataset, a large text corpus of reactions extracted from US patents. However, we show that USPTO’s patent-derived data are both industrially biased and incomplete. They omit many fundamental transformations essential for practical real-world synthesis. Consequently, models trained exclusively on USPTO perform poorly on simple, pharmaceutically relevant reactions despite high benchmark scores. Our findings highlight a broader concern in applying standard NLP pipelines to scientific domains without rethinking data and evaluation: models may learn dataset artifacts rather than domain reasoning. We argue for the development of chemically meaningful benchmarks, greater data diversity, and interdisciplinary dialogue between the NLP community and domain experts to ensure real-world applicability.
- Anthology ID:
- 2025.findings-emnlp.1242
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 22813–22825
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1242/
- DOI:
- 10.18653/v1/2025.findings-emnlp.1242
- Cite (ACL):
- Derin Ozer, Nicolas Gutowski, Benoit Da Mota, Thomas Cauchy, and Sylvain Lamprier. 2025. Rethinking NLP for Chemistry: A Critical Look at the USPTO Benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22813–22825, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Rethinking NLP for Chemistry: A Critical Look at the USPTO Benchmark (Ozer et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1242.pdf