Nicolas Gutowski
2025
Rethinking NLP for Chemistry: A Critical Look at the USPTO Benchmark
Derin Ozer | Nicolas Gutowski | Benoit Da Mota | Thomas Cauchy | Sylvain Lamprier
Findings of the Association for Computational Linguistics: EMNLP 2025
Natural Language Processing (NLP) has catalyzed a paradigm shift in Computer-Aided Synthesis Planning (CASP), reframing chemical synthesis prediction as a sequence-to-sequence modeling problem over molecular string representations like SMILES. This framing has enabled the direct application of language models to chemistry, yielding impressive benchmark scores on the USPTO dataset, a large text corpus of reactions extracted from US patents. However, we show that USPTO’s patent-derived data are both industrially biased and incomplete. They omit many fundamental transformations essential for practical real-world synthesis. Consequently, models trained exclusively on USPTO perform poorly on simple, pharmaceutically relevant reactions despite high benchmark scores. Our findings highlight a broader concern in applying standard NLP pipelines to scientific domains without rethinking data and evaluation: models may learn dataset artifacts rather than domain reasoning. We argue for the development of chemically meaningful benchmarks, greater data diversity, and interdisciplinary dialogue between the NLP community and domain experts to ensure real-world applicability.
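To make the sequence-to-sequence framing concrete, below is a minimal sketch (not the paper's code) of how a reaction is turned into source and target token sequences over SMILES strings. The regex is the tokenization pattern commonly used for reaction SMILES in this line of work (popularized by the Molecular Transformer); the esterification example is illustrative, not taken from the paper.

```python
import re

# Tokenization pattern commonly used for (reaction) SMILES strings.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into model-ready tokens."""
    return SMILES_TOKEN.findall(smiles)

# Illustrative textbook esterification, framed as source -> target:
# acetic acid + ethanol -> ethyl acetate.
source = tokenize("CC(=O)O.OCC")
target = tokenize("CC(=O)OCC")

print(source)  # ['C', 'C', '(', '=', 'O', ')', 'O', '.', 'O', 'C', 'C']
print(target)  # ['C', 'C', '(', '=', 'O', ')', 'O', 'C', 'C']
```

A language model trained on such pairs never sees molecular structure directly, only token sequences, which is why dataset coverage and bias matter so much for what it can learn.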
2023
Byte Pair Encoding for Symbolic Music
Nathan Fradet | Nicolas Gutowski | Fabien Chhel | Jean-Pierre Briot
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e., converted into a sequence of discrete tokens. This can be achieved with different approaches, as music can be composed of simultaneous tracks and of simultaneous notes, each with several attributes. Until now, proposed tokenizations have relied on small vocabularies of tokens describing note attributes and time events, resulting in fairly long token sequences and sub-optimal use of the embedding space of language models. Recent research has focused on reducing the overall sequence length by merging embeddings or combining tokens. In this paper, we show that Byte Pair Encoding (BPE), a compression technique widely used for natural language, significantly decreases the sequence length while increasing the vocabulary size. By doing so, we leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The [source code is shared on GitHub](https://github.com/Natooz/bpe-symbolic-music), along with a [companion website](https://Natooz.github.io/BPE-Symbolic-Music). Finally, BPE is directly implemented in [MidiTok](https://github.com/Natooz/MidiTok), allowing the reader to easily benefit from this method.
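The following is a minimal sketch of the BPE mechanics the paper applies to music tokens, not MidiTok's actual implementation; the token names (`Pitch_60`, `Vel_90`, `Dur_8`) are illustrative note-attribute tokens of the kind a tokenizer such as those in MidiTok would produce.

```python
from collections import Counter

def bpe_merge_step(seq: list[str]) -> list[str]:
    """Merge the most frequent adjacent token pair into one new token."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            merged.append(a + "+" + b)  # new, more expressive token
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Illustrative note-attribute token sequence (two notes).
seq = ["Pitch_60", "Vel_90", "Dur_8", "Pitch_64", "Vel_90", "Dur_8"]
for _ in range(2):  # each merge shortens the sequence and grows the vocab
    seq = bpe_merge_step(seq)
print(seq)  # e.g. ['Pitch_60+Vel_90+Dur_8', 'Pitch_64', 'Vel_90+Dur_8']
```

Each merge trades vocabulary size for sequence length, which is exactly the effect the paper exploits: shorter sequences mean faster inference, while the larger vocabulary makes fuller use of the model's embedding space.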