Incorporating Domain Knowledge into Materials Tokenization

Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee


Abstract
While language models are increasingly used in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. These methods often fragment material terms excessively and lose their meaning, failing to preserve the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates materials knowledge into tokenization. MATTER combines MatDetector, which is trained on our materials knowledge base, with a re-ranking method that prioritizes material terms during token merging. As a result, it keeps identified material concepts structurally intact and prevents their fragmentation during tokenization, so that their semantic meaning is preserved. Experimental results show that MATTER outperforms existing tokenization methods, achieving average performance gains of 4% and 2% on generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing.
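To make the idea of re-ranking token merges concrete, the following is a minimal sketch and not the authors' implementation. It assumes a BPE-style merge loop in which each adjacent symbol pair is scored by its frequency, scaled up when the pair occurs inside a word that a detector flags as a material concept. The names `material_prob` (a plain dictionary standing in for the output of a detector such as MatDetector), `boost`, `pair_scores`, `merge_pair`, and `train_merges` are all hypothetical, and the scoring formula is an illustrative assumption rather than the one used in the paper.

```python
from collections import Counter


def pair_scores(corpus, material_prob, boost):
    """Score adjacent symbol pairs by frequency, weighted by material probability.

    corpus: dict mapping a tuple of symbols (one word) to its frequency.
    material_prob: dict mapping a word to a probability in [0, 1] that it is
        a material concept (hypothetical stand-in for a trained detector).
    boost: assumed hyperparameter controlling how strongly material words
        are prioritized; boost=0 reduces to plain frequency counting.
    """
    scores = Counter()
    for symbols, freq in corpus.items():
        word = "".join(symbols)
        weight = 1.0 + boost * material_prob.get(word, 0.0)
        for left, right in zip(symbols, symbols[1:]):
            scores[(left, right)] += freq * weight
    return scores


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` in each word with the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_merges(word_freqs, material_prob, num_merges, boost=2.0):
    """Learn up to `num_merges` merges, picking the top re-ranked pair each step."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        scores = pair_scores(corpus, material_prob, boost)
        if not scores:  # every word is already a single symbol
            break
        # Ties are broken deterministically by the pair itself.
        best = max(scores, key=lambda p: (scores[p], p))
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus
```

For example, calling `train_merges({"abab": 5, "xy": 3}, {"xy": 1.0}, 1, boost=0.0)` selects the most frequent pair `("a", "b")` first, whereas the same call with `boost=3.0` selects `("x", "y")`, because the word flagged as a material term is now ranked above the more frequent but unflagged pair.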
Anthology ID:
2025.acl-long.474
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9623–9644
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.474/
Cite (ACL):
Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, and SangKeun Lee. 2025. Incorporating Domain Knowledge into Materials Tokenization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9623–9644, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Incorporating Domain Knowledge into Materials Tokenization (Oh et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.474.pdf