Tokenization Granularity and Medical Term Representations in Language Models

Vojtech Lanz, Pavel Pecina


Abstract
We investigate how tokenization granularity affects the representation of medical terminology in language models. Prior work links tokenization granularity to downstream performance under contextualized settings for specifically pretrained and fine-tuned models. We instead ask whether this relationship already emerges at the level of isolated term representations across existing pretrained models. We introduce an intrinsic definition retrieval task using UMLS term-definition pairs, with comparison to WordNet. We show that despite substantially heavier fragmentation of medical terminology, the models remain relatively robust in maintaining semantic alignment between medical terms and their definitions. At the same time, tokenization granularity still correlates with retrieval performance, indicating that effects previously observed in downstream biomedical tasks are already reflected at the level of isolated term representations. Encoder models benefit primarily from whole-token preservation, while for decoder LLMs, tokenization effects emerge mainly at deeper retrieval ranks.
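As a rough illustration of the two quantities the abstract relates, the sketch below computes how many subword pieces a term is split into, and how often a term's own definition is retrieved within the top k candidates. It is a toy under stated assumptions, not the authors' setup: the greedy longest-match segmenter, the hand-written vocabulary, the cosine-similarity ranking, and the names `segment`, `cosine`, and `recall_at_k` are all hypothetical. The abstract does not specify the tokenizers, embedding models, or retrieval metric used in the paper.

```python
import math


def segment(term, vocab):
    """Split `term` greedily into the longest pieces found in `vocab`.

    Characters not covered by any vocabulary entry become single-character
    pieces. This is a stand-in for a real subword tokenizer (assumption).
    """
    pieces, i = [], 0
    while i < len(term):
        # Try the longest candidate first; j == i + 1 is the one-character fallback.
        for j in range(len(term), i, -1):
            if term[i:j] in vocab or j == i + 1:
                pieces.append(term[i:j])
                i = j
                break
    return pieces


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(term_vecs, def_vecs, k):
    """Fraction of terms whose paired definition (same index) ranks in the top k."""
    hits = 0
    for i, term_vec in enumerate(term_vecs):
        ranked = sorted(
            range(len(def_vecs)),
            key=lambda j: cosine(term_vec, def_vecs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(term_vecs)


# Hypothetical toy vocabulary: a common word survives whole,
# while a medical term is fragmented into several pieces.
toy_vocab = {"heart", "card", "io", "my", "opathy"}
whole = segment("heart", toy_vocab)
fragmented = segment("cardiomyopathy", toy_vocab)
```

With real models, the term and definition vectors would come from an encoder or decoder LLM, and the number of pieces per term could then be compared against recall at several values of k, since the abstract reports that for decoder LLMs the effect appears mainly at deeper retrieval ranks.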
Anthology ID:
2026.bionlp-1.45
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
559–571
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.45/
Cite (ACL):
Vojtech Lanz and Pavel Pecina. 2026. Tokenization Granularity and Medical Term Representations in Language Models. In BioNLP 2026, pages 559–571, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Tokenization Granularity and Medical Term Representations in Language Models (Lanz & Pecina, BioNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.45.pdf