Functional Lexicon in Subword Tokenization

Zachary William Hopton, Yves Scherrer, Tanja Samardzic


Abstract
The distinction between function and content units of the lexicon has been somewhat neglected in recent NLP work, but it could still be useful when working with low-resource languages, and, in particular, to improve cross-lingual transfer. In this paper, we investigate to what extent BPE subword tokenization can be used to identify units of the functional lexicon in a language without any annotated data. We analyze subword tokens in terms of their productivity and attempt to find thresholds that best distinguish function from content tokens. On a sample of seven diverse languages, we find that the best results are obtained with 50 BPE merges. We also show that this subword tokenization setting can be beneficial for the interlinear glossing task.
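As a rough illustration of the setup the abstract describes, the sketch below learns a small number of BPE merges from a toy word-frequency list and then counts, for each resulting subword, how many distinct word types it occurs in. That count is only an assumed stand-in for productivity, since the abstract does not say how productivity is measured or how the thresholds are chosen. The function names (`learn_bpe`, `merge_pair`, `type_productivity`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges=50):
    """Learn up to `num_merges` BPE merges from a {word: frequency} mapping.

    Returns the ordered list of merged pairs and the final segmentation
    of each word type. Stops early if no adjacent pairs remain.
    """
    # Every word type starts out segmented into single characters.
    segmented = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, symbols in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair (token-frequency weighted).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        segmented = {w: merge_pair(s, best) for w, s in segmented.items()}
    return merges, segmented


def type_productivity(segmented):
    """Count the distinct word types each subword appears in.

    ASSUMPTION: this is an illustrative proxy, not the paper's measure.
    """
    counts = Counter()
    for symbols in segmented.values():
        for token in set(symbols):
            counts[token] += 1
    return counts


# Toy data (hypothetical); a real experiment would use corpus counts.
toy_freqs = {"walked": 3, "talked": 2, "jumped": 2, "walks": 1}
toy_merges, toy_segmented = learn_bpe(toy_freqs, num_merges=1)
toy_productivity = type_productivity(toy_segmented)
```

With a single merge on this toy list, the suffix-like unit `ed` is merged first and occurs in three of the four word types. This is the kind of signal a productivity threshold could use to separate function-like subwords from content-like ones.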
Anthology ID:
2025.naacl-long.398
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
7839–7853
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.398/
Cite (ACL):
Zachary William Hopton, Yves Scherrer, and Tanja Samardzic. 2025. Functional Lexicon in Subword Tokenization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7839–7853, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Functional Lexicon in Subword Tokenization (Hopton et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.398.pdf