2025
How many words does it take to understand a low-resource language?
Emily Chang | Nada Basit
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
When developing language technology, researchers have routinely turned to transfer learning to resolve the data scarcity problem posed by low-resource languages. To our knowledge, this study is the first to evaluate the amount of documentation needed for transfer learning, specifically the smallest vocabulary size needed to create a sentence embedding space. Adopting widely spoken languages as a proxy for low-resource languages, our experiments show that the relationship between a sentence embedding’s vocabulary size and its performance is logarithmic, with performance leveling off at a vocabulary size of 25,000. It should be noted that this relationship cannot be replicated across all languages, and this level of documentation does not exist for many low-resource languages. We do observe, however, that performance accelerates at a vocabulary size of ≤ 1,000, a quantity present in most low-resource language documentation. These results can help researchers determine whether a low-resource language has enough documentation to support the creation of a sentence embedding and language model.
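As an illustration of the shape of relationship the abstract describes, the sketch below fits a logarithmic curve, performance ≈ a·ln(v) + b, to vocabulary-size/performance pairs. The data points and the scipy-based fitting routine are assumptions for illustration only; they are not the paper's method or its reported results.

```python
# Illustrative sketch only: fits a logarithmic model of the form
# performance = a * ln(vocab_size) + b, mirroring the relationship
# described in the abstract. The data points below are hypothetical,
# NOT numbers from the paper.
import numpy as np
from scipy.optimize import curve_fit

def log_curve(v, a, b):
    """Logarithmic model: performance as a function of vocabulary size."""
    return a * np.log(v) + b

# Hypothetical (vocabulary size, performance score) observations.
vocab_sizes = np.array([500, 1_000, 5_000, 10_000, 25_000, 50_000])
performance = np.array([0.42, 0.55, 0.68, 0.73, 0.78, 0.79])

(a, b), _ = curve_fit(log_curve, vocab_sizes, performance)
print(f"fit: performance ~= {a:.3f} * ln(v) + {b:.3f}")

# Under a logarithmic fit, gains shrink as vocabulary grows: doubling
# the vocabulary adds the same increment whether going from 500 to
# 1,000 words or from 25,000 to 50,000.
print(f"predicted at v=1,000:  {log_curve(1_000, a, b):.3f}")
print(f"predicted at v=25,000: {log_curve(25_000, a, b):.3f}")
```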