Abstract
The relation between the length of a text and the number of unique words is investigated using several Swedish language corpora. We consider a number of existing measures of vocabulary richness, show that they are not length-independent, and try to improve on some of them based on statistical evidence. We also look at the spectrum of values over text lengths, and find that genres have characteristic shapes.
- Anthology ID:
- 2023.nodalida-1.56
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 565–573
- URL:
- https://aclanthology.org/2023.nodalida-1.56
- Cite (ACL):
- Niklas Zechner. 2023. Length Dependence of Vocabulary Richness. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 565–573, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- Length Dependence of Vocabulary Richness (Zechner, NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/2023.nodalida-1.56.pdf