The Zipfian Challenge: Learning the statistical fingerprint of natural languages

Christian Bentz


Abstract
Human languages are often claimed to differ fundamentally from other communication systems. But what exactly unites them as a separate category? This article proposes to approach this problem – here termed the Zipfian Challenge – as a standard classification task. A corpus with textual material from diverse writing systems and languages, as well as other symbolic and non-symbolic systems, is provided. These data are subsequently used to train and test binary classification algorithms, which assign the labels “writing” and “non-writing” to character strings of the test sets. The performance is generally high, reaching 98% accuracy for the best algorithms. Human languages turn out to have a statistical fingerprint: large unit inventories, high entropy, and few repetitions of adjacent units. This fingerprint can be used to tease them apart from other symbolic and non-symbolic systems.
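To make the three fingerprint properties named in the abstract concrete, the following is a minimal sketch of how they might be measured on a single character string. The function name `fingerprint` and the exact feature definitions (number of distinct characters, unigram Shannon entropy in bits, and the share of adjacent positions holding the same character) are illustrative assumptions, not necessarily the measures used in the paper.

```python
import math
from collections import Counter


def fingerprint(text: str) -> dict:
    """Compute three simple statistics of a character string.

    Hypothetical helper for illustration only; the paper's own
    feature definitions may differ.
    """
    if not text:
        raise ValueError("text must be non-empty")

    counts = Counter(text)
    n = len(text)

    # Unit inventory: number of distinct characters in the string.
    inventory_size = len(counts)

    # Unigram Shannon entropy in bits per character.
    entropy_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Share of adjacent character pairs in which both characters are identical.
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    repetition_rate = repeats / (n - 1) if n > 1 else 0.0

    return {
        "inventory_size": inventory_size,
        "entropy_bits": entropy_bits,
        "repetition_rate": repetition_rate,
    }
```

Under these assumed definitions, a string such as `"aaaa"` has one distinct unit, zero entropy, and a repetition rate of 1, whereas `"abab"` has two units, one bit of entropy, and no adjacent repetitions. Feature vectors of this kind could then be passed to any standard binary classifier.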
Anthology ID:
2023.conll-1.3
Volume:
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Jing Jiang, David Reitter, Shumin Deng
Venue:
CoNLL
Publisher:
Association for Computational Linguistics
Pages:
27–37
URL:
https://aclanthology.org/2023.conll-1.3
DOI:
10.18653/v1/2023.conll-1.3
Cite (ACL):
Christian Bentz. 2023. The Zipfian Challenge: Learning the statistical fingerprint of natural languages. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 27–37, Singapore. Association for Computational Linguistics.
Cite (Informal):
The Zipfian Challenge: Learning the statistical fingerprint of natural languages (Bentz, CoNLL 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2023.conll-1.3.pdf
Software:
 2023.conll-1.3.Software.pdf
Video:
 https://preview.aclanthology.org/nschneid-patch-1/2023.conll-1.3.mp4